ChIPseqR: analysis of ChIP-seq experiments
© Humburg et al; licensee BioMed Central Ltd. 2011
Received: 3 January 2010
Accepted: 31 January 2011
Published: 31 January 2011
The use of high-throughput sequencing in combination with chromatin immunoprecipitation (ChIP-seq) has enabled the study of genome-wide protein binding at high resolution. While the amount of data generated from such experiments is steadily increasing, the methods available for their analysis remain limited. Although several algorithms for the analysis of ChIP-seq data have been published they focus almost exclusively on transcription factor studies and are usually not well suited for the analysis of other types of experiments.
Here we present ChIPseqR, an algorithm for the analysis of nucleosome positioning and histone modification ChIP-seq experiments. The performance of this novel method is studied on short read sequencing data of Arabidopsis thaliana mononucleosomes as well as on simulated data.
ChIPseqR is shown to improve sensitivity and spatial resolution over existing methods while maintaining high specificity. Further analysis of predicted nucleosomes reveals characteristic patterns in nucleosome sequences and placement.
In a related type of experiment a subset of nucleosomes is targeted through chromatin immunoprecipitation (ChIP) followed by high-throughput sequencing of ChIP fragments [5, 6]. This technique, commonly known as ChIP-seq, has also been used to investigate the binding of transcription factors [7, 8]. Due to the relatively short binding site of most transcription factors these studies typically produce wider, relatively isolated peaks in read counts compared to those obtained from nucleosome sequencing.
The increasing number of ChIP-seq and nucleosome positioning studies has led to the development of various approaches to analyse these data. The analysis typically starts by mapping short sequence reads to a reference genome, ignoring reads with non-unique alignments. To identify protein binding sites from these mapped reads, many commonly used methods generate a strand-independent profile of windowed read counts by recording the total number of reads on both strands that fall into a sliding window, or alternatively the number of overlapping extended reads for each position in the genome [5, 7–11]. In both cases the height of the read count profile corresponds to the total number of reads in the window, irrespective of strand. The read extension method implicitly uses separate windows for the two strands, combining the sequence reads in both windows into a single read count. Explicitly using two distinct windows to construct the read count profile allows for the inclusion of information about the length of the protein binding site. Peaks in the resulting read count profile are usually assessed for significance based on total read counts compared to a control sample [7, 9], a background model [8, 10] or a permutation of observed read counts. The use of hidden Markov models (HMMs) [5, 11] and kernel density estimators [12, 13] has been suggested as an alternative way to identify significant peaks in the read count profile. Although this strand-independent approach to ChIP-seq analysis has been used frequently with apparent success, it ignores the fact that DNA binding proteins are expected to generate a similar number of reads on both strands adjacent to the binding site.
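The strand-independent windowed profile described above can be sketched in a few lines. This is an illustrative implementation only, not code from any of the cited tools; the per-position counting scheme and the centred window are assumptions of the sketch.

```python
import numpy as np

def windowed_profile(fwd_starts, rev_starts, genome_len, window=100):
    """Strand-independent read count profile: at each position, the
    total number of reads from both strands whose start positions
    fall into a centred sliding window."""
    counts = np.zeros(genome_len)
    for s in list(fwd_starts) + list(rev_starts):
        counts[s] += 1
    # cumulative sums make each window total an O(1) lookup
    csum = np.concatenate(([0.0], np.cumsum(counts)))
    half = window // 2
    profile = np.empty(genome_len)
    for i in range(genome_len):
        lo, hi = max(0, i - half), min(genome_len, i + half + 1)
        profile[i] = csum[hi] - csum[lo]
    return profile
```

Because strand information is discarded when the counts are summed, two very different read configurations (balanced forward/reverse peaks versus a single-strand pile-up) can produce identical profiles, which is the limitation the strand-specific methods below address.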
The need for a new approach to ChIP-seq analysis that incorporates the characteristics of protein binding sites has been recognised and has led to the development of new strand-specific methods. Kharchenko et al. suggest three different approaches to utilise strand-specific read counts that outperform established strand-independent methods [7, 8]. The SISSRs algorithm of Jothi et al. considers the difference of read counts on both strands in a sliding window and locates potential protein binding sites by identifying sign changes in this net read count. Although these methods were developed with transcription factor analysis in mind, the methods of Kharchenko et al. are general enough to be extended to a nucleosome-related analysis. The SISSRs algorithm assumes that binding sites are isolated, i.e., peaks from neighbouring binding sites do not overlap. This is unlikely to be true for nucleosomes since they are expected to be located close to each other, which restricts SISSRs to the detection of extended regions of enrichment. This makes SISSRs unsuitable for the identification of individual nucleosome positions. A similar issue arises with the method proposed by Zang et al. Although their method was specifically designed to identify histone modifications from ChIP-seq data it focuses on the detection of broad regions of enrichment rather than individual nucleosome positions. Spyrou et al. propose a strand-specific HMM analysis of ChIP-seq experiments and demonstrate its utility on transcription factor and histone modification data. Although Spyrou et al. make a deliberate effort to obtain results at high resolution, the HMM framework limits the resolution that can be achieved by this method, which may be insufficient to distinguish between adjacent nucleosomes.
When sequencing nucleosomal DNA, either to determine nucleosome positioning or to identify specific histone modifications, it is important to realise that the binding site is substantially longer than for a typical transcription factor. While it may be acceptable to ignore the expected gap between peaks on forward and reverse strands when the binding site is small compared to the window used, as is often the case for transcription factor binding sites, this is not necessarily true for nucleosomes. Some studies choose the size of the sliding window to be the average DNA fragment length [5, 8, 9] while others choose a smaller window of 100 bp [7, 10]. It should be noted that these window sizes are substantially larger than the transcription factor binding sites in question, but similar in length to or shorter than nucleosome-bound DNA. The use of 1 kb windows [5, 9, 11] or even 1 Mb windows has been suggested. Although the increased window size alleviates the need to model the length of binding sites explicitly, it also leads to a notable decrease in resolution.
To locate nucleosomes at high resolution we propose to use an explicit model of protein binding site characteristics and of the resulting read patterns. This requires the model to be adjusted to the details of the experiment under consideration. In particular the impact of changes to the protocol used to isolate and purify nucleosomes has to be considered carefully. We demonstrate our approach to the high resolution analysis of ChIP-seq experiments by introducing ChIPseqR, an algorithm designed for the identification of nucleosomes. This method provides a number of parameters that allow it to be adjusted to different types of experiments but here we focus on the analysis of end-sequenced mononucleosomes after digestion with MNase. ChIPseqR is applied to simulated data on which it is shown to identify nucleosomes at high resolution while achieving better sensitivity and specificity than alternative methods. The favourable performance of ChIPseqR is confirmed through the application to end-sequenced mononucleosomes.
To reliably identify nucleosome positions it is necessary to model the read counts associated with a binding event as well as read counts in the absence of protein binding. A frequently used assumption is that background sequence reads, which are unrelated to binding events, are independently and uniformly distributed throughout the genome [4, 8, 15]. This implies that the number of these background reads X bg starting at each position of the genome follows a Poisson distribution with constant rate parameter λ. Examination of the read density in negative control samples suggests a heterogeneous distribution of background reads. To allow for this heterogeneity the use of a background model that incorporates changes in read density by assuming that λ follows a gamma distribution has been suggested by Ji et al.
For each position i the strand-specific local background rate can be estimated as the window average λ̂_bg(i) = x_bg(i)/w, where x_bg(i) is the observed number of sequence reads in window i on the relevant strand and w is the window length. This background model adapts to locally observed read densities without imposing a distribution on λ.
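As a concrete illustration of this local estimator, the sketch below computes λ̂_bg(i) as the average per-position read count in a centred window; the centring and edge handling are assumptions of the sketch rather than details taken from ChIPseqR.

```python
import numpy as np

def local_background_rate(read_counts, w):
    """Estimate the strand-specific background rate at each position
    as the mean per-position read count over a centred window of
    width 2*w + 1 (i.e. lambda_hat_bg(i) = x_bg(i) / window length)."""
    counts = np.asarray(read_counts, float)
    n = len(counts)
    csum = np.concatenate(([0.0], np.cumsum(counts)))
    rates = np.empty(n)
    for i in range(n):
        # window is clipped at the chromosome ends
        lo, hi = max(0, i - w), min(n, i + w + 1)
        rates[i] = (csum[hi] - csum[lo]) / (hi - lo)
    return rates
```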
Confirming the observations made by Kharchenko et al. we find that the background read density includes large peaks that are seemingly unrelated to the presence of nucleosomes. While the background estimation procedure described above adapts to changes in the background rate, the presence of large isolated peaks will lead to overestimation of the read rate for the surrounding area. A robust estimate of the background read rate is obtained by limiting the change in read rates between adjacent, non-overlapping, windows. If the change in the estimated rate relative to the previous window is positive we choose the robust estimate as the largest value j that remains consistent with the previous window's rate, for an appropriately chosen probability γ_bg.
A nucleosome will produce several DNA fragments in the sample. During the sequencing process short reads are produced from the 5'-ends of both strands of the DNA fragment population. The start position of sequence reads generated from this population corresponds to the start and end positions of DNA fragments in the reference genome. Although these may vary due to differences in fragment length and relative position of the nucleosome within the fragment, all read start sites will be located in proximity to the nucleosome but not within the stretch of histone bound DNA itself. This leads to a region of increased read density on the forward strand upstream of the nucleosome which is mirrored by a region of increased read density on the reverse strand downstream of the nucleosome. The nucleosome, located between the two peaks in read density, is relatively depleted of sequence reads. This creates a distinctive pattern that can be partitioned into three regions: forward support (fwd), binding (bind) and reverse support (rev) region (Figure 1). Note that the forward and reverse support regions only cover the respective strand, allowing for overlapping peaks from neighbouring binding sites.
Consider a binding site of length b = 2w bind + 1 starting at position i containing X bind(i) sequence reads from both strands, a support region on the forward strand of length s = 2w sup + 1 starting at position i - s containing X fwd(i) forward strand reads and a support region on the reverse strand of the same length starting at position i + b containing X rev(i) reverse strand reads. Here we assume that the read counts in these regions follow Poisson distributions, i.e., X region(i) ~ Poisson(λregion(i)), where region is one of fwd, bind or rev. Using the same assumptions as for the background distribution we consider X region(i) to be the sum of w region random variables Y region (j) ~ Poisson(Λregion (i)).
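The three region counts can be illustrated as follows. This is a hypothetical sketch using half-open intervals and per-position read-start counts, not the ChIPseqR implementation.

```python
import numpy as np

def region_counts(fwd, rev, i, b, s):
    """Read counts for a candidate binding site starting at i:
    forward support [i-s, i) on the forward strand, binding region
    [i, i+b) on both strands, reverse support [i+b, i+b+s) on the
    reverse strand (assumes s <= i and i + b + s <= genome length)."""
    fwd = np.asarray(fwd)
    rev = np.asarray(rev)
    x_fwd = int(fwd[i - s:i].sum())
    x_bind = int(fwd[i:i + b].sum() + rev[i:i + b].sum())
    x_rev = int(rev[i + b:i + b + s].sum())
    return x_fwd, x_bind, x_rev
```

A well-positioned nucleosome should yield large x_fwd and x_rev with a comparatively small x_bind, which is the pattern the scoring procedure below rewards.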
for a suitably chosen γ, to calculate W fwd. The truncated estimate for the reverse support region is defined similarly.
Note that this score is only meaningful if there is at least one read in each of the two support regions.
To assess the significance of nucleosome scores the distribution of S under the null hypothesis that no nucleosome starts at position i has to be determined. The nucleosome score for each position in the genome is the sum of the scores for the three regions of the potential nucleosome position (Equation 5) and the score for each region is derived from the likelihood ratio statistic W region (Equations 3 and 4). If the null hypothesis is true each W region asymptotically follows a χ² distribution with one degree of freedom. This might suggest the use of the corresponding normal approximation to model the distribution of binding site scores under the null hypothesis. However, it is important to realise that, even under the null hypothesis, S fwd and S rev are unlikely to be independent due to the use of truncated rate parameter estimates, and the asymptotic results may not hold. To avoid misleading p-values that may result from the use of an incorrectly specified null distribution we generalize the above model to a normal distribution whose location is the sample median of the nucleosome scores and whose scale σ is estimated from the observed nucleosome scores. To obtain a good fit for the null distribution, and thus reliable p-values, it is necessary to identify a subset of scores that is representative of the null hypothesis.
Observing that the presence of nucleosomes will lead to larger nucleosome scores we truncate S at the τth and 50th percentile, where 0 ≤ τ < 50 is chosen to exclude outliers in the lower tail while retaining a sufficiently large number of observations. This ensures that the vast majority of scores relating to protein binding events have no influence on the parameter estimation while also guarding against extreme values in the lower tail, which are observed when calculating nucleosome scores for positions that place unusually large numbers of sequence reads in the binding region. A maximum likelihood estimate for σ is obtained by fitting a truncated half-normal distribution to the selected subset of nucleosome scores.
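A simplified version of this fitting step might look as follows. The grid-search MLE for a half-normal truncated at the τth percentile is an assumption of this sketch; the exact likelihood and optimiser used by ChIPseqR may differ.

```python
import math
import numpy as np

def fit_null_sigma(scores, tau=5.0):
    """Fit the scale of the null distribution from scores between the
    tau-th and 50th percentile. Deviations below the median are
    modelled as a half-normal truncated at c = median - tau-th
    percentile; sigma is found by grid-search maximum likelihood."""
    scores = np.asarray(scores, float)
    m = np.median(scores)
    lo = np.percentile(scores, tau)
    sub = scores[(scores >= lo) & (scores <= m)]
    d = m - sub                     # non-negative deviations
    c = m - lo                      # upper truncation point
    def negloglik(sigma):
        ll = (-0.5 * (d / sigma) ** 2).sum() - len(d) * math.log(sigma)
        # normalise by the truncated half-normal mass on [0, c]
        ll -= len(d) * math.log(math.erf(c / (sigma * math.sqrt(2.0))))
        return -ll
    grid = np.linspace(c / 10.0, 4.0 * c, 400)
    return float(grid[np.argmin([negloglik(s) for s in grid])])
```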
The p-values for all observed nucleosome scores are calculated based on the estimated null distribution and corrected for multiple testing. To control the false discovery rate (FDR) we use a Benjamini-Hochberg procedure. If the data contain many nucleosomes it may be beneficial to modify this procedure to account for the estimated proportion of true null hypotheses. This is done using the αAFDR approach, which has been shown to be equivalent to the procedure proposed by Storey et al.
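The standard Benjamini-Hochberg step-up procedure referred to here can be sketched as:

```python
import numpy as np

def benjamini_hochberg(pvals, alpha=0.05):
    """Benjamini-Hochberg step-up procedure: reject the hypotheses
    with the k smallest p-values, where k is the largest index such
    that p_(k) <= k * alpha / n. Returns a boolean rejection mask."""
    p = np.asarray(pvals, float)
    n = len(p)
    order = np.argsort(p)
    thresh = alpha * np.arange(1, n + 1) / n
    below = p[order] <= thresh
    reject = np.zeros(n, dtype=bool)
    if below.any():
        k = int(np.max(np.nonzero(below)[0]))
        reject[order[:k + 1]] = True
    return reject
```

Note the step-up character of the procedure: a p-value above its own threshold can still be rejected if a larger p-value further down the sorted list falls below its threshold.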
To assess the performance of our method we compared it to a number of other peak-finding methods on simulated data. The three methods proposed by Kharchenko et al., mirror strand peaks (MSP), mirror tag correlation (MTC) and window tag density (WTD), were chosen for comparison because they allow for strand-specific peaks in read counts and employ different strategies to handle the expected gap between peaks. Briefly, WTD calculates a binding score based on the number of forward and reverse strand tags within a window upstream and downstream of a potential binding site. While this utilises the fact that a protein binding event should produce a peak in forward strand read counts upstream and a peak in reverse strand read counts downstream of the binding site, WTD does not require the two peaks to be similar in magnitude nor is the gap between peaks considered. MTC is similar to WTD but uses the correlation between forward strand and reverse strand reads to introduce a measure of similarity between the two peaks. Sequence reads close to the centre of the binding site are ignored when calculating the correlation coefficient, which effectively introduces a gap between the two peaks. A Gaussian smoothing kernel is used by MSP to estimate local read densities for both strands and binding sites are required to lie between peaks of comparable magnitude. The R implementation of these methods provides the option to estimate the window size from the cross-correlation between the two strands. We found that this does not produce satisfactory results for the nucleosome data under consideration and used a 160 bp window instead. Unlike the methods proposed by Kharchenko et al., GeneTrack was specifically designed for the detection of nucleosome positions. Like MSP, GeneTrack uses a Gaussian kernel to smooth read counts. The algorithm then proceeds to score each peak based on its height and reports the highest scoring set of non-overlapping peaks.
GeneTrack makes no attempt to assess the significance of peaks, which results in a high number of nucleosome predictions but is likely to lead to many false positives if the data contains regions that are depleted of nucleosomes. To fairly evaluate GeneTrack's performance relative to methods that control the false discovery rate we choose a cut-off that leads to a similar number of significant predictions as the one produced by ChIPseqR. A further peculiarity of GeneTrack is that nucleosome predictions are centred on the peak in read counts. This results in predictions that are shifted by approximately half a nucleosome length compared to the actual position of the nucleosome. This inaccuracy was corrected for the purpose of this comparison. ChIPseqR was used with a binding region of 128 bp, support regions of 17 bp and a background window with w = 1000. These settings were found to perform well on real data (see Methods) and were used here for consistency.
Table 1: Number of nucleosomes identified on simulated data by different methods at three levels of coverage (3 M, 6 M and 10 M reads).
To further investigate how reliably different methods identify stable nucleosomes we consider the stable nucleosomes identified from different samples at the same level of coverage (Table 1). For each of the three samples the stable nucleosomes identified by a given method are recorded such that X i (j) = 1 when the j th stable nucleosome was detected in sample i and 0 otherwise. We assess a method's ability to reliably identify the same stable nucleosomes from different samples by computing the total correlation C tot(X 1, X 2, X 3) between the three samples. This provides a measure of the information shared between the samples, i.e. it will be maximised if the same set of nucleosomes is identified for each sample. To obtain a measure of repeatability that is comparable between methods and different levels of coverage regardless of the number of identified nucleosomes, we divide C tot(X 1, X 2, X 3) by C max(X 1, X 2, X 3) which is calculated as the total correlation of a permutation of X 1, X 2, X 3 that maximises the number of repeated nucleosome detections. This normalises the repeatability such that it lies in [0, 1]. Over 50% of the stable nucleosomes predicted by ChIPseqR from the 10 M read samples were successfully identified in all three samples which corresponds to a repeatability of 0.31. None of the other methods achieved a repeatability of more than 0.16. It is worth noting that while MSP produces a relatively large number of predictions they tend to differ substantially between samples, leading to a very low repeatability.
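The total correlation of the detection indicators can be computed from marginal and joint entropies. The sketch below illustrates C_tot only and omits the normalisation by C_max described above.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (bits) of the empirical label distribution."""
    _, counts = np.unique(np.asarray(labels), return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def total_correlation(x1, x2, x3):
    """C_tot(X1, X2, X3) = H(X1) + H(X2) + H(X3) - H(X1, X2, X3),
    computed from three equal-length detection indicator vectors."""
    # encode each joint outcome as a string label for counting
    joint = [f"{a}{b}{c}" for a, b, c in zip(x1, x2, x3)]
    return entropy(x1) + entropy(x2) + entropy(x3) - entropy(joint)
```

Identical detection vectors maximise the statistic, while independent vectors drive it to zero, which is why it serves as a repeatability measure here.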
Another important performance measure is the resolution at which predictions are made, i.e., how close predicted nucleosome positions are to actual nucleosome locations. Peak-finding algorithms like the ones considered here typically involve a smoothing step to elucidate peaks. This comes at the cost of reducing the resolution of the raw data. To measure the resolution achieved by the four methods in this comparison we examined the distance between predicted stable nucleosomes and the underlying stable nucleosome positions (Figure 2). The best resolution was achieved by GeneTrack followed by MTC and ChIPseqR. While WTD appeared to produce predictions at an acceptable resolution, the low sensitivity of this method made an accurate assessment of the resolution difficult. The resolution of MSP was clearly the lowest in this comparison, limiting its usefulness for localising individual nucleosomes.
To further investigate the utility of the proposed method we applied it to end sequenced mononucleosomes from Arabidopsis thaliana. Sequence reads were generated through Solexa sequencing of DNA fragments. Mononucleosome-sized DNA fragments were selected for sequencing after MNase digestion. Approximately 8 million uniquely mapped reads were used for the analysis.
It has been suggested based on theoretical considerations as well as empirical findings that stable nucleosomes are associated with a periodic pattern of dinucleotides [1, 2, 22]. To assess whether similar patterns play a role in nucleosome positioning in Arabidopsis and whether they can be detected based on the stable nucleosomes predicted by our method, we used the over 2200 non-overlapping nucleosome positions identified with b = 147, s = 10 to investigate the properties of nucleosome sequences. To this end the position specific frequency of all dinucleotides was determined in a ±500 bp window around the centre of predicted nucleosomes (see Methods).
Nine genomic regions containing 13 nucleosome predictions with different levels of confidence were selected for qPCR verification. Out of the 13 predicted nucleosomes 11 were verified (see Methods).
The DNA abundance profiles obtained through qPCR show several well defined peaks. Many of these correspond to predicted nucleosomes while others appear to suggest the presence of adjacent nucleosomes that were not predicted by the initial analysis. To investigate this further we identified peaks in DNA abundance that may correspond to positions of additional nucleosomes that do not overlap predicted nucleosomes and tested them for significance using the same procedure as above. This results in the detection of additional nucleosomes in 10 of 14 potential nucleosome positions. The remaining four locations are characterised by flat DNA abundance profiles and may correspond either to nucleosome free regions or areas of delocalised "fuzzy" nucleosomes (see additional file 1: Figure S1).
An R package implementing the method described here is freely available from Bioconductor and from the authors' web page at http://www.bioinformatics.csiro.au/ChIPseqR together with simulated datasets.
Unlike the majority of existing methods for the analysis of ChIP-seq experiments the novel approach presented here is suitable for the detection of nucleosome positions. The explicit modelling of nucleosomes increases the sensitivity and makes it possible to detect their positions at low coverage levels. Combined with the relatively small amount of smoothing required for the analysis this enables ChIPseqR to reliably identify nucleosomes at high resolution. This is especially important for the analysis of nucleosome positions and histone modifications since this typically requires the sequencing of a much larger proportion of the genome than would be required for a transcription factor study, which results in lower coverage for the same amount of sequencing. Furthermore, nucleosome studies aim to distinguish binding sites that are usually located close to each other and may vary in location throughout the sample. It is therefore important to identify binding sites at high resolution and to allow for partially overlapping peaks in read counts associated with adjacent binding sites.
Although all methods compared here use strand specific read counts and all except WTD are able to handle overlapping peaks from adjacent binding sites only ChIPseqR achieved good sensitivity and high resolution. The high resolution achieved by ChIPseqR is further highlighted by the detection of periodic dinucleotide patterns in the DNA sequence of predicted binding sites. Strong evidence of these patterns was found without any further sequence alignment, suggesting that the predicted positions of stable nucleosomes are highly accurate. This result is also indicative of high specificity since false positives would dilute any sequence motifs. It should be noted that all methods in the comparison would benefit from a further increase in coverage. We expect that differences in performance will diminish as coverage saturates but achieving sufficient coverage to reach saturation is still costly which makes it desirable to obtain reliable results from low coverage data.
Another performance measure of interest is an algorithm's ability to produce consistent results for repeated experiments. This is of particular interest when low coverage data is used since this is likely to result in the prediction of only a subset of all binding sites. Despite this one would like to obtain results that do not depend too much on the details of a particular sequencing run, i.e., the analysis should be robust towards moderate changes in read position. This is important to enable researchers to draw reliable conclusions from the analysis that can be expected to hold when the experiment is repeated. Although all methods in the comparison perform better than random, indicating that they all provide a core of consistently predicted nucleosomes, ChIPseqR provides substantially higher repeatability than any other method at all levels of coverage.
The regions chosen for qPCR verification were selected to encompass predictions with a range of FDRs. The fact that all low confidence predictions were verified suggests good specificity even for relatively high values of the nominal FDR. Consequently the FDR cut-off chosen for the initial analysis may be considered conservative.
When discussing the performance of ChIPseqR it is important to realise that this depends on a number of parameters. Apart from the coverage provided by a given dataset, some of the model parameters will impact on the performance characteristics. Choosing b, the length of the binding region, close to the actual binding site length will result in high resolution and is expected to improve specificity. However, this will reduce the method's ability to detect less stable binding sites, thus reducing sensitivity. Such unstable binding sites can be detected more reliably if b is reduced while the length of the support regions is increased. This increases sensitivity but may lead to reduced resolution. It should be noted that the choice of b = 128 that was determined to be optimal for the data considered here is less than the 147 bp that may be expected for a nucleosome. However, this reduced binding region is consistent with the presence of alternative preferred positions of a stable nucleosome located ±10 bp from a central position that have been observed previously.
Another important factor is the estimation of the background read rate λbg. There are at least two different approaches to the estimation of this parameter. The one taken here relies on choosing a large window relative to the binding site which is applied to the same sample that contains the binding sites. The relatively large size of the background window results in an average read rate close to a base level even if there are binding sites within the window. However, the presence of binding sites will increase the estimated background rate above the actual background, thus producing conservative binding site scores. An alternative approach would use a separate control sample to assess the background read rate. This has the potential to produce more accurate results and is generally recommended but we note that it may be difficult and costly to obtain good control samples for this purpose. Even when a suitable control sample is available the issue of an appropriate normalisation method for this type of data has not been settled; some problems with current approaches have been discussed elsewhere. We therefore consider the single sample method presented here to be more practical at the moment, especially for nucleosome sequencing.
The model proposed here was specifically designed for the analysis of end sequenced nucleosomes after MNase digestion. Although several ChIP-seq analysis algorithms are available for tasks like this they are typically designed for the analysis of transcription factor binding. Our analysis demonstrates that it is beneficial to take the substantial differences between transcription factor binding and nucleosome positioning experiments into account explicitly. Although some existing methods may be general enough to handle both transcription factor ChIP-seq and nucleosome sequencing data they will be outperformed by more specialised methods, especially on low coverage data. Although GeneTrack was designed for the analysis of nucleosome experiments and has been used for this purpose with apparent success, it does not take advantage of the characteristics of nucleosome experiments. This results in a performance similar to the general ChIP-seq methods.
The main advantages of the model presented here are the high resolution of nucleosome predictions and its ability to operate on low coverage data. This makes it especially suitable for the analysis of nucleosome positions in situations where it is difficult to obtain high coverage, i.e. many reads per nucleosome position, because the genome under consideration is large or the sample contains a mixture of many cells (and therefore many different nucleosome configurations).
The most important parameters of our model, b and s, are closely related to the underlying biology and length of DNA fragments used for sequencing. This makes ChIPseqR a flexible and versatile method for the analysis of nucleosome positioning and makes it possible to adapt the method to a range of different experiments. We have demonstrated its favourable performance in the context of nucleosome positioning experiments and similar performance is expected for other nucleosome related experiments, such as histone modification studies. More generally it should be possible to apply the approach demonstrated here to other ChIP-seq experiments. We note however that the binding site model should be adjusted to accommodate the differences in experimental procedures, especially when DNA is fragmented through sonication rather than MNase digestion.
Data were simulated using the ChIPsim R package (available at http://www.bioinformatics.csiro.au/ChIPsim and from Bioconductor) with default parameters. Briefly, ChIPsim generates a sequence of nucleosome features to cover a given genome. Each type of feature has its own characteristics, corresponding to different chromatin structures commonly observed in nucleosome positioning experiments. Here we distinguish between 'stable', 'phased', 'fuzzy' and 'nucleosome free' regions. The simulation assumes that sequence reads are generated from a population of cells, i.e., nucleosome positions are represented by distributions rather than discrete locations.
For this study we generated nucleosome features to cover the Arabidopsis genome. These features include 13,758 stable nucleosomes, 16,313 regions of phased nucleosomes covering 82 Mb, 6,360 regions of fuzzy nucleosomes covering 32 Mb and 11,293 nucleosome free regions covering 1.3 Mb. A dataset containing 30 million sequence reads, each 36 bp long, was generated from the TAIR 8 assembly of the genome ftp://ftp.arabidopsis.org/Sequences/wholechromosomes/. From this, smaller datasets with three, six and ten million reads were created, with three replicates at each level of coverage. All simulated datasets are available online from http://www.bioinformatics.csiro.au/ChIPseqR.
To generate the mononucleosome fragments for end-sequencing and qPCR verification a lysate was prepared by grinding 1 g total seedling material of Arabidopsis ecotype Columbia to a fine powder in liquid nitrogen. The ground material was added to 10 ml lysis buffer (50 mM HEPES pH 7.5, 150 mM NaCl, 1% (v/v) Triton TX-100, 0.1% (w/v) sodium deoxycholate, 0.1% (w/v) sodium dodecyl sulphate, 1× Plant Proteinase Inhibitor mix (Sigma, St Louis, MO)). Aliquots of 1 ml of the lysate were digested with 7.5 to 30 units micrococcal nuclease (Fermentas, Vilnius, Lithuania) for 15 minutes at 37°C, followed by proteinase K treatment and phenol extraction. DNA fragments were separated by agarose gel electrophoresis and the fragments in the size range of mononucleosome cores (~150 bp) were excised and purified. The purified mononucleosome cores were then used to generate a library for Illumina GS sequencing (carried out by Geneworks, Adelaide, Australia) or used for qPCR quantification. The data from Illumina GS sequencing are archived at the NCBI Sequence Read Archive (SRA) under accession number SRP001458.
Nine genomic regions containing a subset of predicted nucleosomes were selected for qPCR verification. Regions were chosen to cover predictions with FDR values between 0 and 0.2 and large enough to contain between ten and twelve tiled amplicons. The resulting regions are between 438 bp and 563 bp in length and contain a total of 13 nucleosome predictions. Of these predictions four have p-values between 0 and 0.01, five have p-values between 0.01 and 0.05 and the remaining four predictions have p-values between 0.05 and 0.2.
Number of predicted and verified nucleosomes in three significance bands
DNA protected from micrococcal nuclease digestion was quantified using an Applied Biosystems 7900HT real-time PCR machine. PCR conditions were 1× PCR buffer (Platinum Taq buffer, Invitrogen, CA), 3.5 mM MgCl2, 0.2 mM dNTP mix, 0.4 μM oligonucleotide primers, 1:20000 (v/v) SybrGreen, 0.025 U/μl Platinum Taq. Nucleosome core samples were quantified against undigested genomic DNA standards.
The binding site model and scoring procedure described above require the specification of several parameters. In this section we discuss possible choices of parameter values and their impact on the analysis. The parameters of the model itself are the length of the binding site b, the length of the support region s and the length of the background window w. The method used to score potential binding sites introduces a number of additional parameters, namely γbg and γsup to specify the cut-off in difference between rates for the background and supporting region read rate estimates and the cut-off τ used to exclude outliers when fitting the null distribution.
To obtain a better understanding of how these parameters affect the analysis we considered a range of parameter combinations. The performance of different sets of parameter values was assessed based on the number of binding sites identified in the mononucleosome data and accompanying control sample where the latter are assumed to be false positives. We note that the coverage in the control sample is substantially lower than in the treatment. This may have an impact on the number of identified binding sites but we expect the general trends identified here to hold for control samples with higher coverage. In the following analysis we used a nominal false discovery rate of 0.05 and τ = 0.95. Since the length of nucleosome-bound DNA is known to be about 147 bp we initially chose b = 147 and s = 15. To investigate the effect of γbg and γsup we predicted binding sites using values of 0.5, 0.9, 0.99 and 1 for each of these parameters while the width of the background window was set to 1001 bp, 2001 bp and 3001 bp. The number of binding sites predicted in the treatment and control sample for all parameter combinations were recorded and compared to determine a set of parameters that provides high specificity and sensitivity. At this stage of the analysis we are mainly concerned with identifying parameter values that keep false positive predictions to a minimum while maintaining an acceptable number of predicted binding sites in the treatment sample, i.e. the focus is on specificity. To assess specificity we considered the ratio between the number of predicted binding sites in the control and treatment. Lower values of this ratio indicate higher specificity.
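The parameter screen described above might be organised as follows; the callable interface, parameter grids and ranking by the control/treatment ratio are a hypothetical sketch, not the ChIPseqR API.

```python
from itertools import product

def screen_parameters(count_predictions, gammas=(0.5, 0.9, 0.99, 1.0),
                      windows=(1001, 2001, 3001)):
    """Rank parameter combinations by the ratio of control to
    treatment predictions (lower ratio = higher specificity).
    `count_predictions(g_bg, g_sup, w)` must return the pair
    (n_treatment, n_control) for one parameter combination."""
    results = []
    for g_bg, g_sup, w in product(gammas, gammas, windows):
        n_treat, n_ctrl = count_predictions(g_bg, g_sup, w)
        if n_treat > 0:                 # ratio undefined otherwise
            results.append(((g_bg, g_sup, w), n_ctrl / n_treat))
    return sorted(results, key=lambda r: r[1])
```

In practice `count_predictions` would wrap a full ChIPseqR run on the treatment and control samples; here it is left abstract so the ranking logic stands on its own.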
If the two rate parameters are indeed equal, the asymptotic distribution of W is known to be χ² with one degree of freedom.
The authors would like to thank Micheal Buckley for his insightful comments on the manuscript. PH is supported by an E-IPRS scholarship from Macquarie University and a top-up scholarship from CSIRO. Additional funding was provided by CSIRO's Transformational Biology Capability Platform.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.