- Research article
- Open Access
Accounting for immunoprecipitation efficiencies in the statistical analysis of ChIP-seq data
© Bao et al.; licensee BioMed Central Ltd. 2013
- Received: 22 October 2012
- Accepted: 21 May 2013
- Published: 30 May 2013
ImmunoPrecipitation (IP) efficiencies may vary largely between different antibodies and between repeated experiments with the same antibody. These differences have a large impact on the quality of ChIP-seq data: a more efficient experiment will necessarily lead to a higher signal to background ratio, and therefore to an apparent larger number of enriched regions, compared to a less efficient experiment. In this paper, we show how IP efficiencies can be explicitly accounted for in the joint statistical modelling of ChIP-seq data.
We fit a latent mixture model to eight experiments on two proteins, from two laboratories where different antibodies are used for the two proteins. We use the model parameters to estimate the efficiencies of individual experiments, and find that these are clearly different for the different laboratories, and amongst technical replicates from the same lab. When we account for ChIP efficiency, we find more regions bound in the more efficient experiments than in the less efficient ones, at the same false discovery rate. A priori knowledge of the same number of binding sites across experiments can also be included in the model for a more robust detection of differentially bound regions among two different proteins.
We propose a statistical model for the detection of enriched and differentially bound regions from multiple ChIP-seq data sets. The framework that we present accounts explicitly for IP efficiencies in ChIP-seq data, and allows replicates and experiments from different proteins to be modelled jointly, rather than individually, leading to more robust biological conclusions.
- False Discovery Rate
- Mixture Model
- Bayesian Information Criterion
- CREB Binding Protein
- Enrich Region
ChIP-sequencing, also known as ChIP-seq, is a recently established technique to detect protein-DNA interactions in vivo on a genome-wide scale. ChIP-seq combines Chromatin ImmunoPrecipitation (ChIP) with massively parallel DNA sequencing to identify all DNA binding sites of a Transcription Factor (TF) or genomic regions with certain histone modification marks. The ChIP process captures cross-linked and sheared DNA-protein complexes using an antibody against a protein of interest. After decrosslinking of the protein-DNA complexes, the final DNA pool is enriched in DNA fragments bound by the protein of interest, but there are always random genomic DNA fragments piggybacking on the specific DNA fragments. The degree of enrichment depends on the ChIP efficiency. A more efficient experiment will induce a higher proportion of protein-bound fragments in the mixture pool, and generate more sequence reads in bound regions and fewer sequence reads in non-bound regions, than an experiment with lower ChIP efficiency. As a result, the more efficient experiment will have more power to discriminate between bound and non-bound genomic regions and will generally show a larger number of bound regions.
The antibody used is the most critical factor affecting ChIP efficiency. However, different ChIP efficiencies are also observed between different batches when using the same antibody, since ChIP protocols are notoriously difficult to standardize and control. In general, we may encounter three relevant scenarios where differences in ChIP efficiencies play a role: (i) the comparison of bound regions between two experimental conditions subjected to ChIPs with the same antibody but with variable efficiencies; (ii) the comparison of bound regions of the same TF or marked with the same histone modification but profiled with different antibodies; (iii) the comparison of bound regions from two different TFs or marked with different histone modifications, profiled with different antibodies. When making comparisons without considering the ChIP efficiencies, the number of overlapping regions may be underestimated while the number of differentially bound regions may be overestimated. A number of methods have been proposed recently for comparative analyses of ChIP-seq data, e.g. [3-9]. In general, there is recognition in the literature of the different specificities associated with different antibodies used in ChIP-seq experiments, and attempts are made to account for these in the analysis. These are often in the form of a pre-selection of regions for the analysis: in [3, 6] only regions with high signal to background ratios are used for further analyses and normalization procedures, whereas elsewhere the normalization is performed only on commonly enriched regions. A control experiment is often used to aid the detection of truly enriched regions (e.g. in PeakSeq and W-ChIPeaks). Overall, however, a formal definition of ChIP efficiency is lacking, and there is limited focus on how ChIP efficiency affects the interpretation of the results and how it should be fully accounted for in the statistical analysis of the data, and consequently in the detection of enriched and differentially bound regions.
In this paper, we address these issues using ChIP-seq data from a number of experiments conducted by different laboratories on two highly similar but different proteins.
P300 and the CREB binding protein (CBP) are two Histone AcetylTransferases (HATs) which are transcription co-activators for a broad range of genes involved in various cellular processes. P300 and CBP have highly similar roles in transcriptional activation, but also differ in some aspects that are still not fully understood. This is reflected by the large but incomplete overlap in p300 and CBP binding sites in the genome [13, 14]. In one of these ChIP-seq studies, it is known that the antibody specificity for the p300 experiments is higher than for the CBP experiments. Using a Fisher exact test, the authors find that the number of regions preferentially bound by p300 is much greater than the number of regions preferentially bound by CBP. In the other study, two experiments are conducted on the same two proteins, but using a different cell line. In this case, the antibody specificity for the CBP experiment is known to be higher than the one for p300. Consequently, the number of regions preferentially bound by p300 found by this study is much smaller than the number of regions associated only with CBP. Despite the different experimental set-ups of the two studies, these results suggest that the differences in ChIP efficiencies associated with the antibodies used can have a major impact on the findings of regions that are differentially bound by CBP or p300, and may mask the real heterogeneity between the two HATs and the two cell types studied. Hence, there is a need to explicitly account for these differences in the statistical analysis and interpretation of the results.
A large number of statistical methods have been developed in the last few years for modelling ChIP-seq data. The majority of these concentrate on the detection of peak-type profiles such as the ones generated by DNA-binding TFs. Others are proposed for detecting genomic regions with broader signals, such as those bound by RNA Polymerase II or marked with specific histone modifications [15, 16]. If no control experiment is available (e.g. a ChIP experiment with a non-specific IgG control antibody), a general strategy is to model the background read distribution and then assign a statistical significance cut-off for the detection of candidate peaks or enriched regions using either analytical or simulation approaches. One popular model for the background is given by the Poisson or Negative Binomial (NB) distributions, which are used by a number of available software packages (FindPeak, USeq, CisGenome, SISSRs). An alternative to the global Poisson or NB models is to use local Poisson models (e.g. MACS and ChIPseqR), mixtures of Poisson/NB models (e.g. MOSAiCS) or more advanced hidden Markov mixture or random field models (e.g. BayesPeak, HPeak and iSeq).
In this paper, we use a latent mixture model, as described in the Methods section, and show how this model accounts for the ChIP efficiency of an experiment, by modelling an appropriate signal to background ratio. The general idea is that the different components of the mixture model give flexibility to model both well separated signal and background components (i.e. efficient experiments) and more overlapping components (i.e. less efficient experiments). A formal definition of ChIP efficiency is given, which can be easily extended to mixture models of more than two components. Therefore, other methods based on mixture modelling, such as the ones mentioned above, could be used within the same framework described in this paper. The fact that different experiments, even technical replicates from the same lab, can have different IP efficiencies has probably been the main reason why, to date, statistical models of ChIP-seq data sets, and their corresponding implementations, have been developed for individual experiments. In the presence of technical or biological replicates, the results from the different analyses are subsequently combined to increase the robustness in the detection of regions and to circumvent the problem of different signal to background ratios. One major contribution of this paper is to show how a mixture model framework that explicitly accounts for ChIP efficiencies can be used to perform a joint analysis of ChIP data from multiple experiments on different proteins, leading to a more robust detection of enriched and differentially bound regions.
Joint modelling of ChIP-seq data with multiple replicates and different IP efficiencies
The analyzed material from the immunoprecipitation step of a ChIP-seq experiment is always a mix of fragments bound by the transcription factor (true signal) and random background fragments (background signal). Furthermore, the majority of regions in the genome is not enriched and should therefore contain only background signal. We would generally expect that the bin counts reflect this mixture pattern. That is, some bins are enriched regions with a lot of tags (possibly a ’peak’ for TF binding) and most other bins are not enriched, containing only few tags. This motivated us to assume a mixture model framework for the counts. The model that we present in this paper does not make any use of peak information and is therefore more suitable for the detection of broad regions, such as those marked with histone modifications.
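The mixture model for the count $Y_{mcji}$ in bin m, under condition c, with antibody j and replicate i, takes the form

$$Y_{mcji} \sim p_c \, f^{S}_{cji}(y) + (1 - p_c) \, f^{B}_{cji}(y),$$

where $p_c = P(X_{mc} = 1)$ is the mixture proportion of the signal component, $X_{mc}$ is the latent indicator of enrichment of bin m under condition c, and $f^{S}_{cji}$ and $f^{B}_{cji}$ are the signal and background densities for condition c, antibody j and replicate i, respectively.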
Using a mixture model allows the signal and background components in the data to be separated: this is particularly important when different ChIP efficiencies are observed, as these will induce a different signal to background ratio. The different parameters of the mixture components capture the different IP efficiencies of individual experiments, whereas the parameter $p_c$, which does not depend on the ChIP efficiencies, allows technical and biological replicates with the same or different antibodies to be properly combined. This is not normally done in the literature; rather, different analyses are performed for different experiments and the detected regions are combined at a second stage, e.g. [5, 6]. The constant $k_{cji}$ is a non-negative value that represents the minimum observable tag count in an enriched region and is used to provide greater flexibility to the two-component mixture model, particularly in the presence of a large proportion of zeros. In [19, 23] this offset is set equal to some pre-specified value and the same value is used for all experiments. However, this assumption does not seem to be supported by the data, where the value of the offset k may also depend on the library size and on the different signal to background ratios of the experiments. We therefore opted to keep this parameter free in our maximum likelihood procedure and to estimate it from the data.
We fit this model to the p300 and CBP datasets described in the Methods section, using the EM procedure outlined in the same section for parameter estimation. The input to the model is count data from all ChIP-seq datasets considered, together with information on which experiments are replicates. The output of the model is the estimates of all the parameters, that is $p_c$, $\theta^{S}_{cji}$ and $\theta^{B}_{cji}$ for all c, j and i. The eight experiments considered in this paper are performed by two different labs. In one study, two technical replicates are conducted at time 30 for each of the two proteins. In the other, single experiments are conducted for non-activated T-cells. Given the different cell lines used in the two studies, the experiments from the two different labs cannot be considered as biological replicates. However, the framework described in this paper would be flexible enough to allow for the situation where different replicates are conducted in different labs (and using different antibodies).
Table 1. Fitting results by mixture of two negative binomial distributions: mixture parameter estimates (second to fifth column), offset value k (sixth column), corresponding estimate of ChIP efficiency (IPE; seventh column) and number of enriched regions at a controlled 0.1% FDR (last column).
Quantifying IP efficiencies of ChIP-seq experiments
The mixture model that best fits the data can be further used to derive an estimate of the IP efficiency of a ChIP-seq experiment. In the literature, this is often done using informal ad-hoc measurements, e.g. by the ratio of hybridization values at the top 1% of bound sites to the bottom 10%, which are taken to represent background levels of binding, or by the relative level of protein binding with respect to control regions. In general, ChIP efficiency is often thought of in terms of a ratio between the total number of counts in the enriched regions and the total number of counts in the background regions. In the context of our paper, such a quantity can be estimated by taking the ratio of the expected counts in the signal regions, $\mu_S$, to the expected counts in the background regions, $\mu_B$. However, such a measure would not account for overdispersion, or, in general, for more complex distributions of the background and signal components. For this reason, we present a more general measure of IP efficiency in terms of the separation of the signal and background components of the mixture model. An efficient experiment will generate well separated signal and background components, whereas a less efficient experiment will generate two more overlapping components. In the Methods section, we provide a formal derivation of this IP efficiency estimate.
Table 1 reports the corresponding IP efficiencies for the eight experiments on p300 and CBP. These estimates reflect existing knowledge on the specificities of the antibodies used for the different proteins, e.g. in one study the efficiencies of the p300 experiments are larger than those of the CBP experiments, whereas the opposite is observed in the other study. Furthermore, it is interesting to note quite a large difference in ChIP efficiency between technical replicates, which is reflected also in the parameter estimates (e.g. differences in the signal and background means for the CBP technical replicates). These different ChIP efficiencies, if not accounted for, can potentially lead to erroneous biological conclusions.
Accounting for ChIP efficiencies in the detection of enriched regions
ChIP efficiencies need to be properly accounted for in the detection of the regions bound by a protein from the available ChIP-seq data. After fitting a mixture model to count data, the estimates for all the parameters in the model, that is $p_c$, $\theta^{S}_{cji}$ and $\theta^{B}_{cji}$, are used to select the regions enriched by p300 and CBP, respectively. A common procedure for mixture models is to set a cut-off on the posterior probabilities of non-enrichment, $P(X_{mc} = 0 \mid y)$, for region m and condition c. We choose this threshold using a controlled False Discovery Rate (FDR) of 0.1%, as detailed in the Methods section. The last column of Table 1 gives the number of enriched regions for each condition, in terms of the 1000 bp windows used in the analysis. As technical replicates are modelled jointly, a single list of enriched regions is detected for these experiments.
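The thresholding step described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: it assumes the commonly used estimate in which the FDR of a selected set of bins is the average posterior probability of non-enrichment over that set. The function name `select_enriched` and the example probabilities are hypothetical.

```python
def select_enriched(post_null, alpha=0.001):
    """Return indices of bins called enriched at estimated FDR <= alpha.

    post_null[m] is the posterior probability that bin m is NOT enriched.
    Assumption: the FDR of a selected set is estimated by the mean of
    post_null over that set (a standard choice for mixture models).
    """
    # Visit bins from the most to the least confidently enriched.
    order = sorted(range(len(post_null)), key=lambda m: post_null[m])
    selected, running_sum = [], 0.0
    for rank, m in enumerate(order, start=1):
        running_sum += post_null[m]
        if running_sum / rank > alpha:
            break  # adding this bin would push the estimated FDR above alpha
        selected.append(m)
    return sorted(selected)


# Hypothetical posterior probabilities of non-enrichment for five bins.
example = [0.0001, 0.0005, 0.2, 0.9, 0.001]
enriched_bins = select_enriched(example, alpha=0.001)
```

Because the bins are visited in increasing order of their non-enrichment probability, the running mean never decreases, so stopping at the first violation gives the largest admissible set.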
Table: FDR values when the same number of enriched regions is assumed for all eight experiments (columns: 31689 bound regions; 65768 bound regions).
Table 3. Binding sites for Ramos T30 using separate models for replicates and taking the intersection (top) and the union (bottom) of regions identified by individual analyses at a 0.1% FDR (column 2), compared to a joint analysis of replicates at the same FDR (column 3). Each panel has one row for CBPT301 & CBPT302 and one row for p300T301 & p300T302.
for two replicates $Y_1$ and $Y_2$, sharing the latent binding profiles X, where we estimate the posterior probability P(Y1,Y2|X) from the two separate analyses and we take P(X = 1) as an average of the two estimates from the two separate analyses. When setting a 0.1% FDR cutoff on each individual analysis, this method returns an estimated FDR of 4.0e-8 and 4.5e-8 for CBP and p300, respectively, for the commonly detected regions. We use these FDR values for the joint modelling results of Table 3 (top). Note that these values are smaller than the 0.1% cutoff chosen for Table 1, thus returning a smaller number of enriched regions for the joint modelling approach. Similar results are obtained by taking the union of separate analyses, rather than the intersection, that is by considering regions that are detected by at least one of the two separate analyses (Table 3, bottom). The FDR of the union of regions is similar to that of individual experiments, but the joint modelling approach consistently finds many more regions than the separate analyses.
Detection of differentially bound regions
When we have data on two or more proteins, or on one protein and a control, an interesting question is to find the regions that are differentially bound by the two proteins of interest. These are the regions with a large difference in the probabilities of enrichment, $P(X_{mc} = 1 \mid y)$, for the two proteins. Antibody efficiencies also play a role in this as, generally, one would expect to find more regions preferentially bound by the protein for which the more efficient experiment is conducted than by the protein from the less efficient experiment, simply owing to the two different antibodies used. Indeed, this is the case for the two studies [13, 14] mentioned in the introduction. In the literature, techniques which can detect peaks or enriched regions for a single experiment against a control, e.g. MACS, ChIPDiff or MOSAiCS, can also be used to detect differentially bound regions for two proteins. Here, the general procedure is to use the experiment from the other protein as a control. However, this method lacks formal probability definitions on the difference between the two experiments. Furthermore, it is not implementable for those peak-finder methods that do not use control information. More recent methods, such as ChIPnorm, allow two experiments on two proteins to be compared at the same time, but somewhat sidestep the issue of different IP efficiencies by focussing on regions with high signal to background ratio and normalizing the counts on these regions only. Finally, one of the latest methods, DBChIP, allows the inclusion of biological replicates in the model, but does not account for their different IP efficiencies in the detection of enriched and differentially bound regions.
where $P(X_{mc} = 0 \mid Y_c)$ is the probability that the mth bin is not enriched for protein c, estimated by the model described above and from all the data on protein c. In this way, all replicates under the same condition are considered in the estimation of the posterior probabilities, returning a more robust set of differentially bound regions.
Table: Number of differentially bound regions at 5% FDR; at T30, where technical replicates are available, the results are given both for the case where the joint model of replicates is used (first column) and for the case where the union of two separate analyses is used (CBPT301 versus p300T301 and CBPT302 versus p300T302, second column). Reported quantities: number of regions bound only by CBP; number of regions bound only by p300.
Table: Number of differentially bound regions at 5% FDR when making an assumption of the same number of binding sites for the two proteins ($p_1 = p_2$), compared to the case when this assumption is not made ($p_1 \neq p_2$); the last column reports the difference in the BIC values of the two models (a positive difference means a better fit for the model that assumes $p_1 = p_2$). Columns: number of regions bound only by CBP ($p_1 \neq p_2$ and $p_1 = p_2$); number of regions bound only by p300 ($p_1 \neq p_2$ and $p_1 = p_2$); BIC difference.
Different antibodies are used for ChIP-seq experiments for different proteins, and these have different levels of specificity. On top of this, different ChIP efficiencies are observed even for replicated experiments on the same protein. This results in different signal to background ratios for ChIP-seq generated data, and consequently, in a different percentage of expected enriched and non-enriched regions. We have used simple arguments to show how this is the case, how the ChIP efficiency of an experiment can be quantified from the data and how different ChIP efficiencies for different experiments can lead to misleading biological conclusions if not accounted for in the statistical analysis. This is shown both for the detection of enriched regions and of differentially binding regions, for which a new test is proposed. In the exposition, we focus on the detection of broad regions, such as those marked with histone modifications, and we do not use any information about peak-shape or reads from opposite strands.
We have used a mixture of negative binomial distributions to present the results in this paper. One important point of the paper is that a mixture model approach, such as the one presented here, makes it possible to account for the ChIP efficiency of an experiment: less efficient experiments are modelled by more overlapping signal/background mixtures than more efficient experiments. In our results, we fitted this model to count data on 1000 bp windows. The relatively large window size is motivated by the fact that the mixture model considered here does not account for Markov properties in the data. More sophisticated statistical models of ChIP-seq data, such as HMMs or random field models, or more sophisticated distributions, such as zero-inflated Poisson or negative binomial distributions, e.g. [23, 25], can be used within the same framework described in this paper, and are currently under investigation. Similarly, more robust estimates of background distributions can be used, e.g. [3, 23]. Current research is looking at an extension of the joint model approach presented in this paper to one where read mappability and GC content are directly included in the model specification. Furthermore, most of the available normalization methods, e.g. [5-7], work with a pre-defined set of enriched regions and often make use of control experimental data to further improve the identification of enriched regions. The regions detected by the method proposed in this paper could be further used as part of existing normalization procedures.
A second important point of the paper is that estimation of the parameters of the mixture model is performed jointly, from all the available data. In particular, the knowledge of experiments being technical or biological replicates puts some constraints on the parameter space: the parameter $p_c$ that is discussed in the paper is the same for all technical and biological replicates, as these naturally share the same binding profile. This parameter, as well as all the other parameters in the model, is estimated from the data by maximum likelihood, using an Expectation-Maximization (EM) algorithm. Given the parameter estimates, the final point of the paper is to show how these can be appropriately used to make a decision about which regions in the genome are enriched, and which are differentially bound in the case of two proteins.
We use real ChIP-seq data on two histone modifiers, p300 and CBP, to show how a joint modelling approach for ChIP-seq data, which properly accounts for the different ChIP efficiencies, is able to identify a larger number of enriched regions than a standard approach, where individual models are fitted to individual experiments and the results of individual analyses are subsequently combined. The regions identified by the joint modelling approach have been validated by TSS overlap and ChromHMM and have generally shown similar enrichment of chromatin features to the regions detected by individual analyses. Additional a priori biological knowledge, such as the expectation of the same number of binding sites for two different proteins, can also be included in the model and is found to return more realistic numbers of differentially bound regions, with a smaller number of regions bound by the protein for which a more efficient experiment is conducted and therefore an expectation of a smaller number of false positives. Further work will be conducted to validate these regions biologically.
The methods described in this paper are implemented in the R package enRich, which is available in CRAN. The input to the main function in this package is count data for a number of bins and a number of experiments, together with information about which experiments are replicates, which experiments are thought to have the same number of binding profiles, which two proteins (if available) should be compared for differential enrichment, and an FDR cut-off for the selection of regions. The output of the function is a list of enriched regions for each protein and each condition and the list of differentially bound regions at the specified FDR cut-off.
The data: pre-processing and validation
The ChIP-seq data on p300 and CBP analysed in this paper were generated by two different labs [13, 14]. In the first study, CBP and p300 binding is profiled in human T98G cells at time point 0 (T0), where cells are serum starved and where CBP or p300 is restricted to a limited set of genes, and at 30 minutes after stimulation with tetradecanoyl phorbol acetate (T30). For the latter condition, there are two technical replicates (T301 and T302) and it is known that the ChIP efficiency in the second replicate is higher than in the first. In the second study, CBP and p300 binding was evaluated in resting CD4+ T cells. We will use the protein names followed by T0, T301 and T302 to refer to the six experiments of the first study, use T30 for the combination of the T301 and T302 results and use Wang followed by the protein names to refer to the two experiments of the second study.
All sequence reads were aligned to the human genome (build hg18) using BWA version 0.5.9 with default settings. We divide the whole genome into 1000 base pair windows and summarise the raw counts for each window by the number of tags whose first position is in the window. To account for a possible mappability problem, we delete the bins which are not covered by any of the experiments mentioned above, resulting in 7.67% of bins being deleted in total. Furthermore, we exclude from the analysis genomic regions that have been found to exhibit anomalous or unstructured read counts (http://hgdownload.cse.ucsc.edu/goldenPath/hg18/encodeDCC/wgEncodeMapability/wgEncodeDukeRegionsExcluded.bed6.gz). The 2,832,221 remaining regions are considered for the analysis. All the results for enriched and differentially bound regions are given in terms of these 1000 bp bins and are provided as Additional file 1. These bins could be further processed by joining consecutive bins into regions.
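The binning and filtering steps above can be sketched in a few lines. This is an illustrative example only: the function names `bin_counts` and `drop_uncovered`, the 0-based coordinate convention and the toy tag positions are assumptions made here, not part of the original pipeline.

```python
from collections import Counter


def bin_counts(first_positions, bin_size=1000):
    """Count tags per fixed-width window.

    first_positions: iterable of (chromosome, position) pairs, where position
    is the first mapped base of a tag (assumed 0-based in this sketch).
    Returns a Counter keyed by (chromosome, window_index).
    """
    counts = Counter()
    for chrom, pos in first_positions:
        counts[(chrom, pos // bin_size)] += 1
    return counts


def drop_uncovered(per_experiment_counts):
    """Return the windows covered by at least one tag in at least one experiment."""
    covered = set()
    for counts in per_experiment_counts:
        covered.update(key for key, n in counts.items() if n > 0)
    return covered


# Toy first positions for two hypothetical experiments.
tags_a = [("chr1", 10), ("chr1", 999), ("chr1", 1000), ("chr2", 2500)]
tags_b = [("chr1", 5), ("chr2", 2999)]
counts_a = bin_counts(tags_a)
counts_b = bin_counts(tags_b)
kept = drop_uncovered([counts_a, counts_b])
```

Windows that never appear in `kept` correspond to the bins that are removed before model fitting.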
Overlap with Transcription Start Sites (TSSs) was assessed in Galaxy (https://main.g2.bx.psu.edu), using the first (plus strand) and the last (minus strand) positions of UCSC annotated genes. We consider a bin as containing a TSS when there is at least 1 bp overlap with an annotated TSS. Enrichment of the detected regions with chromatin features was assessed using ChromHMM. The method is based on a hidden Markov model, which takes as input the binary vector of enriched and not-enriched regions, obtained from the method described in this paper at a specified FDR cutoff, and gives as output the predicted state for each region. We consider a model with 4 states, as we find that these are enough to capture the diversity of the detected regions in terms of chromatin feature enrichment. The resulting predicted states are evaluated for enrichment using a number of external annotations. In particular, we use the Broad ChromHMM classification, available from the UCSC genome browser, and select the following categories: RefSeq exons, silent DNA (Heterochromatin), promoters ready to start transcription (PoisedPromoter), active and weak promoters (ActivePromoter and WeakPromoter, respectively), strong and weak enhancers (StrongEnhancer and WeakEnhancer, respectively).
The joint latent mixture model: parameter estimation
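1. We choose a grid of values for the offset k from 0 to some user-defined largest minimum observable tag count, for which we set a default of 10. The parameters of the mixture distributions depend on the choice of k.
2. Since $X_m$ is unobserved, we use an EM algorithm to estimate the parameters $\Theta = (p, \theta_S, \theta_B)$ for a fixed value of k. The complete log likelihood for counts Y and unobserved indicators X is given by

$$\ell(\Theta; Y, X) = \sum_{m} \Big[ X_m \log\big(p \, f_S(y_m \mid \theta_S)\big) + (1 - X_m) \log\big((1 - p) \, f_B(y_m \mid \theta_B)\big) \Big],$$

where $f_S$ is the signal density for the given offset k and $f_B$ is the background density. Then the E- and M-steps for the tth iteration are as follows: the E-step computes the posterior probabilities of enrichment

$$\tau_m^{(t)} = \frac{p^{(t)} f_S(y_m \mid \theta_S^{(t)})}{p^{(t)} f_S(y_m \mid \theta_S^{(t)}) + (1 - p^{(t)}) f_B(y_m \mid \theta_B^{(t)})},$$

and the M-step sets $p^{(t+1)} = \frac{1}{M} \sum_m \tau_m^{(t)}$ and updates $\theta_S$ and $\theta_B$ by maximising the log likelihoods of the signal and background components, weighted by $\tau_m^{(t)}$ and $1 - \tau_m^{(t)}$, respectively.
3. We calculate the marginal likelihood for each pair of offset k and mixture parameters Θ and choose the pair which gives the largest likelihood value.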
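The three steps above can be sketched for a single experiment as follows. This is a simplified illustration and not the enRich implementation: it assumes Poisson components (so that the M-step has closed-form updates), assumes that the offset enters the signal component as a shift, with the signal count minus k following a Poisson distribution, and uses toy counts. All function names are hypothetical, and there are no safeguards against degenerate fits.

```python
import math


def pois_logpmf(y, lam):
    # Log Poisson pmf; -inf for negative arguments (counts below the offset).
    if y < 0:
        return -math.inf
    lam = max(lam, 1e-10)
    return y * math.log(lam) - lam - math.lgamma(y + 1)


def mixture_loglik(counts, k, p, lam_s, lam_b):
    # Marginal log likelihood of p * Pois(y - k; lam_s) + (1 - p) * Pois(y; lam_b).
    total = 0.0
    for y in counts:
        fs = p * math.exp(pois_logpmf(y - k, lam_s))
        fb = (1 - p) * math.exp(pois_logpmf(y, lam_b))
        total += math.log(fs + fb)
    return total


def em_fixed_k(counts, k, n_iter=200):
    """Run EM for a fixed offset k; returns None if no count reaches k."""
    shifted = [y - k for y in counts if y >= k]
    if not shifted:
        return None
    n = len(counts)
    p = 0.5
    lam_b = max(0.5 * sum(counts) / n, 0.1)
    lam_s = sum(shifted) / len(shifted) + 1.0
    trace = []
    for _ in range(n_iter):
        trace.append(mixture_loglik(counts, k, p, lam_s, lam_b))
        # E-step: posterior probability that each bin belongs to the signal.
        tau = []
        for y in counts:
            fs = p * math.exp(pois_logpmf(y - k, lam_s))
            fb = (1 - p) * math.exp(pois_logpmf(y, lam_b))
            tau.append(fs / (fs + fb))
        # M-step: closed-form weighted updates for the Poisson case.
        w = sum(tau)
        p = w / n
        lam_s = sum(t * (y - k) for t, y in zip(tau, counts)) / w
        lam_b = sum((1 - t) * y for t, y in zip(tau, counts)) / (n - w)
    loglik = mixture_loglik(counts, k, p, lam_s, lam_b)
    return {"k": k, "p": p, "lam_s": lam_s, "lam_b": lam_b,
            "loglik": loglik, "trace": trace}


def fit_with_offset_grid(counts, k_max=10):
    """Fit the mixture for every offset on the grid and keep the best fit."""
    fits = [em_fixed_k(counts, k) for k in range(k_max + 1)]
    return max((f for f in fits if f is not None), key=lambda f: f["loglik"])


# Toy bin counts: many low-count background bins and a few high-count bins.
toy_counts = [0] * 50 + [1] * 30 + [2] * 12 + [3] * 5 + [12, 14, 15, 15, 17, 18, 20, 22]
best = fit_with_offset_grid(toy_counts)
```

The negative binomial case follows the same structure, with the closed-form M-step replaced by a numerical optimisation over the dispersion parameters.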
The special cases of Poisson and negative binomial
When analysing deep-sequencing data, it is quite common to consider either a Poisson or a Negative Binomial (NB) distribution for the mixture components. In what follows, we give more details on the EM-algorithm implementation in the case of mixtures of Poisson and NB distributions, respectively.
Given that no closed-form solutions can be found for the ϕ parameters, we use the optim function in R for this optimization.
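For reference, the M-step objective for a single negative binomial component can be written out as below. This is a sketch under an assumed parametrisation (mean $\mu$ and size $\phi$, so that the variance is $\mu + \mu^2/\phi$), which may differ from the one used in the original derivation. The symbols $z_m$ (the count entering the component) and $w_m$ (the E-step weight of bin m for that component) are notation introduced here for illustration.

```latex
Q(\mu, \phi) = \sum_{m=1}^{M} w_m \Big[ \log\Gamma(z_m + \phi) - \log\Gamma(\phi) - \log z_m!
  + \phi \log\frac{\phi}{\mu + \phi} + z_m \log\frac{\mu}{\mu + \phi} \Big],
\qquad
\frac{\partial Q}{\partial \mu} = \sum_{m=1}^{M} w_m \frac{\phi\,(z_m - \mu)}{\mu\,(\mu + \phi)} = 0
\;\Longrightarrow\;
\hat{\mu} = \frac{\sum_m w_m z_m}{\sum_m w_m}.
```

Under this parametrisation the mean update is a weighted average for any fixed $\phi$, whereas setting $\partial Q / \partial \phi$ to zero involves digamma terms and has no closed-form root, which is consistent with the need for a numerical optimiser for $\phi$.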
Combining information from replicates in the detection of enriched regions
In this section, we show how the framework described above can be used for the joint analysis of technical and biological replicates. Since replicates are made under the same condition c, the latent binding profiles $X_{mc}$ are the same for these experiments, and consequently so is the parameter $p_c$.
Including this assumption in the model is expected to lead to a more robust detection of the enriched regions, particularly when different IP efficiencies are observed for each experiment. This framework would be suited also to the case when different antibodies are used for the different replicates, such as experiments on the same protein conducted in different laboratories.
Note that a single probability of enrichment is derived under condition c by combining all replicates under this condition.
This allows the different IP efficiencies to be accounted for in the detection of enriched regions.
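A minimal sketch of how a single probability of enrichment can be obtained from several replicates is given below. It assumes that, given the latent state of a bin, the counts of the replicates are independent, so that the replicate-specific signal (or background) densities multiply; this assumption, the function name `joint_posterior_enriched` and the example density values are illustrative and not taken from the original text.

```python
import math


def joint_posterior_enriched(p_c, log_f_signal, log_f_background):
    """Posterior probability that a bin is enriched, pooling all replicates.

    p_c: prior probability of enrichment shared by the replicates.
    log_f_signal[i], log_f_background[i]: log-density of the bin's count in
    replicate i under that replicate's own signal / background component.
    Assumption: replicate counts are independent given the latent state.
    """
    log_num = math.log(p_c) + sum(log_f_signal)
    log_alt = math.log(1.0 - p_c) + sum(log_f_background)
    # Evaluate num / (num + alt) on the log scale for numerical stability.
    top = max(log_num, log_alt)
    num = math.exp(log_num - top)
    alt = math.exp(log_alt - top)
    return num / (num + alt)


# Hypothetical densities for one bin in an efficient and a less efficient replicate.
f_s = [0.05, 0.04]    # signal densities, replicates 1 and 2
f_b = [0.0001, 0.02]  # background densities, replicates 1 and 2
posterior = joint_posterior_enriched(
    0.1, [math.log(v) for v in f_s], [math.log(v) for v in f_b]
)
```

Each replicate contributes through its own signal and background densities, so a replicate with well separated components carries more weight in the pooled probability than a replicate with overlapping components.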
Detection of differentially bound regions
where $P(X_{mc} = 0 \mid Y_c)$ is the probability that the mth bin is not enriched for protein c, estimated by the model described above from all the data on protein c, at the same time point.
We define Z as a random variable indicating the common binding profiles of two proteins, that is $Z_m = 1$ if $X_{m1} \neq X_{m2}$ and $Z_m = 0$ if $X_{m1} = X_{m2}$. Then, $P(Z_m = 0) = P(X_{m1} = X_{m2})$ and a cutoff can be set on the probabilities of differential binding by controlling a predefined FDR value, using the same formula defined in (2).
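Written out, the probability of a common binding state can be expressed in terms of the per-protein posterior probabilities. The expression below is a sketch that assumes the binding states of the two proteins are independent given their respective data, a simplifying assumption made here for illustration.

```latex
P(Z_m = 0 \mid Y) = P(X_{m1} = 0 \mid Y_1)\, P(X_{m2} = 0 \mid Y_2)
                  + P(X_{m1} = 1 \mid Y_1)\, P(X_{m2} = 1 \mid Y_2),
\qquad
P(Z_m = 1 \mid Y) = 1 - P(Z_m = 0 \mid Y).
```

As a worked example with hypothetical values $P(X_{m1} = 1 \mid Y_1) = 0.99$ and $P(X_{m2} = 1 \mid Y_2) = 0.02$, this gives $P(Z_m = 0 \mid Y) = 0.01 \times 0.98 + 0.99 \times 0.02 = 0.0296$, so the bin would be called differentially bound with probability 0.9704.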
Estimating ChIP efficiencies
with f_B and f_S denoting the background and signal densities, respectively, and assuming that the counts at different locations are independent.
This quantity varies between 0.5 and 1: it equals 0.5 for perfectly overlapping components (inefficient experiment) and 1 for perfectly separated components (efficient experiment). Real estimates lie between these two extremes, and the higher the value, the more efficient the experiment. The formula can be used to estimate ChIP efficiency for mixture models with any two distributions and could easily be extended to more than two mixture components.
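The behaviour at the two extremes can be checked numerically. The sketch below does not reproduce the efficiency formula itself; as an assumption, it uses one quantity with the stated properties, namely the expected value, over counts drawn from the signal density, of f_S(y) / (f_S(y) + f_B(y)). For Poisson components this is a finite sum that equals 0.5 when the two densities coincide and approaches 1 as they separate. The function name `separation_index` and the truncation rule are hypothetical.

```python
import math

def pois_pmf(y, lam):
    """Poisson probability mass at y."""
    return math.exp(y * math.log(lam) - lam - math.lgamma(y + 1))

def separation_index(lam_b, lam_s):
    """Sum over y of f_S(y) * f_S(y) / (f_S(y) + f_B(y)) for Poisson components."""
    top = max(lam_b, lam_s)
    y_max = int(top + 10 * math.sqrt(top) + 50)  # truncation point, far in the tail
    total = 0.0
    for y in range(y_max + 1):
        f_b, f_s = pois_pmf(y, lam_b), pois_pmf(y, lam_s)
        if f_b + f_s > 0:  # skip terms where both densities underflow
            total += f_s * (f_s / (f_s + f_b))
    return total

print(separation_index(5.0, 5.0))    # identical components
print(separation_index(2.0, 5.0))    # partly overlapping components
print(separation_index(1.0, 100.0))  # well separated components
```

The same sum can be written for NB components by replacing `pois_pmf` with the corresponding probability mass function.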
This work was supported by the Biotechnology and Biological Sciences Research Council [BB/H017275/1 to Y.B.]; the European Commission 7th Framework Program GEUVADIS [project nr. 261123 to P.’t H.]; and the Centre for Medical Systems Biology within the framework of the Netherlands Genomics Initiative/Netherlands Organisation for Scientific Research. The authors are grateful to the anonymous reviewers for their helpful suggestions which greatly improved the original manuscript.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.