- Methodology article
- Open Access
Shape-based peak identification for ChIP-Seq
- Valerie Hower^{1}Email author,
- Steven N Evans^{1, 2} and
- Lior Pachter^{1, 3}Email author
https://doi.org/10.1186/1471-2105-12-15
© Hower et al; licensee BioMed Central Ltd. 2011
- Received: 3 June 2010
- Accepted: 12 January 2011
- Published: 12 January 2011
Abstract
Background
The identification of binding targets for proteins using ChIP-Seq has gained popularity as an alternative to ChIP-chip. Sequencing can, in principle, eliminate artifacts associated with microarrays, and cheap sequencing offers the ability to sequence deeply and obtain a comprehensive survey of binding. A number of algorithms have been developed to call "peaks" representing bound regions from mapped reads. Most current algorithms incorporate multiple heuristics, and despite much work it remains difficult to accurately determine individual peaks corresponding to distinct binding events.
Results
Our method for identifying statistically significant peaks from read coverage is inspired by the notion of persistence in topological data analysis and provides a non-parametric approach that is statistically sound and robust to noise in experiments. Specifically, our method reduces the peak calling problem to the study of tree-based statistics derived from the data. We validate our approach using previously published data and show that it can discover previously missed regions.
Conclusions
The difficulty in accurately calling peaks for ChIP-Seq data is partly due to the difficulty in defining peaks, and we demonstrate a novel method that improves on the accuracy of previous methods in resolving peaks. Our introduction of a robust statistical test based on ideas from topological data analysis is also novel. Our methods are implemented in a program called T-PIC (T ree shape P eak I dentification for C hIP-Seq) is available at http://bio.math.berkeley.edu/tpic/.
Keywords
- Coverage Function
- Enrichment Score
- Lattice Path
- Motif Enrichment
- Peak Calling
Background
With rapidly decreasing costs of sequencing, next-generation sequencing assays are increasingly being used for molecular measurements [1]. These techniques generate millions of short reads and massive data sets, making it computationally challenging to properly analyze the data. One such assay, called ChIP-Seq (chromatin immunoprecipitation followed by sequencing), is used to determine DNA binding sites of a protein (see [2, 3] for a review). In ChIP-Seq, protein is first cross-linked to DNA and the fragments subsequently sheared. Following a size selection step that enriches for fragments of specified lengths, the fragments ends are sequenced, and the resulting reads are aligned to the genome. Reads pile up at bound regions referred to as "peaks", but due to mapping challenges and biases in various aspects of existing protocols, identifying peaks is not a straightforward task.
While there are many current algorithms for analyzing ChIP-Seq data (see [4] for a recent review), there is still room for improvement as most rely on adhoc heuristics including coverage thresholds and poorly motivated filters. In particular, while existing methods rely on depth of coverage to determine likely binding sites using statistical methods, the determination of regions of binding, i.e. peak boundaries, is frequently based on heuristics.
We present a novel approach for calling peaks that is based on evaluating the significance of a robust test statistic that measures the extent of pile-up of reads. Specifically, we use define and evaluate the "shape" of putative peaks to differentiate between random and nonrandom fragment placement on the genome. We compare our predictions to two state-of-the-art methods (based on comparisons in [4, 5]) using two published data sets and demonstrate improved performance.
Results and Discussion
Algorithm
Overview of the algorithm
The input to our algorithm consists of the aligned reads for both the sample and input control. We create a 'coverage function'--a map f from the genomic coordinates to the non-negative integers-- by extending each of the aligned sample reads to the average fragment length L. The 'height' f(t) at a nucleotide t is the number of such extended reads that contain t. This piecewise constant function is the data that we analyze.
We will flag peaks in the coverage function that are, in a suitable sense, 'anomalous' as being likely protein binding sites. In order to turn this some-what vague idea into a well-founded statistical inference procedure we require two basic ingredients. Firstly, we need a numerical test statistic that measures some feature of a peak such that peaks which result in extreme values of the test statistic might reasonably be expected to be binding sites. Secondly, in order to calibrate whether a value of the test statistic is so large that it is difficult to explain as simply being the consequence of random fluctuations (and thus indicates the presence of a binding site) we need a stochastic model of the coverage function for the 'null' situation when we are in a region of the genome that doesn't contain a binding site.
A tree shape statistic ℳ to measure "peakness"
The most obvious test statistic is simply the height of a peak. However, such a statistic reflects the depth of coverage at a single site, and ignores valuable information in the form of the coverage depth in the neighboring region. Motivated by current work in topological data analysis (TDA) [6], we propose the following more synoptic measure of a peak's shape that incorporates information in the neighborhood of each site and therefore allows for defining binding regions, and not just sites.
A null stochastic model of the coverage function
Following [7], we model the 'null' or 'background' placement of read starting locations in some region of the genome as a homogeneous Poisson process with rate ρ. That is, we replace the discrete set of nucleotide positions by a continuous interval and suppose that the distances between the starts of successive reads are independent random variables that each have an exponential distribution with mean ${\scriptscriptstyle \frac{1}{\rho}}$. The value of the coverage function at some position t is then just the number of points of the Poisson process that happen to fall in the interval [t - L, t]. This random variable has a Poisson distribution with mean θ = ρL; that is, the mean height of the coverage function at any fixed location is θ.
Even in the absence of binding, some genomic regions systematically receive a large number of fragments while others receive very few [10]. Hence, it would be inappropriate to use the same rate ρ for the entire genome and it is necessary to first divide the genome into regions across each of which we expect the background to be homogeneous and assign an individual rate to each one. We describe our procedure for determining these regions and estimating the local rates later.
The following consequences of this Poisson null model are established in [7].
for k ≥ 1. The quantity p(k) is just the conditional probability that, for any fixed location t, a new read starts somewhere after t before any of the extended reads covering t end, given that there are k such extended reads.
Secondly, the random tree T constructed from an excursion of the coverage function above the level h is a Galton-Watson tree with generation-dependent geometric offspring distributions: the root is at height h, the probability a vertex at height k > h has n offspring (that is, it is connected to n vertices at height k + 1) is p(k)^{ n } (1 - p(k)), n ≥ 0, and these family sizes are independent. We could use this observation to simulate independent copies of T and to obtain a Monte-Carlo approximation of the distribution of the null distribution of ℳ(T). Instead, we simulate independent copies of the appropriate random lattice path and construct copies of T from them; that is, to construct a copy of the random lattice path we start at height h, we move to height h + 1 at the first step, at succeeding steps we move up or down with respective probabilities p(k) and 1 - p(k) when we are at height k, and we stop when we return to height h.
subject to the normalization ${\sum}_{i}\pi \left(i\right)}=1$ [11, §6.4].
Subdividing the genome into regions
for the data along R_{ j } .
Initial filtering of possible peaks
where θ is the estimated expected height of the coverage function on R and C is a user-specified parameter. Note that h_{ R } increases as C decreases. We use C = 7 in our analysis.
Identifying peaks and correcting for multiple hypotheses
For a homogeneous region R, consider a random variable obtained by evaluating our statistic ℳ on a tree built from an excursion of the coverage function above the level h_{ R } under the null model. Let G_{ R } (m) be the probability that such a random variable exceeds m. In order to approximate G_{ R } , we simulate 30,000 random trees with root at height h_{ R } via the method described above of simulating the associated lattice path.
We find the segments in the observed coverage function that correspond to excursions above h_{ R } that are at least 10 base pairs long. We build the lattice path and tree associated with each such excursion. We then compute the value ℳ(T) of our statistic ℳ for each such tree T and assign the 'p-value' G_{ R } (ℳ(T)) to T.
Testing
We tested T-PIC by predicting binding sites for publicly available data sets. Rather than comparing T-PIC to every possible peak caller, we identified PeakSeq [14] and MACS [15] based on previous studies [4, 5] as being the best current programs, and restricted our comparisons to them.
Binding site prediction using published data sets
Samples used in comparison analysis
Samples used in comparison analysis. | ||||
---|---|---|---|---|
Protein | Sample | # of Mapped Reads | # of Input Mapped Reads | Reference |
cad | D. melanogaster | 4,695,843 | 5,275,977 | [16] |
gt | D. melanogaster | 4,702,233 | 13,952,235 | [16] |
hb1 | D. melanogaster | 3,470,895 | 13,952,235 | [16] |
hb1 | D. melanogaster | 3,018,544 | 13,952,235 | [16] |
kr1 | D. melanogaster | 5,175,465 | 5,275,977 | [16] |
kr2 | D. melanogaster | 5,075,323 | 5,275,977 | [16] |
FoxA1 | MCF7 cells | 3,909,805 | 5,233,683 | [15] |
STAT1 | Stimulated Hela S3 cells | 26,731,492 | 19,476,469 | [14] |
Summary of called peaks.
Summary of called peaks. | ||||||
---|---|---|---|---|---|---|
Protein | Peak Caller | Mean Length | # of Peaks | % Found by T-PIC | % Found by MACS | % Found by PeakSeq |
cad | T-PIC | 990.9 | 8136 | 100 | 64.0 | 91.4 |
MACS | 1659.6 | 4601 | 95.7 | 100 | 99.9 | |
PeakSeq | 5278.3 | 11612 | 38.9 | 29.1 | 100 | |
gt | T-PIC | 896.1 | 4502 | 100 | 59.3 | 71.4 |
MACS | 1241.4 | 2929 | 85.6 | 100 | 89.3 | |
PeakSeq | 16030.8 | 3497 | 48.4 | 38.8 | 100 | |
hb1 | T-PIC | 978.5 | 7523 | 100 | 76.7 | 89.9 |
MACS | 1403.4 | 5640 | 93.9 | 100 | 99.9 | |
PeakSeq | 876.3 | 12072 | 57.8 | 53.7 | 100 | |
hb2 | T-PIC | 930.9 | 6392 | 100 | 75.6 | 87.4 |
MACS | 1321.2 | 4849 | 92.4 | 100 | 99.9 | |
PeakSeq | 545 | 11037 | 54.5 | 52.3 | 100 | |
kr1 | T-PIC | 883.0 | 11505 | 100 | 68.0 | 93.9 |
MACS | 1624.3 | 6490 | 98.3 | 100 | 99.9 | |
PeakSeq | 5189.1 | 12924 | 45.9 | 33.8 | 100 | |
kr2 | T-PIC | 884.0 | 11409 | 100 | 67.4 | 94.2 |
MACS | 1588.4 | 6393 | 98.3 | 100 | 100 | |
PeakSeq | 5040.9 | 13540 | 43.9 | 31.5 | 100 | |
FoxA1 | T-PIC | 510.7 | 17619 | 100 | 64.4 | 57.4 |
MACS | 394.1 | 13639 | 83.7 | 100 | 69.6 | |
PeakSeq | 391.6 | 10320 | 97.8 | 91.1 | 100 | |
STAT1 | T-PIC | 857.3 | 84465 | 100 | 36.8 | 62.5 |
MACS | 1342.3 | 29121 | 96.9 | 100 | 97.2 | |
PeakSeq | 573.8 | 62124 | 86.8 | 51.5 | 100 |
Validation of called peaks
Motif Enrichment
Motif Enrichment. | ||||
---|---|---|---|---|
Protein | Binding Motif | T-PIC | MACS | PeakSeq |
cad | ${\text{TTTAT}}_{\text{GA}}^{\text{TG}}$ | 0.805 | 0.971 | 0.895 |
gt | TTACGTAA | 2.347 | 1.59 | 1.042 |
hb1 | TTTTTT | 1.673 | 1.61 | 1.572 |
hb2 | TTTTTT | 1.722 | 1.641 | 1.956 |
kr1 | ${\text{A}}^{\text{G}}\text{ANGGGT}$ | 1.748 | 1.523 | 1.099 |
kr2 | ${\text{A}}^{\text{G}}\text{ANGGGT}$ | 1.732 | 1.508 | 1.01 |
FoxA1 | TGCATG | 2.547 | 1.682 | 1.976 |
STAT1 | TTCNNNGAA | 1.454 | 1.633 | 2.196 |
We then compared the called peaks to the results of independent qPCR experiments for STAT1 and FoxA1 proteins. For FoxA1, we used 26 true positives and 12 true negatives found in [22]. For STAT1, we used 20 true positive regions and 42 true negative regions found in [23]. T-PIC found 15 of 26 positives for FoxA1 and 18 of 20 positive regions for STAT1. MACS finds 14 of 26 positives for FoxA1 and 18 of 20 positive regions for STAT1. PeakSeq finds 13 of 26 positives for FoxA1 and 15 of 20 positive regions for STAT1. In terms of true negatives, T-PIC found 2 of 12 negatives for FoxA1 and 4 of 42 negative regions for STAT1, PeakSeq found 0 of 12 negatives for FoxA1 and 2 of 42 negative regions for STAT1, and MACS found 0 or 12 negatives for FoxA1 and 1 of 42 negative regions for STAT1. These results indicate that T-PIC has high sensitivity, finding more true positives than PeakSeq for both STAT1 and FoxA1 while finding more true positives than MACS for FoxA1. While our Specificity results on this experiment underperformed PeakSeq and MACS by analysis of prediction on true negatives, our results on the Drosophila experiment summarized in Table 1 show that we frequently call fewer peaks than PeakSeq. Moreover, both of the FoxA1 true negatives and 3 of the 4 STAT1 true negatives found by T-PIC pass PeakSeq's first pass of scoring. This means that they are potential peaks based on their height being extreme (and can therefore be considered "borderline" peaks). In general, accurate estimation of Specificity in peak calling is difficult because it is hard to rule out the validity of individual predicted peaks.
Robustness
To test for robustness against replicates, we used the two data sets for hunchback (antibodies 1 and 2) and kruppel (antibodies 1 and 2). For each antibody, we calculated the percentage of peaks that overlapped at least one peak from the other antibody for the same protein. The average percentage for T-PIC was 80.33, while MACS averaged 86.34 and PeakSeq averaged 78.37. We additionally analyzed the ChIP-Seq data for two sample lanes of the STAT1 data [18]. These two lanes came from replicate 2 and had a total of 8,938,780 mapped reads. We compared the predictions to those obtained using the full data set (a total of two replicates, six lanes, and 26,731,492 mapped reads). All three programs found fewer peaks with the smaller data set-- T-PIC predicted 72,778 peaks (13.8% fewer), MACS predicted 19,132 peaks (34.3% fewer), and PeakSeq found 32,232 peaks (48.1% fewer). Of the peaks found using replicate 2, 92.2% of T-PIC's called peaks overlapped peaks found using T-PIC and the entire data set. This compared favorably to both MACS (with 92.0%) and PeakSeq (with 95.1%). and suggests that T-PIC is as robust as other peak calling methods in terms of biological replicates.
Next, we tested for robustness against the input parameter L as during the size selection step, a researcher may not know the true average fragment length. Using the STAT1 data (having L = 200), we ran T-PIC with the additional L values: 150, 175, 225, and 250. On average, the peaks found using different L values overlapped 86.87% of the peaks called using L = 200. The lower values of L (150 and 175) resulted in more peaks than for L ≥ 200 and we found a higher percentage of the L = 200 peaks than the higher values of L (225 and 250). In comparison, PeakSeq also used the input parameter L. On average 93.14% of the PeakSeq's peaks were found by the different L values. Although the true average fragment length for single end sequenced data may not be known, one could determine L if doing paired end sequencing. Our results suggest that this is a good idea regardless of which peak caller is used.
Implementation
Parameters used in T-PIC
Parameters used in T-PIC. | ||
---|---|---|
Parameter | Brief Description | Value used in testing |
L | average fragment length | N/A(varies by experiment) |
minimum length of peak (in bp) | 10 | |
α | significance p-value | 0.01 |
width of interval used to calculate local rate γ(t) | 1,000 | |
K | minimum length of interval for discretizing γ | 10,000 (human) |
5,000 (D. Melano.) | ||
D | used in discretizing γ | 5 |
C | using in selecting height h | 7 |
number of random trees per region in simulation | 30,000 |
Conclusions
We have developed a novel approach to the analysis of ChIP-Seq data, that aims to discover bound regions of DNA by topological analysis of read coverage functions. Our method-T-PIC-is fast and freely available, making it suitable for general use. The approach compares favorably to two popular peak callers: PeakSeq and MACS. We find the majority of their called peaks while detecting additional sites of binding. Although we have focused on ChIP-Seq in this paper, the approach we describe to call peaks could also be of use in the analysis of other sequence based assays like for instance CLIP-Seq for protein-RNA interactions.
Declarations
Acknowledgements
SNE is supported in part by NSF grant DMS-0907630 and VH is funded by NSF fellowship DMS-0902723.
Authors’ Affiliations
References
- Wold B, Myers RM: Sequence census methods for functional genomics. Nat Meth 2008, 5: 19–21. 10.1038/nmeth1157View ArticleGoogle Scholar
- Barski A, Zhao K: Genomic location analysis by ChIP-Seq. Journal of Cellular Biochemistry 2009, 107: 11–18. 10.1002/jcb.22077View ArticlePubMedGoogle Scholar
- Park PJ: ChIP-seq: advantages and challenges of a maturing technology. Nat Rev Genet 2009, 10(10):669–680. 10.1038/nrg2641PubMed CentralView ArticlePubMedGoogle Scholar
- Wilbanks EG, Facciotti MT: Evaluation of Algorithm Performance in ChIP-Seq Peak Detection. PLoS ONE 2010, 5(7):.. 10.1371/journal.pone.0011471PubMed CentralView ArticlePubMedGoogle Scholar
- Laajala T, Raghav S, Tuomela S, Lahesmaa R, Aittokallio T, Elo L: A practical comparison of methods for detecting transcription factor binding sites in ChIP-seq experiments. BMC Genomics 2009, 10: 618. [http://www.biomedcentral.com/1471–2164/10/618] 10.1186/1471-2164-10-618PubMed CentralView ArticlePubMedGoogle Scholar
- Carlsson G: Topology and data. Bull Amer Math Soc (N.S.) 2009, 46(2):255–308. 10.1090/S0273-0979-09-01249-XView ArticleGoogle Scholar
- Evans S, Hower V, Pachter L: Coverage statistics for sequence census methods. BMC Bioinformatics 2010, 11: 430. 10.1186/1471-2105-11-430PubMed CentralView ArticlePubMedGoogle Scholar
- Evans SN: Probability and real trees, Volume 1920 of Lecture Notes in Mathematics. Berlin: Springer; 2008. Lectures from the 35th Summer School on Probability Theory held in Saint Flour, July 6-23, 2005Google Scholar
- Bhamidi S, Evans SN, Sen A:Spectra of large random trees. 2009. [http://www.citebase.org/abstract?id=oai:arXiv.org:0903.3589]Google Scholar
- Pepke S, Wold B, Mortazavi A: Computation for ChIP-seq and RNA-seq studies. Nat Meth 2009, 6(11s):S22-S32. 10.1038/nmeth.1371View ArticleGoogle Scholar
- Grimmett GR, Stirzaker DR: Probability and random processes. third edition. New York: Oxford University Press; 2001.Google Scholar
- Benjamini Y, Hochberg Y: Controlling the false discovery rate: a practical and powerful approach to multiple testing. J Roy Statist Soc Ser B 1995, 57: 289–300.Google Scholar
- Benjamini Y, Yekutieli D: The control of the false discovery rate in multiple testing under dependency. Ann Statist 2001, 29(4):1165–1188. 10.1214/aos/1013699998View ArticleGoogle Scholar
- Rozowsky J, Euskirchen G, Auerbach RK, Zhang ZD, Gibson T, Bjornson R, Carriero N, Snyder M, Gerstein MB: PeakSeq enables systematic scoring of ChIP-Seq experiments relative to controls. Nat Biotech 2009, 27: 66–75. 10.1038/nbt.1518View ArticleGoogle Scholar
- Zhang Y, Liu T, Meyer C, Eeckhoute J, Johnson D, Bernstein B, Nussbaum C, Myers R, Brown M, Li W, Liu XS: Model-based Analysis of ChIP-Seq (MACS). Genome Biology 2008, 9(9):R137. [http://genomebiology.com/2008/9/9/R137] 10.1186/gb-2008-9-9-r137PubMed CentralView ArticlePubMedGoogle Scholar
- Bradley RK, Li XY, Trapnell C, Davidson S, Pachter L, Chu HC, Tonkin LA, Biggin MD, Eisen MB: Binding Site Turnover Produces Pervasive Quantitative Changes in Transcription Factor Binding between Closely Related Drosophila Species. PLoS Biol 2010, 8(3):e1000343. 10.1371/journal.pbio.1000343PubMed CentralView ArticlePubMedGoogle Scholar
- Barrett T, Troup DB, Wilhite SE, Ledoux P, Rudnev D, Evangelista C, Kim IF, Soboleva A, Tomashevsky M, Marshall KA, Phillippy KH, Sherman PM, Muertter RN, Edgar R: NCBI GEO: archive for high-throughput functional genomic data. Nucl Acids Res 2009, 37(suppl 1):D885–890. [http://nar.oxfordjournals.org/cgi/content/abstract/37/suppl_1/D885] 10.1093/nar/gkn764PubMed CentralView ArticlePubMedGoogle Scholar
- Rozowsky J, Euskirchen G, Auerbach RK, Zhang ZD, Gibson T, Bjornson R, Carriero N, Snyder M, Gerstein MB:Supplemental code and data for PeakSeq: scoring ChIP-seq experiments relative to controls. [http://www.gersteinlab.org/proj/PeakSeq/]
- MACS Sample[http://liulab.dfci.harvard.edu/MACS/Sample.html]
- Kent WJ, Sugnet CW, Furey TS, Roskin KM, Pringle TH, Zahler AM, Haussler D: The Human Genome Browser at UCSC. Genome Research 2002, 12(6):996–1006. [http://genome.cshlp.org/content/12/6/996.abstract]PubMed CentralView ArticlePubMedGoogle Scholar
- MacArthur S, Li XY, Li J, Brown J, Chu HC, Zeng L, Grondona B, Hechmer A, Simirenko L, Keranen S, Knowles D, Stapleton M, Bickel P, Biggin M, Eisen M: Developmental roles of 21 Drosophila transcription factors are determined by quantitative differences in binding to an overlapping set of thousands of genomic regions. Genome Biology 2009, 10(7):R80. [http://genomebiology.com/2009/10/7/R80] 10.1186/gb-2009-10-7-r80PubMed CentralView ArticlePubMedGoogle Scholar
- Lupien M, Eeckhoute J, Meyer CA, Wang Q, Zhang Y, Li W, Carroll JS, Liu XS, Brown M: FoxA1 Translates Epigenetic Signatures into Enhancer-Driven Lineage-Specific Transcription. Cell 2008, 132(6):958–970. 10.1016/j.cell.2008.01.018PubMed CentralView ArticlePubMedGoogle Scholar
- Robertson G, Hirst M, Bainbridge M, Bilenky M, Zhao Y, Zeng T, Euskirchen G, Bernier B, Varhol R, Delaney A, Thiessen N, Griffith OL, He A, Marra M, Snyder M, Jones S: Genome-wide profiles of STAT1 DNA association using chromatin immunoprecipitation and massively parallel sequencing. Nat Meth 2007, 4(8):651–657. 10.1038/nmeth1068View ArticleGoogle Scholar
- R Development Core Team:R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria; 2010. [ISBN 3–900051–07–0] [http://www.R-project.org] [ISBN 3-900051-07-0]Google Scholar
- Noyes MB, Meng X, Wakabayashi A, Sinha S, Brodsky MH, Wolfe SA: A systematic characterization of factors that regulate Drosophila segmentation via a bacterial one-hybrid system. Nucleic Acids Research 2008, 36(8):2547–2560. 10.1093/nar/gkn048PubMed CentralView ArticlePubMedGoogle Scholar
- Wang ET, Sandberg R, Luo S, Khrebtukova I, Zhang L, Mayr C, Kingsmore SF, Schroth GP, Burge CB: Alternative isoform regulation in human tissue transcriptomes. Nature 2008, 456(7221):470–476. 10.1038/nature07509PubMed CentralView ArticlePubMedGoogle Scholar
- Bhinge AA, Kim J, Euskirchen GM, Snyder M, Iyer VR: Mapping the chromosomal targets of STAT1 by Sequence Tag Analysis of Genomic Enrichment (STAGE). Genome Research 2007, 17(6):910–916. [http://genome.cshlp.org/content/17/6/910.abstract] 10.1101/gr.5574907PubMed CentralView ArticlePubMedGoogle Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.