HiChIP: a high-throughput pipeline for integrative analysis of ChIP-Seq data
© Yan et al.; licensee BioMed Central Ltd. 2014
Received: 18 March 2014
Accepted: 11 August 2014
Published: 15 August 2014
Chromatin immunoprecipitation (ChIP) followed by next-generation sequencing (ChIP-Seq) has been widely used to identify genomic loci of transcription factor (TF) binding and histone modifications. ChIP-Seq data analysis involves multiple steps from read mapping and peak calling to data integration and interpretation. It remains challenging and time-consuming to process large amounts of ChIP-Seq data derived from different antibodies or experimental designs using the same approach. To address this challenge, there is a need for a comprehensive analysis pipeline with flexible settings to accelerate the utilization of this powerful technology in epigenetics research.
We have developed a highly integrative pipeline, termed HiChIP, for systematic analysis of ChIP-Seq data. HiChIP incorporates several open source software packages selected based on internal assessments and published comparisons. It also includes a set of tools developed in-house. This workflow enables the analysis of both paired-end and single-end ChIP-Seq reads, with or without replicates, for the characterization and annotation of both punctate and diffuse binding sites. The main functionality of HiChIP includes: (a) read quality checking; (b) read mapping and filtering; (c) peak calling and peak consistency analysis; and (d) result visualization. In addition, this pipeline contains modules for generating binding profiles over selected genomic features, de novo motif finding from TF binding sites and functional annotation of peak-associated genes.
HiChIP is a comprehensive analysis pipeline that can be configured to analyze ChIP-Seq data derived from varying antibodies and experimental designs. Using public ChIP-Seq data we demonstrate that HiChIP is a fast and reliable pipeline for processing large amounts of ChIP-Seq data.
Keywords: ChIP-Seq; Next-generation sequencing; Peak calling; Duplicate filtering; Irreproducible discovery rate
Chromatin immunoprecipitation (ChIP) coupled with next-generation sequencing (ChIP-Seq) represents a powerful approach to identify genome-wide occupancy of transcription factors (TFs) and histone tail modifications. The ENCODE and modENCODE consortia have generated an atlas of TF binding sites and histone modifications for 100+ cell types, including those from human and mouse.
ChIP-Seq data processing starts with the mapping of short reads to a genome reference. The mapped reads (alignments) are then used to generate signal tracks in a variety of formats (Wig, bigWig, bedGraph, or TDF) for data visualization. They are further used to identify regions showing significant enrichment over a control library, such as an IgG control generated using a non-specific IgG antibody, or an input control generated without an antibody. ChIP-Seq data shows three types of binding profiles: punctate binding, diffuse binding, and a mixture of both. Sequence-dependent TFs and some histone modifications (such as H3K4me3) usually exhibit punctate binding sites of a few hundred base pairs in size. Comparatively, some other histone modifications display broad binding profiles that could spread over several hundred kilobases, such as H3K9me3, known to be associated with constitutive heterochromatin, and H3K36me3, associated with transcribed regions. The signals from RNA polymerase II peak at the 5’ end of genes, and can extend over the body of transcribed genes, forming a mixture of sharp and diffuse binding profiles.
There are over thirty publicly available programs for peak calling. Most of them focus on punctate binding profiles using either window scanning or aggregation of overlapping reads to identify peaks. A subset of these programs has been extensively evaluated on their sensitivity and specificity [4, 5]. Due to the variation of signal intensity, signal discontinuity within an entire binding domain and insufficient sequencing depth, it has been challenging to define the boundary of diffuse binding domains at high resolution. Currently, only a few packages have been developed to analyze diffuse binding profiles [6–8]. Among them, SICER and RSEG are comparable for experiments with controls. SICER is among the most accurate programs for detecting broad binding regions from H3K36me3 data.
The ENCODE consortium recommends that ChIP-Seq experiments have two biological replicates in order to assess data reliability. Based on the previous guideline, a ChIP-Seq experiment is considered to be reproducible if at least 75% of the peaks overlap between replicates, or if the top 40% of the peaks show >80% overlap. A method called irreproducible discovery rate (IDR) has been developed, which measures the consistency between lists of ranked peaks from replicates. It represents a more robust and consistent approach to identify highly reproducible peaks.
Several packages have been developed for downstream analysis of identified peaks. The most common analyses include the assignment of peaks to gene bodies or gene regulatory domains; the generation of binding profiles over transcription start sites (TSSs) or other key genomic features [12–14]; the coverage of genomic features by peaks; the testing of functional enrichment for peak-associated genes; and motif finding. Of these, a peak is usually assigned to a nearby gene based on a pre-defined cutoff for the maximal distance from peak center to gene start, which typically ranges from 2 to 50 kb but can be as far as 1 Mb. This assignment introduces bias towards genes in closer vicinity of peaks and impacts subsequent tests for functional enrichment.
A few pipelines have been developed to analyze ChIP-Seq data [13, 14, 16–18]. ChIPpeakAnno and seqMINER focus on the integration of ChIP-Seq data with genomic features [13, 14]. On the other hand, Fish the ChIPs (FC) and a web server called Nebula support read mapping, peak calling for punctate binding events, assignment of peaks to genes and data visualization. However, none of them provides functionality for the filtering of mapped reads, the identification of broad binding domains, the assessment of reproducibility, or the analysis of paired-end data.
To address this shortfall, the Highly Integrative Chromatin Immunoprecipitation (HiChIP) pipeline provides comprehensive analysis of ChIP-Seq data. HiChIP has the following features: (a) the analysis of both paired-end and single-end data; (b) filtering of mapped reads based on duplicate level, mapping quality score, genomic uniqueness, insertion size and orientation (for paired-end reads only); (c) the selection of an appropriate peak finder based on binding profile, with MACS for punctate binding sites and SICER for broad binding domains; (d) the implementation of the IDR package to perform consistency analysis of punctate binding sites between replicates; and (e) downstream analysis, such as finding motif(s) from TF binding sites using the MEME suite, generating binding profiles over key genomic features and calculating coverage of genomic features by peaks using CEAS, as well as assigning peaks to genes and testing for gene ontology (GO) enrichment using in-house tools. The integrative analysis allows bioinformaticians and investigators to spend less time on low-level data analysis and instead focus on data integration and interpretation.
Since researchers may not always have immediate access to cluster resources, this pipeline allows either parallel processing of a large number of samples in a cluster or serial processing of multiple samples on a single machine. Detailed instructions on how to run the HiChIP pipeline and how to use individual tools are given in the user manual available at: http://bioinformaticstools.mayo.edu/. The user manual also lists a website containing the license agreements for each of the public tools.
To test HiChIP performance, we used five public datasets in human, including single-end ChIP-Seq datasets targeting TFs NFKB and ER and histone mark H3K27me3; a paired-end ChIP-Seq dataset targeting TF RUNX1; and an ER ChIP-chip dataset. Each of the ChIP-Seq datasets includes both IP and control.
The NFKB datasets are from cell lines GM12878 and GM12891; each with two replicates for both IP and control. The FASTQ sequence files were downloaded from: http://hgdownload.cse.ucsc.edu/goldenPath/hg18/encodeDCC/wgEncodeYaleChIPseq.
The ER ChIP-Seq datasets include 18 libraries from five cell lines (MCF-7, ZR75-1, T-47D, BT-474, and TAM-R). Each cell line had 2–3 replicates for IP and a single control. We downloaded the BWA aligned BAM files from the National Center for Biotechnology Information (NCBI) Gene Expression Omnibus (GEO) under accession GSE32222.
The RUNX1 dataset is from an acute myeloid leukemia patient with the t(8;21) translocation. The FASTQ files from one IP (GSM850826) and one control (GSM850828) were downloaded from NCBI GEO. Since the control library had only ~7.3 million pairs of reads, we downsampled the IP library from 18.8 million to 8 million pairs of reads.
The H3K27me3 datasets are from cell lines GM12878, HeLa S3 and MCF-7. GM12878 had two replicates in IP and one control library, while the other two cell lines each had two replicates for both IP and control. The FASTQ sequence files for GM12878 and HeLa S3 were downloaded from: http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeBroadHistone and those for MCF-7 were from: http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeSydhHistone.
The ER ChIP-chip data was generated from the MCF-7 cell line using the Affymetrix human tiling microarray. The dataset was downloaded from: http://research4.dfci.harvard.edu/brownlab//datasets/index.php?dir=ER_MCF7_whole_human_genome/.
Read quality assessment
FastQC is a fast and flexible package for checking overall sequence quality (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/). For each sample, FastQC reports the distribution of average per-base and per-read quality, as well as the level of duplication and possible sources of contamination. If the mapping results look abnormal, for example a low mapping rate, the user can review read quality in the FastQC reports and try to improve the mapping rate by trimming low-quality bases or adaptor sequences from the reads.
Read mapping algorithms
Several mapping software packages have been developed to map short reads to the reference genome. BWA is a robust and fast short-read aligner, and has been widely used to map ChIP-Seq reads. Novoalign (http://www.novocraft.com/main/index.php) is slower than BWA but is known to have higher sensitivity. To decide which one to implement in the pipeline, we compared the mapping rate between Novoalign and BWA on both single-end and paired-end ChIP-Seq data, and further assessed how the mapping difference might impact peak calling.
Post-processing of mapped reads
After initial alignment, the mapped reads need to be further processed in order to improve peak calling sensitivity and specificity. The post-processing steps below address the issues of poorly mapped reads, duplicate reads and reads mapping to multiple locations.
Reads with low mapping quality
It is a common practice to remove reads with low mapping quality. For single-end reads, HiChIP uses samtools to filter out reads based on a user-defined mapping quality score threshold (default: 20). Mapped paired-end reads have three mapping states: both ends uniquely mapped; one of the ends uniquely mapped; both ends mapping to multiple locations (both have a zero mapping quality score). Samtools does not maintain the pairing information when performing mapping quality-based filtering for paired-end reads. Therefore, we provide a script to remove pairs of reads that have one or two ends below the mapping quality cutoff set by the user. The user can choose not to apply this filtering to the pairs of reads with the two ends mapping to multiple genomic locations (both have a quality score of “0” set by BWA). After the filtering, the proper pairing information will still be maintained.
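The pair-level rule described above can be sketched in Python as follows. This is only an illustration and not the HiChIP script itself: the function name, the tuple layout `(name, mapq_end1, mapq_end2)` and the `keep_multi_pairs` switch are assumptions made for the example.

```python
def filter_pairs_by_mapq(pairs, min_mapq=20, keep_multi_pairs=False):
    """Keep read pairs whose two ends both pass the mapping quality cutoff.

    pairs: iterable of (name, mapq_end1, mapq_end2) tuples (hypothetical layout).
    keep_multi_pairs: if True, pairs with both ends at MAPQ 0 (both ends
    mapping to multiple locations) are exempt from the filter.
    """
    kept = []
    for name, q1, q2 in pairs:
        if q1 >= min_mapq and q2 >= min_mapq:
            # Both ends pass, so the pair is kept as a unit.
            kept.append((name, q1, q2))
        elif keep_multi_pairs and q1 == 0 and q2 == 0:
            # Optional exemption for pairs with two multi-mapped ends.
            kept.append((name, q1, q2))
        # Otherwise the whole pair is dropped, never a single end,
        # so no orphaned mates remain after filtering.
    return kept
```

Because each decision is made per pair rather than per read, the output never contains one mate without the other.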
For ChIP experiments, the sequencing library is mostly generated from a much smaller amount of DNA compared to standard DNA or RNA sequencing. Duplicate reads that map to the same genomic location and strand are frequently present in ChIP-Seq datasets. For many applications, duplicate reads are removed as they are considered likely to represent experimental artifacts. However, in the context of a ChIP-Seq experiment duplicate reads can also occur during the sequencing of identical DNA fragments in peak regions. In this case, duplicate reads contribute to peak identification and should not be removed.
Chen et al. reported that duplicate removal could improve the specificity of MACS peak calling. Since the level of duplicate reads as artifacts versus as true signals cannot be well defined, Picard (http://picard.sourceforge.net/) is included in HiChIP to remove duplicate reads by default. The user can specify whether duplicate reads are removed. To reflect the level of duplicate reads, HiChIP uses a custom script to measure library complexity as the ratio between the number of duplicate-filtered reads and the total number of uniquely mapped reads. As a guideline, library complexity needs to reach ~0.8 at a sequencing depth of 10 million mapped reads. Low library complexity suggests suboptimal immunoprecipitation efficiency, a lack of sufficient starting material, PCR over-amplification, or a combination of these factors.
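As a small worked illustration of the library complexity ratio defined above, the following Python sketch computes it from two read counts. The function name and the example counts are hypothetical; only the ratio itself comes from the text.

```python
def library_complexity(n_duplicate_filtered, n_uniquely_mapped):
    """Ratio of duplicate-filtered reads to all uniquely mapped reads."""
    if n_uniquely_mapped <= 0:
        raise ValueError("n_uniquely_mapped must be positive")
    return n_duplicate_filtered / n_uniquely_mapped


# Hypothetical library: 10 million uniquely mapped reads,
# of which 8 million remain after duplicate removal.
complexity = library_complexity(8_000_000, 10_000_000)
print(round(complexity, 2))
```

A library with these counts would sit right at the ~0.8 guideline value.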
Reads mapping to multiple genomic locations
In ChIP-Seq analysis, reads mapping to multiple genomic locations are often discarded. Depending upon the nature of the studied epigenetic mark, this strategy may not be optimal in some cases. For instance, a substantial fraction of the H3K9me3 modification occurs in regions containing repetitive DNA sequences. In a survey of 12 H3K9me3 ChIP-Seq datasets from the ENCODE project (http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/), between 16% and 28% of the mapped reads have multiple matches in the genome. It has also been shown that some TF binding sites are located in regions with poor mappability. In such cases, excluding reads mapping to multiple genomic locations will decrease the sensitivity of peak detection in these less mappable regions.
HiChIP allows the user to specify whether to filter out reads matching multiple locations. For single-end reads, only uniquely mapped reads are kept by default, with the option to include one random match for reads mapping to multiple locations. For paired-end reads, we developed an in-house script to filter out undesired pairs. Only mapped pairs with appropriate insertion sizes and correct orientation are kept. Depending on the user’s specification, these reads are further processed to retain pairs belonging to one of the three types: (a) only uniquely mapped pairs; (b) pairs with at least one uniquely mapped end; or (c) pairs with at least one uniquely mapped end, plus a random match if both ends align to multiple locations. No currently available public tool provides equivalent flexibility in the filtering of mapped paired-end ChIP-Seq reads.
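The three retention modes for paired-end reads can be summarized with a short Python sketch. It is a simplified illustration under stated assumptions: the function name, the boolean inputs and the mode labels "a", "b" and "c" are hypothetical, and for mode (c) the random placement of pairs with two multi-mapped ends is assumed to have been made already by the aligner.

```python
def keep_pair(end1_unique, end2_unique, proper, mode):
    """Decide whether a mapped read pair is retained.

    end1_unique, end2_unique: True if that end maps to a single location.
    proper: True if insertion size and orientation are acceptable.
    mode: "a" = only uniquely mapped pairs,
          "b" = at least one uniquely mapped end,
          "c" = as "b", plus pairs with both ends multi-mapped
                (one random match assumed to be reported per pair).
    """
    if not proper:
        # Pairs with wrong insertion size or orientation are always dropped.
        return False
    n_unique = int(end1_unique) + int(end2_unique)
    if mode == "a":
        return n_unique == 2
    if mode == "b":
        return n_unique >= 1
    if mode == "c":
        return True
    raise ValueError("mode must be 'a', 'b' or 'c'")
```

Under this reading, the modes are nested: every pair kept in mode (a) is kept in (b), and every pair kept in (b) is kept in (c).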
There are two major ChIP-Seq binding profiles: punctate binding and diffuse binding. For punctate binding sites, peak calling identifies locations with maximum read density. For diffuse binding sites, the main goal is to define the boundary of individual binding domains. Therefore, different peak callers need to be used to take into account the differences in binding profile.
We used MACS to identify punctate binding sites because of its high specificity and sensitivity [4, 5, 10]. MACS scans the genome for candidate regions and merges overlapping regions into peaks. It captures local signal fluctuation by modeling the background level as a dynamic Poisson distribution.
The consistency of identified peaks can be assessed for punctate binding sites where replicates are available. In HiChIP, we implemented the IDR method to measure the consistency of peaks between replicates. To prepare for IDR analysis, HiChIP combines mapped reads from IP replicates and those from control replicates into two merged datasets following the procedure proposed by Landt et al. The merged IP dataset is then split into two equally-sized pseudoreplicates after randomization; mapped reads from each IP are also split into two equally-sized pseudoreplicates. The true IP replicates, the pseudoreplicates from each IP and from the merged IP dataset, the merged IP dataset and the merged control dataset are then used in MACS peak calling. The consistency of resulting peaks is analyzed using the IDR procedure.
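The pseudoreplicate step (randomize, then split into two equal halves) can be sketched in Python as below. This is a minimal in-memory illustration, not the pipeline's implementation: the function name and the `seed` parameter are assumptions, and dropping one leftover read when the count is odd is one possible way to keep the halves equally sized.

```python
import random


def make_pseudoreplicates(reads, seed=0):
    """Shuffle reads and split them into two equally sized pseudoreplicates.

    reads: any iterable of read records (hypothetical in-memory input).
    seed: fixed seed so that the split is repeatable.
    """
    shuffled = list(reads)
    random.Random(seed).shuffle(shuffled)
    half = len(shuffled) // 2
    # With an odd number of reads, the last shuffled read is left out
    # so that both pseudoreplicates have exactly the same size.
    return shuffled[:half], shuffled[half:2 * half]
```

The same function would apply both to the merged IP dataset and to each individual IP replicate.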
As suggested by Landt et al., if a ChIP-Seq experiment has good reproducibility, the number of consistent peaks between true biological replicates and that between pseudoreplicates from merged IP should not differ by more than a factor of two. Similarly, the number of consistent peaks between pseudoreplicates from biological replicate 1 and that between pseudoreplicates from biological replicate 2 should also be within a factor of two. We added the IDR values (<1) estimated for shared peaks to the 4th column of the MACS output file with the ‘encodePeak’ extension. For replicate-specific peaks, an arbitrary value of ‘1’ is used instead. This allows easy extraction of consistent peaks at any user-specified IDR cutoff.
To identify diffuse binding sites, HiChIP leverages a widely used program called SICER. SICER uses a clustering approach to define the boundary of diffuse binding sites. It identifies candidate sites of variable lengths based on a Poisson background model and links neighboring sites together if they are separated by gaps not exceeding a pre-defined gap size cutoff (gap size is 600 bp by default) and the whole domain is significantly enriched over the control. SICER itself only provides filtering of binding regions based on an FDR cutoff but not on fold change over the control. HiChIP further filters out candidate regions if the fold change is less than two.
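The two checks described above, the factor-of-two comparison and the extraction of consistent peaks by IDR cutoff, can be sketched in Python as follows. The function names are hypothetical, rows are assumed to be already split into columns, and treating a peak exactly at the cutoff as consistent is an assumption of this sketch.

```python
def within_factor_of_two(n_peaks_a, n_peaks_b):
    """True if two consistent-peak counts differ by no more than 2-fold."""
    low, high = sorted((n_peaks_a, n_peaks_b))
    return low > 0 and high / low <= 2


def consistent_peaks(rows, idr_cutoff=0.01):
    """Select peaks whose IDR value passes the cutoff.

    rows: peak records as lists of column values, with the IDR value in
    the 4th column (index 3); replicate-specific peaks carry the value 1
    and are therefore excluded by any cutoff below 1.
    """
    return [row for row in rows if float(row[3]) <= idr_cutoff]
```

Because replicate-specific peaks are stored with a value of 1, a single numeric comparison on one column is enough to separate them from shared peaks.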
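The gap-based linking and the additional fold-change filter can be illustrated with a deliberately simplified Python sketch. It is not SICER's algorithm: the Poisson scoring and the significance test against the control are omitted, and the function names and tuple layouts are assumptions made for the example.

```python
def link_sites(sites, max_gap=600):
    """Merge candidate sites on one chromosome separated by <= max_gap bp.

    sites: list of (start, end) intervals; they are sorted before merging.
    """
    merged = []
    for start, end in sorted(sites):
        if merged and start - merged[-1][1] <= max_gap:
            # Gap to the previous domain is small enough: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def filter_by_fold_change(domains, min_fold_change=2.0):
    """Keep (start, end, fold_change) domains with fold change >= cutoff."""
    return [d for d in domains if d[2] >= min_fold_change]
```

For example, sites at 0–200 and 600–800 are separated by a 400 bp gap and would be linked into one domain under the default gap size, whereas a site 1,200 bp further downstream would stay separate.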
Putative cis-regulated genes
After peak calling, potential cis-regulated genes associated with peaks are identified based on the maximum distance of peaks to the transcription start sites (TSSs) or transcription end sites (TESs). By default, this distance is set at 10 kilobases.
To enable visual inspection of discovered binding sites and their association with annotated genes or other genomic features, HiChIP generates files that can be visualized in a genome browser such as the Integrative Genomics Viewer (IGV). CEAS needs a Wig file as an input. Since MACS version 2 does not generate a Wig file and SICER generates a Wig file with a relatively large span size (200 bp), we designed a module to generate bedGraph, Wig and tiled data format (TDF) files for data visualization.
To generate the bedGraph file, filtered reads in BAM format are first processed into bed format as follows. Single-end reads are extended by the average fragment length of the library (default 200 bp). For paired-end reads, the HiChIP pipeline keeps the first end and extends by the fragment length estimated from mapping positions of the two ends, rather than by the average fragment length of the library. Given the variability of fragment lengths across a complex genome such as the human genome, the use of actual coordinates of mapped pairs is expected to achieve better resolution in signal visualization. The bed file is then used to generate a bedGraph file by the genomeCoverageBed command from BEDTools.
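A distance-based assignment of this kind can be sketched in Python as below. The sketch assumes that the distance is measured from the peak center and that a gene qualifies if either its TSS or its TES lies within the cutoff; both points, along with the function name and tuple layouts, are assumptions for illustration rather than a description of the HiChIP tool.

```python
def associated_genes(peak, genes, max_dist=10_000):
    """List genes whose TSS or TES lies within max_dist of the peak center.

    peak: (chrom, start, end)
    genes: list of (name, chrom, tss, tes) tuples (hypothetical layout).
    Returns (name, distance) tuples sorted by increasing distance.
    """
    chrom, start, end = peak
    center = (start + end) // 2
    hits = []
    for name, gene_chrom, tss, tes in genes:
        if gene_chrom != chrom:
            continue
        dist = min(abs(center - tss), abs(center - tes))
        if dist <= max_dist:
            hits.append((name, dist))
    return sorted(hits, key=lambda hit: hit[1])
```

A peak can be associated with several genes, or with none, depending on gene density around it.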
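The single-end extension step can be illustrated with the following Python sketch. It assumes BED-style 0-based, half-open coordinates and interprets the extension as stretching each read from its 5' end to the full fragment length in the direction of its strand; the function name and the optional `chrom_size` clamp are assumptions made for the example.

```python
def extend_read(chrom, start, end, strand, frag_len=200, chrom_size=None):
    """Extend a mapped read to the fragment length along its strand.

    Coordinates are 0-based, half-open (BED convention).
    """
    if strand == "+":
        # The 5' end is the leftmost base; extend to the right.
        new_start, new_end = start, start + frag_len
    elif strand == "-":
        # The 5' end is the rightmost base; extend to the left.
        new_start, new_end = end - frag_len, end
    else:
        raise ValueError("strand must be '+' or '-'")
    new_start = max(0, new_start)
    if chrom_size is not None:
        new_end = min(new_end, chrom_size)
    return chrom, new_start, new_end, strand
```

For paired-end data the same idea applies, except that `frag_len` would be replaced by the distance spanned by the two mapped ends of each pair.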
The Wig file is generated from the bedGraph file, using an in-house script that computes the extended read coverage at a user-defined step size (default: 20 bp). The extended read coverage is normalized to a library size of one million mapped reads, and converted into the TDF format using the toTDF command from the igvtools package (http://www.broadinstitute.org/software/igv/igvtools). The normalized coverage in TDF format and identified peaks in bed format can be visualized by uploading files to IGV, or by opening the provided igv_session.xml file in IGV.
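The normalization and step-wise summarization can be sketched in Python as follows. How the in-house script summarizes coverage within each step is not stated here, so averaging per-base coverage within each bin is an assumption of this sketch, as are the function names.

```python
def normalize_per_million(coverage, n_mapped_reads):
    """Scale coverage values to a library size of one million mapped reads."""
    if n_mapped_reads <= 0:
        raise ValueError("n_mapped_reads must be positive")
    return [value * 1_000_000 / n_mapped_reads for value in coverage]


def bin_coverage(per_base, step=20):
    """Average per-base coverage in consecutive bins of `step` bp."""
    binned = []
    for i in range(0, len(per_base), step):
        window = per_base[i:i + step]
        # The last window may be shorter than `step`.
        binned.append(sum(window) / len(window))
    return binned
```

With 20 million mapped reads, for instance, a raw coverage of 10 corresponds to a normalized value of 0.5, which makes tracks from libraries of different depth directly comparable.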
Peak and binding profile annotation module
HiChIP includes three tools to annotate peaks and binding profiles. We use MEME for identifying the TF binding motif; CEAS (Cis-regulatory Element Annotation System) for generating binding profiles over key genomic features and for predicting possible genes regulated by cis-regulatory elements; and an in-house tool for calculating enrichment in gene ontology (GO) terms for peak-associated genes.
HiChIP selects top peaks as input for CEAS and MEME. Peaks used by CEAS are selected based on the pre-defined –log10 (p value) (for MACS peaks) or –log10 (FDR) cutoff (for SICER peaks). Since the detection of binding motif(s) using MEME is dependent upon the set of DNA sequences provided, attention needs to be paid to the cutoff for peak selection. By default the top 10% of peaks with the largest –log10 (p value) will be used. The HiChIP pipeline also allows the user to select a certain number of top peaks for motif discovery. CEAS uses normalized Wig files and peak files (bed format) as inputs, and performs a binomial test for enrichment of binding over genomic regions such as gene promoters, gene bodies, exons and introns.
An in-house method is implemented to identify GO terms that are enriched in peak-associated genes. This method uses an approach similar to GREAT, which could not be integrated into our workflow since its main functionality is only available through web services. The lists of human and mouse genes with annotated GO terms were downloaded from the GREAT website (http://bejerano.stanford.edu/help/display/GREAT/Genes). For each gene annotated with at least one ontology term, the HiChIP pipeline first defines its regulatory domain as the region from upstream (U + UE) bp to downstream (D + DE) bp around the TSS, where the region from upstream U bp (default 5000) to downstream D bp (default 1000) represents the proximal regulatory domain, and UE and DE denote the maximum upstream and downstream extension, respectively. Binomial tests are then performed to identify a list of GO terms that are enriched in genes associated with peaks.
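The default selection of peaks for motif discovery can be sketched in Python as below. The function name, the `(name, neg_log10_p)` tuple layout and the rule of always returning at least one peak for a non-empty input are assumptions made for the example.

```python
def top_peaks(peaks, percent=10, n=None):
    """Select the most significant peaks for motif discovery.

    peaks: list of (name, neg_log10_p) tuples (hypothetical layout).
    percent: share of peaks to keep when n is not given (default 10).
    n: optional fixed number of top peaks, overriding `percent`.
    """
    ranked = sorted(peaks, key=lambda peak: peak[1], reverse=True)
    if n is None:
        # Integer arithmetic avoids floating-point rounding surprises.
        n = max(1, len(ranked) * percent // 100)
    return ranked[:n]
```

Passing `n` corresponds to the option of choosing a fixed number of top peaks instead of a percentage.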
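The regulatory domain definition and a binomial enrichment test can be sketched in Python as follows. This is an illustration under assumptions, not the in-house tool: the domain is taken at its maximal extent without regard to neighboring genes, "upstream" is interpreted relative to gene strand, and the test follows a GREAT-style formulation in which `p` is the fraction of the genome covered by the regulatory domains of genes carrying a given GO term. All function and parameter names are hypothetical.

```python
import math


def regulatory_domain(tss, strand, ue, de, u=5000, d=1000):
    """Return (start, end) of the maximal regulatory domain around a TSS.

    u, d: proximal upstream/downstream sizes (defaults from the text).
    ue, de: maximum upstream/downstream extensions (user-supplied).
    """
    upstream, downstream = u + ue, d + de
    if strand == "+":
        return max(0, tss - upstream), tss + downstream
    if strand == "-":
        # On the minus strand, upstream lies at higher coordinates.
        return max(0, tss - downstream), tss + upstream
    raise ValueError("strand must be '+' or '-'")


def binomial_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p).

    n: total number of peaks; k: peaks falling in the term's domains;
    p: genome fraction covered by those domains (assumed formulation).
    """
    return sum(math.comb(n, i) * p ** i * (1 - p) ** (n - i)
               for i in range(k, n + 1))
```

A small upper-tail probability indicates that more peaks fall in the regulatory domains of a term's genes than expected from the genomic space those domains occupy.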
Results and discussion
Performance and output summary
We tested the pipeline performance on a Linux platform with an 8-core GenuineIntel CPU at 2.66 GHz. For a typical ChIP-Seq dataset containing a single IP and control library, each with 20–50 million pairs of reads, HiChIP takes 6–14 hours to complete with ~5–8 GB of memory usage.
The summary report provides links to the FastQC output files and an igv_session.xml file for data visualization. It also contains an html document that covers sample information, mapping summary, library complexity (Additional file 1: Table S1), peak summary, as well as histograms showing read pileup distribution within peaks. Depending on the user’s specification for peak calling, the pipeline will generate a list of peaks from MACS, MACS combined with IDR analysis, or SICER.
To help with peak interpretation, the HiChIP pipeline generates a table that reports the closest genes (peak_vs_gene.xls). In addition, CEAS provides a report summarizing the percentage of peaks located in different regions such as promoters, gene upstream and downstream regions, and UTRs, and provides plots showing binding profiles over selected genomic features (Additional file 2). MEME creates an html file that contains the most significant motif(s), and a text file with names of individual sequences from peak regions that contain a motif. Finally, the internally-developed GO enrichment test identifies the most significant terms enriched for peak-associated genes (Additional file 1: Table S4). We have included a Word document that describes the individual output files (HiCHIP_workflow_summary.doc).
Comparison of BWA and Novoalign mapping
Table: Number of NFKB peaks from BWA and Novoalign mapped reads (row labels: unique peaks w/ motif; unique peaks w/o motif)
BWA and Novoalign mapping of paired-end reads
Table: RUNX1 peaks from BWA and Novoalign mapped reads (row labels: unique peaks w/ motif; unique peaks w/o motif)
Comparison of ChIP-Seq and ChIP-chip peaks
ER ChIP-Seq and ChIP-chip peaks in MCF-7 cell line
The impact of duplicate removal on TF peak calling
Table: Number of ER peaks from MCF-7 cell line (labels: IP_1 vs. control; IP_2 vs. control; IP_3 vs. control; unique reads (million); unique peaks w/ motif; unique peaks w/o motif)
Duplicate level in three ER ChIP-Seq libraries
To test whether the abundance of duplicates is correlated with the confidence level of peaks, for each library we split the p-value-sorted peaks into 10 equal-sized groups, with peaks in the first group having the lowest p values. The top 10% of the peaks (in the first group) had a duplication rate between 42.8% and 57%, containing roughly half of the total duplicates (Table 6; Figure 3). In contrast, the bottom 70% of the peaks (groups 4 to 10) had a much lower duplication rate (Figure 3). Our analysis suggested that, while filtering of duplicates contributes to the identification of extra peaks, it reduces the signal intensity to a much greater extent for the most significant peaks. The latter will impact the test for differential binding between different IPs.
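The grouping used in this analysis can be sketched in Python as follows. The function names and the `(p_value, total_reads, duplicate_reads)` record layout are hypothetical, and placing any leftover peaks in the first groups when the count is not divisible by 10 is an assumption of this sketch.

```python
def split_into_groups(sorted_peaks, n_groups=10):
    """Split p-value-sorted peaks into n_groups of (nearly) equal size."""
    size, extra = divmod(len(sorted_peaks), n_groups)
    groups, start = [], 0
    for g in range(n_groups):
        # The first `extra` groups receive one additional peak.
        stop = start + size + (1 if g < extra else 0)
        groups.append(sorted_peaks[start:stop])
        start = stop
    return groups


def duplication_rate(group):
    """Fraction of duplicate reads among all reads in a group of peaks.

    group: list of (p_value, total_reads, duplicate_reads) tuples.
    """
    total = sum(peak[1] for peak in group)
    duplicates = sum(peak[2] for peak in group)
    return duplicates / total if total else 0.0
```

Applying `duplication_rate` to each group in turn gives one value per decile, which is the quantity compared between the most and least significant peaks.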
We further used the TF RUNX1 dataset to investigate how duplicate removal might impact peak calling from paired-end data. The RUNX1 IP had up to 67% duplicates identified by the Picard MarkDuplicates command. We used MACS to call 931 peaks (Table 3) from duplicate-filtered reads and 16 times more peaks (14,916) from reads prior to duplicate removal. Of the 14,013 peaks not overlapping the 931 peaks, only 73 (0.5%) contained the RUNX1 binding motif. This suggests that the vast majority of the 14,013 unique peaks from reads without duplicate removal represent false positives. This result, together with the ER ChIP-Seq results, supports duplicate removal when analyzing TF ChIP-Seq data.
The impact of duplicate removal on H3K27me3 peak calling
We tested IDR analysis using three ER ChIP-Seq libraries, which include two biological replicates for IP (IP_1 and IP_2) and a single control (Tables 4 and 5). To call both significant and insignificant peaks in order to identify an appropriate IDR cutoff, we used a less stringent p value cutoff (1e-3) in MACS peak calling. When plotting the number of reproducible peaks over different IDR values, a clear transition was observed from highly reproducible peaks to poorly reproducible peaks (Figure 2). At the IDR cutoff of 0.01 (default), there were 22,382 and 14,286 consistent peaks between the two pseudoreplicates of IP_1 and IP_2, respectively, with a ratio of 1.6 (22382/14286). We identified 26,971 consistent peaks between two pseudoreplicates from merged IP and 21,224 consistent peaks between replicates IP_1 and IP_2, with a ratio of 1.3. In both cases, the ratio is less than 2, indicating good reproducibility between IP_1 and IP_2 (Figure 2; Additional file 1: Table S3). If two replicates show poor reproducibility (ratio >2), then it is necessary to generate a third replicate to validate the reliability of identified peaks.
HiChIP is a comprehensive ChIP-Seq data analysis pipeline with more than 10 functions (Figure 1). It performs read mapping, peak calling for punctate and diffuse binding sites and downstream functional analysis. To enhance the quality of peak calling, HiChIP includes options for filtering out less reliably mapped reads to reduce noise. It also includes IDR analysis to identify a list of reproducible peaks between replicates. It provides a consistent and configurable method to assist the user in running this pipeline.
By applying HiChIP to publicly available single-end ER ChIP-Seq datasets we found that filtering of duplicates increases the sensitivity of MACS peak calling but heavily underestimates enrichment levels for the most significant peaks. For the paired-end RUNX1 ChIP-Seq data, the vast majority of the peaks called only from reads without duplicate removal represent false positives. These results suggest the necessity of enabling duplicate filtering for TF peak calling and using all mapped reads for estimating enrichment level and identifying differential binding sites. In contrast, duplicate filtering has less impact on peak calling from marks showing broad binding profile like H3K27me3.
Although HiChIP has combined several methods to enhance the preprocessing and annotation of ChIP-Seq data, several challenges remain that need to be addressed in the future. For example, it is still difficult to define the boundary of diffuse binding sites at high resolution and to identify the direct target genes of TF binding sites and histone modifications. As new or improved methods become available, the modular design of HiChIP will enable their smooth integration into the existing pipeline.
Availability and requirements
Project name: HiChIP: A high-throughput pipeline for integrative analysis of ChIP-Seq data
Project home page: http://bioinformaticstools.mayo.edu/
Operating system: 64-bit Linux (the program has been tested on CentOS)
Programming languages: Shell, Perl and R
JAVA version 1.6.0_17 or higher
Perl version 5.10.0 or higher
Python version 2.7 or higher
Cython and Numpy python modules
R version 2.14.0 or higher
FastQC version 0.10 or higher
BWA version 0.5.9 or higher
MACS version 2.0.10 or higher
SICER version 1.1
IGVTools version 2.3.16
Samtools version 0.1.19
MEME version 4.8.1
CEAS version 1.0.2
Picard version 1.97
BEDTools version 2.17.0
We thank Mona Branstad for editing the manuscript. This work was supported by the Center for Individualized Medicine, Mayo Clinic, Rochester, MN 55905.
- Furey TS: ChIP-seq and beyond: new and improved methodologies to detect and characterize protein-DNA interactions. Nat Rev Genet. 2012, 13 (12): 840-852.View ArticlePubMed CentralPubMedGoogle Scholar
- Landt SG, Marinov GK, Kundaje A, Kheradpour P, Pauli F, Batzoglou S, Bernstein BE, Bickel P, Brown JB, Cayting P, Chen Y, DeSalvo G, Epstein C, Fisher-Aylor KI, Euskirchen G, Gerstein M, Gertz J, Hartemink AJ, Hoffman MM, Iyer VR, Jung YL, Karmakar S, Kellis M, Kharchenko PV, Li Q, Liu T, Liu XS, Ma L, Milosavljevic A, Myers RM, et al: ChIP-seq guidelines and practices of the ENCODE and modENCODE consortia. Genome Res. 2012, 22 (9): 1813-1831.View ArticlePubMed CentralPubMedGoogle Scholar
- Pepke S, Wold B, Mortazavi A: Computation for ChIP-seq and RNA-seq studies. Nat Methods. 2009, 6 (11 Suppl): S22-S32.View ArticlePubMed CentralPubMedGoogle Scholar
- Wilbanks EG, Facciotti MT: Evaluation of algorithm performance in ChIP-seq peak detection. PLoS One. 2010, 5 (7): e11471-View ArticlePubMed CentralPubMedGoogle Scholar
- Chen Y, Negre N, Li Q, Mieczkowska JO, Slattery M, Liu T, Zhang Y, Kim TK, He HH, Zieba J, Ruan Y, Bickel PJ, Myers RM, Wold BJ, White KP, Lieb JD, Liu XS: Systematic evaluation of factors influencing ChIP-seq fidelity. Nat Methods. 2012, 9 (6): 609-614.View ArticlePubMed CentralPubMedGoogle Scholar
- Zang C, Schones DE, Zeng C, Cui K, Zhao K, Peng W: A clustering approach for identification of enriched domains from histone modification ChIP-Seq data. Bioinformatics. 2009, 25 (15): 1952-1958.View ArticlePubMed CentralPubMedGoogle Scholar
- Wang J, Lunyak VV, Jordan IK: BroadPeak: a novel algorithm for identifying broad peaks in diffuse ChIP-seq datasets. Bioinformatics. 2013, 29 (4): 492-493.View ArticlePubMedGoogle Scholar
- Song Q, Smith AD: Identifying dispersed epigenomic domains from ChIP-Seq data. Bioinformatics (Oxford, England). 2011, 27 (6): 870-871.View ArticleGoogle Scholar
- Kumar V, Muratani M, Rayan NA, Kraus P, Lufkin T, Ng HH, Prabhakar S: Uniform, optimal signal processing of mapped deep-sequencing data. Nat Biotechnol. 2013, 31 (7): 615-622.View ArticlePubMedGoogle Scholar
- Li Q, Brown JB, Huang H, Bickel PJ: Measuring reproducibility of high-throughput experiments. Ann Appl Stat. 2011, 5 (3): 1752-1779.
- McLean CY, Bristor D, Hiller M, Clarke SL, Schaar BT, Lowe CB, Wenger AM, Bejerano G: GREAT improves functional interpretation of cis-regulatory regions. Nat Biotechnol. 2010, 28 (5): 495-501.
- Shin H, Liu T, Manrai AK, Liu XS: CEAS: cis-regulatory element annotation system. Bioinformatics. 2009, 25 (19): 2605-2606.
- Ye T, Krebs AR, Choukrallah MA, Keime C, Plewniak F, Davidson I, Tora L: seqMINER: an integrated ChIP-seq data interpretation platform. Nucleic Acids Res. 2011, 39 (6): e35.
- Zhu LJ, Gazin C, Lawson ND, Pages H, Lin SM, Lapointe DS, Green MR: ChIPpeakAnno: a Bioconductor package to annotate ChIP-seq and ChIP-chip data. BMC Bioinformatics. 2010, 11: 237.
- Bailey TL, Boden M, Buske FA, Frith M, Grant CE, Clementi L, Ren J, Li WW, Noble WS: MEME SUITE: tools for motif discovery and searching. Nucleic Acids Res. 2009, 37 (Web Server issue): W202-W208.
- Bardet AF, He Q, Zeitlinger J, Stark A: A computational pipeline for comparative ChIP-seq analyses. Nat Protoc. 2012, 7 (1): 45-61.
- Boeva V, Lermine A, Barette C, Guillouf C, Barillot E: Nebula–a web-server for advanced ChIP-seq data analysis. Bioinformatics. 2012, 28 (19): 2517-2519.
- Mercier E, Droit A, Li L, Robertson G, Zhang X, Gottardo R: An integrated pipeline for the genome-wide analysis of transcription factor binding sites from ChIP-Seq. PLoS One. 2011, 6 (2): e16432.
- Barozzi I, Termanini A, Minucci S, Natoli G: Fish the ChIPs: a pipeline for automated genomic annotation of ChIP-Seq data. Biol Direct. 2011, 6: 51.
- Zhang Y, Liu T, Meyer CA, Eeckhoute J, Johnson DS, Bernstein BE, Nusbaum C, Myers RM, Brown M, Li W, Liu XS: Model-based analysis of ChIP-Seq (MACS). Genome Biol. 2008, 9 (9): R137.
- Ross-Innes CS, Stark R, Teschendorff AE, Holmes KA, Ali HR, Dunning MJ, Brown GD, Gojis O, Ellis IO, Green AR, Ali S, Chin SF, Palmieri C, Caldas C, Carroll JS: Differential oestrogen receptor binding is associated with clinical outcome in breast cancer. Nature. 2012, 481 (7381): 389-393.
- Ptasinska A, Assi SA, Mannari D, James SR, Williamson D, Dunne J, Hoogenkamp M, Wu M, Care M, McNeill H, Cauchy P, Cullen M, Tooze RM, Tenen DG, Young BD, Cockerill PN, Westhead DR, Heidenreich O, Bonifer C: Depletion of RUNX1/ETO in t(8;21) AML cells leads to genome-wide changes in chromatin structure and transcription factor binding. Leukemia. 2012, 26 (8): 1829-1841.
- Lunter G, Goodson M: Stampy: a statistical algorithm for sensitive and fast mapping of Illumina sequence reads. Genome Res. 2011, 21 (6): 936-939.
- Li H, Durbin R: Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics. 2009, 25 (14): 1754-1760.
- Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R: The sequence alignment/Map format and SAMtools. Bioinformatics. 2009, 25 (16): 2078-2079.
- Chung D, Kuan PF, Li B, Sanalkumar R, Liang K, Bresnick EH, Dewey C, Keles S: Discovering transcription factor binding sites in highly repetitive regions of genomes with multi-read analysis of ChIP-Seq data. PLoS Comput Biol. 2011, 7 (7): e1002111.
- Robinson JT, Thorvaldsdottir H, Winckler W, Guttman M, Lander ES, Getz G, Mesirov JP: Integrative genomics viewer. Nat Biotechnol. 2011, 29 (1): 24-26.
- Quinlan AR, Hall IM: BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics. 2010, 26 (6): 841-842.
- Carroll JS, Meyer CA, Song J, Li W, Geistlinger TR, Eeckhoute J, Brodsky AS, Keeton EK, Fertuck KC, Hall GF, Wang Q, Bekiranov S, Sementchenko V, Fox EA, Silver PA, Gingeras TR, Liu XS, Brown M: Genome-wide analysis of estrogen receptor binding sites. Nat Genet. 2006, 38 (11): 1289-1297.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.