This ChIP-Seq analysis pipeline has been developed by integrating public packages with internally developed tools. It is intended for research purposes only. The core functions include: (a) read quality assessment and read mapping; (b) filtering of mapped reads and estimation of library complexity; (c) peak calling and identification of highly consistent peaks between replicates; (d) signal intensity estimation, normalization and visualization; and (e) annotation of peaks and binding profiles (Figure 1).
Since researchers may not always have immediate access to cluster resources, this pipeline allows either parallel processing of a large number of samples in a cluster or serial processing of multiple samples on a single machine. Detailed instructions about how to run HiChIP pipeline and how to use individual tools are described in the user manual available at: http://bioinformaticstools.mayo.edu/. Website containing license agreements for each of the public tools is also provided in the user manual.
Test datasets
To test HiChIP performance, we used five public datasets in human, including single-end ChIP-Seq datasets targeting TFs NFKB and ER and histone mark H3K27me3; a paired-end ChIP-Seq dataset targeting TF RUNX1; and an ER chip-chip dataset. Each of the ChIP-Seq datasets includes both IP and control.
The NFKB datasets are from cell lines GM12878 and GM12891; each with two replicates for both IP and control. The FASTQ sequence files were downloaded from: http://hgdownload.cse.ucsc.edu/goldenPath/hg18/encodeDCC/wgEncodeYaleChIPseq.
The ER ChIP-Seq datasets include 18 libraries from five cell lines (MCF-7, ZR75-1, T-47D, BT-474, and TAM-R). Each cell line had 2–3 replicates for IP and a single control. We downloaded the BWA aligned BAM files from National Center for Biotechnology Information (NCBI) Gene Expression Omnibus (GEO) under accession GSE32222 [21].
The RUNX1 dataset is from an acute myeloid leukemia patient with the t(8;21) translocation [22]. The FASTQ files from one IP (GSM850826) and one control (GSM850828) were downloaded from NCBI GEO. Since the control library had only ~7.3 million pairs of reads, we downsized the total 18.8 million to 8 million pairs for the IP library.
The H3K27me3 datasets are from cell lines GM12878, HeLa S3 and MCF-7. GM12878 had two replicates in IP and one control library, while the other two cell lines each had two replicates for both IP and control. The FASTQ sequence files for GM12878 and HeLa S3 were downloaded from: http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeBroadHistone and these for MCF-7 were from: http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeSydhHistone.
The ER chip-chip data is generated from the MCF-7 cell line using the Affymetrix human tiling microarray. The dataset was downloaded from: http://research4.dfci.harvard.edu/brownlab//datasets/index.php?dir=ER_MCF7_whole_human_genome/.
Read quality assessment
FastQC is a fast and flexible package for checking overall sequence quality (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/). For each sample, FastQC reports the distribution of average per-base and per-read quality, as well as the level of duplication and possible sources of contaminations. If there is indication of abnormality in mapping results, such as low mapping rate, user can review read quality in the FastQC reports and try to improve the mapping rate by trimming low-quality bases or adaptor sequences in the reads.
Read mapping algorithms
Several mapping software packages have been developed to map short reads to the reference genome [23]. BWA is a robust and fast short-read aligner, and has been widely used to map ChIP-Seq reads [24]. Novoalign (http://www.novocraft.com/main/index.php) is slower than BWA but is known to have higher sensitivity [23]. To decide which one to be implemented into the pipeline, we compared mapping rate between Novoalign and BWA on both single-end and paired-end ChIP-Seq data, and further assessed how the mapping difference might impact peak calling.
Post-processing of mapped reads
After initial alignment, the mapped reads need to be further processed in order to improve peak calling sensitivity and specificity. The post-processing steps below address the issues of poorly mapped reads, duplicate reads and reads mapping to multiple locations.
Reads with low mapping quality
It is a common practice to remove reads with low mapping quality. For single-end reads, HiChIP uses samtools [25] to filter out reads based on a user-defined mapping quality score threshold (default: 20). Mapped paired-end reads have three mapping states: both ends uniquely mapped; one of the ends uniquely mapped; both ends mapping to multiple locations (both have a zero mapping quality score). Samtools does not maintain the pairing information when performing mapping quality-based filtering for paired-end reads. Therefore, we provide a script to remove pairs of reads that have one or two ends below the mapping quality cutoff set by the user. The user can choose not to apply this filtering to the pairs of reads with the two ends mapping to multiple genomics locations (both have a quality score of “0” set by BWA). After the filtering, the proper pairing information will still be maintained.
Duplicate reads
For ChIP experiments, the sequencing library is mostly generated from a much smaller amount of DNA compared to standard DNA or RNA sequencing. Duplicate reads that map to the same genomic location and strand are frequently present in ChIP-Seq datasets. For many applications, duplicate reads are removed as they are considered likely represent experimental artifacts. However, in the context of a ChIP-Seq experiment duplicate reads can also occur during the sequencing of identical DNA fragments in peak regions. In this case, duplicate reads contribute to peak identification and should not be removed.
Chen et al. reported that duplicate removal could improve the specificity of MACS peak calling [5]. Since the level of duplicate reads as artifacts versus as true signals cannot be well defined, Picard (http://picard.sourceforge.net/) is included in HiChIP to remove duplicate reads by default. A user can specify whether to remove duplicate reads. To reflect the level of duplicate reads, HiChIP uses a custom script to measure library complexity as the ratio between number of duplicate-filtered reads and the total number of uniquely mapped reads. As a guideline, library complexity needs to reach ~0.8 at a sequencing depth of 10 million mapped reads [2]. Low library complexity suggests suboptimal immunoprecipitation efficiency, a lack of sufficient starting material, PCR over-amplification, or a combination of these factors.
Reads mapping to multiple genomic locations
In ChIP-Seq analysis, reads mapping to multiple genomic locations are often discarded [26]. Depending upon the nature of the studied epigenetic mark, this strategy may not be optimal in some cases. For instance, a substantial fraction of the H3K9me3 modification occurs in regions containing repetitive DNA sequences. In a survey of 12 H3K9me3 ChIP-Seq datasets from the ENCODE project (http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/), between 16% and 28% of the mapped reads have multiple matches in the genome. It has also been shown that some TF binding sites are located in regions with poor mappability [26]. In such cases, excluding reads mapping to multiple genomic locations will decrease the sensitivity of peak detection in these less mappable regions.
HiChIP allows the user to specify whether to filter out reads matching multiple locations. For single-end reads, only uniquely mapped reads are kept by default, with the option to include one random match for reads mapping to multiple locations. For paired-end reads, we developed an in-house script to filter out undesired pairs. Only mapped pairs with appropriate insertion sizes and correct orientation are kept. Depending on the user’s specification, these reads are further processed to retain pairs belonging to one of the three types: (a) only uniquely mapped pairs; (b) pairs with at least one uniquely mapped end; or (c) pairs with at least one uniquely mapped end, plus a random match if both ends align to multiple locations. No currently available public tool provides equivalent flexibility in the filtering of mapped paired-end ChIP-Seq reads.
Peak calling
There are two major ChIP-Seq binding profiles: punctate binding and diffuse binding. For punctate binding sites, peak calling identifies locations with maximum read density. For diffuse binding sites, the main goal is to define the boundary of individual binding domains. Therefore, different peak callers need to be used to take into account the differences in binding profile.
We used MACS to identify punctuate binding sites because of its high specificity and sensitivity [4, 5, 10]. MACS scans the genome for candidate regions and merges overlapping regions into peaks. It captures local signal fluctuation by modeling the background level as dynamic Poisson distribution.
The consistency of identified peaks can be assessed for punctate binding sites where replicates are available. In HiChIP, we implemented the IDR method to measure the consistency of peaks between replicates [10]. To prepare for IDR analysis, HiChIP combines mapped reads from IP replicates and those from control replicates into two merged datasets following the procedure proposed by Landt et al. [2]. The merged IP dataset is then split into two equally-sized pseudoreplicates after randomization; mapped reads from each IP are also split into two equally-sized pseudoreplicates. The true IP replicates, pseudoreplicates from each IP and merged IP dataset, merged IP and merged control datasets are then used in MACS peak calling. The consistency of resulting peaks is analyzed using the IDR procedure.
As suggested by Landt et al. [2], if a ChIP-Seq experiment has good reproducibility, the number of consistent peaks between true biological replicates and that between pseudoreplicates from merged IP should not differ by more than a factor of two. Similarly, the number of consistent peaks between pseudoreplicates from biological replicate 1 and that between pseudoreplicates from biological replicate 2 should also be within a factor of two. We added the IDR values (<1) estimated for shared peaks to the 4th column of the MACS output file with the ‘encodePeak’ extension. For replicate-specific peaks, an arbitrary value of ‘1’ is used instead. This will allow an easy extraction of consistent peaks at any user-specified IDR cutoff.
To identify diffuse binding sites, HiChIP leverages a widely used program called SICER [6]. SICER uses a clustering approach to define the boundary of diffuse binding sites. It identifies candidate sites of variable lengths based on a Poisson background model and links neighboring sites together if they are separated by gaps not exceeding a pre-defined gap size cutoff (gap size is 600 bp by default) and the whole domain is significantly enriched over the control [6]. SICER itself only provides filtering of binding regions based on FDR cutoff but not on fold change over the control. HiChIP further filters out candidate regions if the fold change is less than two.
Putative cis-regulated genes
After peak calling, potential cis-regulated genes associated with peaks are identified, which is based on the maximum distance of peaks to the transcriptional start sites (TSSs) or translational end sites (TESs). By default, this distance is set at 10 kilobases.
Data visualization
To enable visual inspection of discovered binding sites and their association with annotated genes or other genomic features, HiChIP generates files that can be visualized in a genome browser like the Integrative Genomics Viewer (IGV) [27]. CEAS needs a Wig file as an input. Since MACS version 2 does not generate a Wig file and SICER generates a Wig file with a relatively large span size (200 bp), we designed a module to generate bedGraph, Wig and tiled data format (TDF) files for data visualization.
To generate the bedGraph file, filtered reads in BAM format are first processed into bed format as follows. Single-end reads are extended by the average fragment length of the library (default 200 bp). For paired-end reads, the HiChIP pipeline keeps the first end and extends by the fragment length estimated from mapping positions of the two ends, rather than by the average fragment length of the library. Given the variability of fragment lengths across a complex genome like human genome, the use of actual coordinates of mapped pairs is expected to achieve better resolution in signal visualization. The bed file is then used to generate a bedGraph file by the genomeCoverageBed command from BEDTools [28].
The Wig file is generated from the bedGraph file, using an in-house script that computes the extended read coverage at a user-defined step size (default: 20 bp). The extended read coverage is normalized to a library size of one million mapped reads, and converted into the TDF format using the toTDF command from the igvtools package (http://www.broadinstitute.org/software/igv/igvtools). The normalized coverage in TDF format and identified peaks in bed format can be visualized by uploading files to IGV, or by opening the provided igv_session.xml file in IGV.
Peak and binding profile annotation module
HiChIP includes three tools to annotate peaks and binding profiles. We use MEME [15] for identifying the TF binding motif; CEAS (Cis-regulatory Element Annotation System) [12] for generating binding profiles over key genomic features and for predicting possible genes regulated by cis-regulatory elements; and an in-house tool for calculating enrichment in gene ontology (GO) terms for peak-associated genes.
HiChIP selects top peaks as input for CEAS and MEME. Peaks used by CEAS are selected based on the pre-defined –log10 (p value) (for MACS peaks) or –log10 (FDR) cutoff (for SICER peaks). Since the detection of binding motif(s) using MEME is dependent upon the set of DNA sequences provided, attention needs to be paid to the cutoff for peak selection. By default the top 10% of peaks with the largest –log10 (p value) will be used. The HiChIP pipeline also allows the user to select a certain number of top peaks for motif discovery. CEAS uses normalized Wig files and peak files (bed format) as inputs, and performs binomial test for enrichment of binding over genomic regions such as gene promoters, gene bodies, exons and introns.
An in-house method is implemented to identify GO terms that are enriched in peak-associated genes. This method uses a similar approach as GREAT [11] that could not be integrated into our workflow, since the main functionality of GREAT is only available through web services. The lists of human and mouse genes with annotated GO terms were downloaded from the GREAT website (http://bejerano.stanford.edu/help/display/GREAT/Genes). For each gene annotated with at least one ontology term, the HiChIP pipeline first defines its regulatory domain as the region from upstream (U + UE) bp to downstream (D + DE) bp around the TSS, where the region from upstream U bp (default 5000) to downstream D bp (default 1000) represents the proximal regulatory domain, and UE and DE denote the maximum upstream and downstream extension, respectively. Binomial tests are then performed to identify a list of GO terms that are enriched in genes associated with peaks.