ChIP-BIT2: a software tool to detect weak binding events using a Bayesian integration approach

Transcription factor binding events play important functional roles in gene regulation. It is, however, a challenging task to detect weak binding events since the ambiguity in differentiation of weak binding signals from background signals. We present a software package, ChIP-BIT2, to identify weak binding events using a Bayesian integration approach. By integrating signals from sample and input ChIP-seq data, ChIP-BIT2 can detect both strong and weak binding events at gene promoter, enhancer or the whole genome effectively. The ChIP-BIT2 package has been extensively tested on ChIP-seq data, demonstrating its wide applicability in ChIP-seq data analysis. Availability and Implementation The ChIP-BIT2 package is available at http://sourceforge.net/projects/chipbitc/.


Introduction
In order to identify transcription factor (TF) binding events or enrichment of histone markers (HMs), biologists run a sample ChIP-seq experiment to capture binding signals and background signals, and a second input ChIP-seq experiment to capture background signals only. The regulation strength of a TF or HM is often cell-type and condition specific. In some cases, the strength of a functional binding may be weak with a low read count observation. It is a challenging task to identify those weak binding events (with relatively low enrichment in the sample experiment but still much higher than that of input experiment) since they are more easily to be mixed with background signals also measured in the sample experiment. Recent studies have shown that functional effects of weak bindings can be very significant on gene transcription (1).
A number of ChIP-seq peak detection tools were developed using both sample and input ChIP-seq data (2)(3)(4)(5)(6)(7). Read count is the most widely used signal format, which can be modeled to follow a Poisson distribution in sequencing data analysis (8)(9)(10).
Although mathematically convenient to use read count, it gives rise to the difficulty in detecting weak binding events in ChIP-seq data because a ChIP-seq profile includes both binding and background signals and the medium or low read counts of weak binding events are often close to the level of background signals. Those tools developed based on read count are easily to classify weak binding events as background signals. Therefore, a Bayesian method, namely ChIP-BIT, has been proposed to reliably identify weak binding events at gene promoter regions (7).
Application of ChIP-BIT is limited since it can only be used to detect binding events of TFs at promoter regions. We present a new software package, ChIP-BIT2, to detect binding events of both strong and week signals from ChIP-seq data. ChIP-BIT2 expands the capability of ChIP-BIT for detecting binding events of TFs or HMs at any regions including gene promoter, enhancer or simply the whole genome. Moreover, ChIP-BIT2 is a C/C++ implementation. Compared to the original ChIP-BIT package (implemented in MATLAB ), a significant speed improvement (~40%) can be gained for a pair of matched sample and input ChIP-seq profiles.

Description
The challenge in detecting weak binding events lies in the ambiguity in differentiation of the binding signals from background signals. Background signal is not random noise. Its amplitude can be as high as a true binding signal. Using read intensity (log transform of average read coverage (7)), as illustrated in Fig. 1, ChIP-BIT2 will shrink the 'distance' between strong and weak binding signals and enlarge the 'distance' between weak and background signals, making it easier to detect weak bindings. The core functions of the algorithm are implemented in C/C++ as depicted in Fig. 2. ChIP-BIT2 can detect binding events either proximal to transcription starting sites (TSSs) or located at distal enhancer regions. As shown in Fig. 1 and Fig. 2, candidate ChIP-seq signal enriched regions are first identified through a data pre-processing step (with a similar strategy in PeakSeq (3)). Input background signal is then properly normalized against sample ChIP-seq signal. After loading annotated enhancer or promoter regions, ChIP-BIT2 can run in either 'promoter mode' or 'enhancer mode'. Each promoter or enhancer region is first extended to a user-defined range and then partitioned into 200 bps bins. If no annotation region is provided, ChIP-BIT2 will detect all possibly enriched regions and partition the into bins as candidate regions. For each bin, sample read intensity, input read intensity and binding location (for promoter only) are calculated for each bin. A probability for binding occurrence is estimated using the Expectation-Maximization (EM) method. Consecutive bins with probabilities higher than a predefined threshold will be merged and finally outputted as enriched peaks. Using bin-based model, there is no need to set peak size limit, which makes ChIP-BIT2 flexible to detect narrow or wide peaks for TFs or HMs.

Results
We have compared ChIP-BIT2 and MACS2 using a breast cancer MCF-7 ChIP-seq data set with 39 TFs (as listed in Table 1). BAM files of selected TFs and their matched input data are downloaded from the ENCODE website (https://www.encodeproject.org/) and GEO database (https://www.ncbi.nlm.nih.gov/geo/). We used ChIP-BIT2 to process each TF and its match input data and predicted binding events at enhancer or gene promoter regions. Gene promoter regions were extracted from human reference genome hg19 as ±10k bps around each TSS. In total, we obtained 25,802 promoter regions regardless of potential overlap. Breast cancer MCF7 enhancer like regions were downloaded from ENCODE (https://www.encodeproject.org/data/annotations/). In total, there are non-overlap 33,957 enhancer regions (referred to hg19). We extended or pruned enhancer regions as ±1k bps around the original middle points. Finally, we used ChIP-BIT2 to call TFBSs at promoter and enhancer regions, respectively. ChIP-BIT2 and its previous MATLAB version were tested under CentOS Linux 7.3 on DELL T7600 workstation with 3.1GHz CPU (32 cores) and 128GB RAM. ChIP-BIT2 only uses one CPU core for one TF's ChIP-seq data processing and requires less than 2GB memory, so the running speed should be similar on any desktop or laptop with a similar CPU speed. The running speed of ChIP-BIT2 is summarized in Table. 2. Compared to its previous MATLAB version (supporting 'promoter mode' only), ChIP-BIT2 has gained a speed improvement of ~40%. Its speed in 'enhancer mode' is very similar to that in 'promoter mode', mainly determined by the number of reads in sample and input experiments.
We then compared ChIP-BIT2-detected peaks against those predicted by MACS2 (2) at selected promoter or enhancer regions associated with active genes (highly expressed) in MCF-7 cells. As shown in Fig. S3, a high proportion (73% for promoter or 54% for enhancer) of MACS2-detected binding events were captured by ChIP-BIT2. ChIP-BIT2 detected additional weak binding events (probability >0.9) together with those strong ones. Weak bindings at enhancer regions may play a more important role in gene regulation because an enhancer region is likely to be activated by a set of TFs, including both major activators and co-factors; the binding strength of cofactors may be weak (11). Therefore, ChIP-BIT2, by modeling read intensity instead of read count, can better eliminate background signals to effectively reduce false positive predictions, achieving an improved accuracy in detecting weak binding events.