A flexible ChIP-sequencing simulation toolkit

Background: A major challenge in evaluating quantitative ChIP-seq analyses, such as peak calling and differential binding, is a lack of reliable ground truth data. Accurate simulation of ChIP-seq data can mitigate this challenge, but existing frameworks are either too cumbersome to apply genome-wide or unable to model a number of important experimental conditions in ChIP-seq.

Results: We present ChIPs, a toolkit for rapidly simulating ChIP-seq data using statistical models of key experimental steps. We demonstrate how ChIPs can be used for a range of applications, including benchmarking analysis tools and evaluating the impact of various experimental parameters. ChIPs is implemented as a standalone command-line program written in C++ and is available from https://github.com/gymreklab/chips.

Conclusions: ChIPs is an efficient ChIP-seq simulation framework that generates realistic datasets over a flexible range of experimental conditions. It can serve as an important component in various ChIP-seq analyses where ground truth data are needed.

Supplementary Information: The online version contains supplementary material available at 10.1186/s12859-021-04097-5.

In the case of paired-end reads, fragment lengths can be determined trivially from the mapping locations of paired reads: the observed fragment length X_i for each read pair i is computed from the mapping coordinates of the two reads. The learn module randomly selects 5,000 uniquely aligned read pairs from the input BAM for fitting a gamma distribution. Read pairs are filtered to remove fragments that are unaligned, not properly paired, marked as duplicates, or marked as secondary alignments. Read pairs are further filtered to remove fragments with length greater than 3 times the median length of selected fragments. The mean fragment length is computed as µ = (1/n) Σ_{i=1}^{n} X_i, where n is the number of fragments remaining after filtering. We then use the method of moments to estimate the gamma distribution shape (k) and scale (θ) parameters: k = µ²/v and θ = v/µ, where v is the sample variance of the filtered fragment lengths. For single-end reads, individual fragment lengths are not directly observed. We outline an approximate method for estimating fragment length distributions from single-end data in the section "Inferring fragment lengths from single-end reads" below.
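The method-of-moments fit described above can be sketched in a few lines of Python. This is a simplified illustration only, not the ChIPs C++ implementation; the function name and list-based input are our own:

```python
from statistics import mean, median, variance

def fit_fragment_gamma(fragment_lengths):
    """Fit a gamma distribution to observed fragment lengths by the
    method of moments, after dropping fragments longer than 3x the median."""
    med = median(fragment_lengths)
    kept = [x for x in fragment_lengths if x <= 3 * med]
    mu = mean(kept)          # sample mean
    v = variance(kept)       # sample variance
    k = mu * mu / v          # shape parameter
    theta = v / mu           # scale parameter (so k * theta == mu)
    return k, theta
```

Note that k·θ recovers the mean fragment length, which gives a quick sanity check on the fit.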
Step 2: immunoprecipitation

We use B_i to denote the event that fragment i is bound, B̄_i to denote that fragment i is unbound, and D_i to denote that fragment i is pulled down. Assuming the probability that a given bound fragment is pulled down is approximately constant over all fragments, the probability that fragment i is pulled down can then be written as:

P(D_i) = P(D|B)P(B_i) + P(D|B̄)(1 − P(B_i))

where P(D|B) denotes the probability that a given bound fragment will be pulled down and P(D|B̄) denotes the probability that a given unbound fragment will be pulled down.
For any fragment i, we set the probability that it is bound, P(B_i), based on the scores of the peaks it overlaps, where R_i is the set of all peak regions overlapping fragment i and S(r) is the probability that peak r is bound. A method for estimating S(r) is detailed below.
We compute the conditional pulldown probabilities using Bayes' rule:

P(D|B) = P(B|D)P(D) / P(B)    (4)

P(D|B̄) = P(B̄|D)P(D) / P(B̄)    (5)

where here P(B|D) and P(B̄|D) represent averages across all fragments. We do not have a straightforward way to compute P(D), the average probability that a fragment is pulled down, using only observed ChIP-seq reads, since we do not actually observe fragments that are not pulled down. Thus, we do not attempt to compute these conditional probabilities directly. Instead, we take their ratio, which cancels P(D):

α = P(D|B) / P(D|B̄) = [P(B|D)P(B̄)] / [P(B̄|D)P(B)]    (6)

P(B), the probability that a fragment is bound on average, is equal to f, the fraction of the genome bound by the factor of interest. We can approximate f as the sum of the lengths of all peaks R weighted by their binding probabilities, divided by the total length of the genome:

f ≈ (Σ_{r∈R} S(r) · l_r) / G

where G is the size of the reference genome, S(r) is the probability that peak region r is bound, and l_r is the length of peak r, assuming no overlap between peaks.
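The approximation of f amounts to a weighted sum over peaks. A minimal sketch, assuming peaks are given as (score, length) pairs (our own representation, not the ChIPs source):

```python
def genome_fraction_bound(peaks, genome_size):
    """Approximate f: sum of peak lengths weighted by their binding
    probabilities S(r), divided by the genome size G."""
    return sum(score * length for score, length in peaks) / genome_size
```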
P(B|D), or the probability that a fragment is bound given that it is pulled down, is a measure of the specificity of the antibody. Assuming the majority of reads falling in peaks are truly bound, we can roughly approximate this as the fraction of fragments falling within peaks, which we denote as s.
Using these two metrics, f and s (Supplementary Table 1), and substituting P(B|D) = s, P(B̄|D) = 1 − s, P(B) = f, and P(B̄) = 1 − f, we can simplify the ratio α as:

α = s(1 − f) / (f(1 − s))

Since we only know the ratio α, for simplicity in simulations we set P(D|B) = 1 and P(D|B̄) = 1/α. Substituting into the equation for P(D_i) above, we compute the probability that fragment i is pulled down as:

P(D_i) = P(B_i) + (1/α)(1 − P(B_i))

where P(B_i) is based on peak scores as defined above.
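Putting these pieces together, the per-fragment pulldown probability can be sketched as follows (our own illustration; the function name and inputs are hypothetical):

```python
def pulldown_prob(p_bound, f, s):
    """P(D_i) given the fragment's binding probability P(B_i) = p_bound,
    the fraction of the genome bound f, and the SPOT score s.
    Uses alpha = s(1 - f) / (f(1 - s)), with P(D|B) = 1 and P(D|~B) = 1/alpha."""
    alpha = (s * (1 - f)) / (f * (1 - s))
    return p_bound + (1 - p_bound) / alpha
```

For a typical TF dataset (f near 0.01), alpha is large, so unbound fragments are pulled down only rarely.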
Note that in reality, P(D|B) is likely to be much less than 1, since the pulldown process is inefficient and many fragments are lost. Further, this number is likely to vary considerably across different experiments. Estimating the absolute value of P(D|B) from real datasets is a topic of future work.

Inferring fragment lengths from single-end reads
To estimate the fragment length distribution from single-end reads, we assume the length distribution follows a gamma distribution with mean µ and variance v, and use reads located inside ChIP-seq peaks (provided as input) to estimate µ and v which are used to compute k and θ. This is a heuristic method that provides reasonable estimates of the fragment size distribution, which is sufficient for most applications.
For each peak peak_i, we keep track of two lists, {start}_{peak_i} and {end}_{peak_i}. For each read overlapping peak_i, if the read is on the forward strand we add its start coordinate to {start}_{peak_i}; if the read is on the reverse strand we add its start coordinate to {end}_{peak_i}. The center point of this peak is calculated as:

center_{peak_i} = (mean({start}_{peak_i}) + mean({end}_{peak_i})) / 2

For every peak_i we offset the coordinates in {start}_{peak_i} and {end}_{peak_i} by center_{peak_i}, so that the coordinates of start points and end points are symmetric around zero. We then concatenate the lists from each peak to form {start} and {end}. The mean fragment length µ can be estimated as:

µ = mean({end}) − mean({start})

We calculate the probability density functions, cumulative density functions, and expected density functions for both {start} and {end}. The expected density function EDF(x) is defined as the expected deviation of a random element in the list from x:

EDF_{start}(x) = E[|S − x|],  EDF_{end}(x) = E[|E − x|]

where S is a random element in {start} and E is a random element in {end}.
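The centering-and-pooling step can be illustrated as follows. This is a simplified sketch, assuming reads are supplied per peak as (strand, start-coordinate) tuples (our own representation):

```python
from statistics import mean

def estimate_mean_fragment_length(peak_reads):
    """peak_reads: one list per peak of (strand, start) tuples.
    Centers each peak's start/end coordinates, pools them across peaks,
    and estimates the mean fragment length as mean(ends) - mean(starts)."""
    starts, ends = [], []
    for reads in peak_reads:
        s = [pos for strand, pos in reads if strand == "+"]
        e = [pos for strand, pos in reads if strand == "-"]
        if not s or not e:
            continue  # skip peaks lacking reads on either strand
        center = (mean(s) + mean(e)) / 2
        starts += [p - center for p in s]
        ends += [p - center for p in e]
    return mean(ends) - mean(starts)
```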
After we compute µ, we can reduce the density function of the fragment length distribution from p_{µ,v}(x) to p_v(x), a density parameterized only by the unknown variance v. We construct a score function F(v) as shown below. Intuitively, if we have a correct guess of v, F(v) should be equal to zero.
where L represents a randomly chosen fragment length.
To find an optimal v that minimizes |F(v)|, we conduct a binary search between 1,000 and 10,000.
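Assuming F(v) is monotonic over the search interval, the binary search step can be sketched as follows (our own illustration of the search, not the ChIPs source; score_fn stands in for F):

```python
def search_variance(score_fn, lo=1000.0, hi=10000.0, tol=1e-3):
    """Binary search on [lo, hi] for the v where score_fn(v) crosses zero,
    assuming score_fn is monotonic on the interval."""
    increasing = score_fn(hi) > score_fn(lo)
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if (score_fn(mid) < 0) == increasing:
            lo = mid  # the zero crossing lies above mid
        else:
            hi = mid  # the zero crossing lies at or below mid
    return (lo + hi) / 2
```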
In practice, we slightly offset the last two items in the score function in Equation 14 to compute the score below, which results in slightly more accurate estimation of v on real data. This may be because fragment length distributions are truncated on the left end, with few or no fragments shorter than 100bp observed, and thus do not follow a true gamma distribution.
Examples of inferred fragment length distributions from single-end data compared to actual fragment length distributions for a variety of datasets are shown in Supplementary Figure 1.

ChIPs implementation details

Peak scores
A peak score S(r) is defined for each input peak, where S(r) gives the probability that a fragment overlapping the region is bound. Note we cannot directly estimate this probability from bulk ChIP-seq data, but assume variability in intensity across peaks is representative of different relative binding probabilities. Based on user input, ChIPs computes these binding probabilities either based on peak intensities given in an input BED file or based on read counts from an existing BAM file. If peak intensities are provided, S(r) is computed as the score of peak r divided by the maximum peak intensity. If a BAM file is provided, S(r) is defined as the number of reads overlapping peak r divided by the maximum number of reads overlapping any peak. In both cases, resulting scores S(r) are between 0 and 1.
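The rescaling logic can be sketched in a few lines (a simplified stand-in for the BED-intensity path; the function name is ours):

```python
from statistics import median

def peak_scores(intensities, scale_outliers=False):
    """Rescale raw peak intensities to probabilities S(r) in (0, 1].
    With scale_outliers, intensities are first capped at 3x the median,
    so all capped peaks end up with S(r) = 1."""
    vals = list(intensities)
    if scale_outliers:
        cap = 3 * median(vals)
        vals = [min(v, cap) for v in vals]
    top = max(vals)
    return [v / top for v in vals]
```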
By specifying the option --no-scale, users may directly input binding scores that will be treated as probabilities and will not be rescaled. Users may also specify --scale-outliers to cap peak intensities at 3 times the median score before rescaling, to reduce the effect of outlier peaks. In this case, all peaks with scores greater than 3 times the median will be given a score of 1.

Learn implementation
The ChIPs learn module takes a set of input peaks (BED) and aligned reads (BAM) and learns parameters for (1) the fragment length distribution (k, θ), (2) pulldown efficiency (f, s), and (3) PCR efficiency (p) (Supplementary Table 1). BAM files must have PCR duplicates marked using a tool such as Picard [1] to enable accurate estimation of PCR parameters. The learn module outputs model parameters to a JSON file that can be used as input to the simreads module. We found that the learned value of the key pulldown parameter, α, is relatively robust to the peak caller used to generate the input peak dataset (Supplementary Figure 2).

Simulation implementation
The simreads module takes in a set of peaks, model parameters (e.g., from the learn module), experimental parameters (including read length and number of reads), and the number of simulation rounds, and simulates raw sequencing reads in FASTQ format. First, each copy of the genome (equivalent to one simulation round) is randomly fragmented based on the specified fragment length gamma distribution. Next, we decide whether to pull down each fragment with the probability P(D_i) described above, which depends on its overlap with input peaks through P(B_i). Finally, we generate reads from the resulting pool of fragments based on the input parameters (the specified read length, number of reads, and sequencing mode [single/paired]). We output "PCR duplicates" of each sequenced read based on the input PCR rate p.
In practice, in each round the genome is processed in bins (default size 100kb) to avoid storing large fragment pools in memory. In a preprocessing step, we determine how many reads to generate from each simulation round based on the target number of output reads. ChIPs is parallelized by performing different simulation rounds on separate threads simultaneously.
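One simulation round over one bin can be sketched roughly as follows. This is our own simplified Python illustration using random.gammavariate; the real implementation is C++ and multithreaded, and the pulldown function here is a hypothetical placeholder:

```python
import random

def simulate_bin(bin_start, bin_end, k, theta, pulldown_prob_fn):
    """One simulation round over one genomic bin: tile the bin with
    gamma-distributed fragment lengths, then retain each fragment
    according to its pulldown probability."""
    pulled = []
    pos = bin_start
    while pos < bin_end:
        frag_len = max(1, round(random.gammavariate(k, theta)))
        frag = (pos, min(pos + frag_len, bin_end))
        if random.random() < pulldown_prob_fn(frag):
            pulled.append(frag)
        pos += frag_len
    return pulled
```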

Benchmarking experiments
Evaluating ChIPs performance To evaluate ChIPs, we compared simulated reads to reads from real ChIP-seq experiments using read alignments and peaks generated by ENCODE [2]. All ENCODE accessions for reads and peaks are given in Supplementary Table 3.
We first ran the chips learn module to learn model parameters based on the ENCODE BAM and BED files for each TF or HM, using learn parameters -c 7 -t bed --scale-outliers. The JSON model file output by the learn module was used as input to the simreads module. We performed various runs using an increasing number of simulation rounds (chips simreads argument --nc 1, 5, 10, 50, 100, 500, 1000, 5000, or 10000). We simulated single-end reads for chr22 using the same read length as the corresponding ENCODE dataset, setting the number of reads to twice the number of aligned reads for chr22 in the real dataset, since not all simulated reads will uniquely align back to the genome. The resulting reads were aligned to the hg19 reference genome using BWA-MEM [3] v0.7.12-r1039, and duplicates were flagged using Picard [1] v2.18.11.
We used the bedtools [4] (v2.27.1) makewindows command to generate a list of non-overlapping 1kb windows across chr22 and the bedtools multicov command to count the number of reads from the simulated and ENCODE BAM files falling in each window. We used the bedtools intersect command to determine the intersection of each bin with the input peak files. We then computed the Pearson correlation between log10 read counts in the simulated vs. ENCODE datasets across bins, after adding a pseudocount of 1 read to each bin.
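The correlation computation can be reproduced with a short function (our own sketch; in the paper this was computed on bedtools multicov output):

```python
from math import log10, sqrt

def log_count_correlation(sim_counts, real_counts):
    """Pearson correlation of log10(count + 1) across genomic bins."""
    xs = [log10(c + 1) for c in sim_counts]
    ys = [log10(c + 1) for c in real_counts]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```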
Timing experiments were performed in a Linux environment running Centos 7.4.1708 on a server with 28 cores (Intel® Xeon® CPU E5-2660 v4 @ 2.00GHz) and 125 GB RAM using the UNIX "time" command and are based on the "sys" time reported.

Comparison to ChIPulate
We compared the performance of ChIPulate [5] to ChIPs based on the datasets described in Supplementary Table 3. We first computed the binding energies for each peak required as input. Binding probabilities S(r) for each peak r were computed as described above by scaling peak scores based on ENCODE data to be between 0 and 1, similarly setting outlier peaks with scores higher than twice the median to have score 1. We then computed a binding energy for each region r from S(r); the unbound (background) binding energy and the chemical potential are provided as command-line inputs to ChIPulate. We set the chemical potential to 0, which provided the best dynamic range across peak scores and the highest correlation with real data. The background binding energy was set to the default (1.0).
ChIPulate parameters were set to achieve the same fragment length distributions, number of reads, and read lengths as in the ENCODE data. Fragment length (--fragment-length) was set to the mean fragment size inferred by ChIPs learn. Read length (--read-length) was set to the read length of each dataset, ranging from 36bp to 101bp. Depth (-d) was set to the number of reads in the ENCODE dataset mapped to chr22 divided by the number of peaks on chr22. We additionally set p_amp=0.50, p_ext=0.54, and --mu-A=0 based on examples shown in the ChIPulate documentation. We used the parameter -n to vary the number of copies (simulation rounds) from 10 to 10,000; runs with -n set to 1 or 5 failed and so were excluded from analysis. All other parameters were set to default values. Read counts in 1kb windows were compared between simulated and ENCODE datasets using bedtools multicov as described above.

Comparison to isChIP
We similarly evaluated the performance of isChIP [6] v1.0 based on the datasets described in Supplementary Table 3 and shown in Supplementary Figures 4-5, setting parameters to mimic those of real datasets. Read length (-r) and the maximum number of reads (--rd-lim) were set based on the read lengths and the number of reads aligned to chr22 in the ENCODE data. The log of the mean fragment length (-L) was set based on the mean fragment length inferred by ChIPs learn. We provided ENCODE peaks as input with option --bscore 0; using ENCODE peak scores directly (--bscore 7) resulted in very poor correlation with real data. We used the parameter --cells to vary the number of cells from 1 to 10,000.
Since we did not have a way to estimate the foreground and background parameters (--ground option), we evaluated two settings. In the first, we set the foreground to the SPOT score and the background to the default value of 1. In the second, labeled "HighBG", we increased the background noise to 4.
Unless otherwise specified, we set the number of PCR cycles to 0 (--pcr 0). Attempts to run with even small numbers of PCR cycles resulted in unreliable output. Results from setting --pcr 10 are shown in Supplementary Figure 4.

Simulating benchmarking peaks
For the experiments shown in Figure 2, we generated two sets of datasets meant to represent characteristics of either histone modification (HM) or transcription factor (TF) ChIP-seq datasets. We simulated one peak file each for TF and HM, with SP1 (bam=ENCFF001TYZ) and H3K27ac (bam=ENCFF411MHX) as templates. To generate the peak files, we randomly placed peaks on the hg19 genome, sampling peak lengths and scores from the distributions observed in real data. In total, we simulated 29,579 non-overlapping peaks for the TF dataset and 77,413 for the HM dataset. Results in Figure 2 are based on 350 TF peaks and 697 HM peaks on chr21.

Evaluating peak callers
We conducted multiple sets of experiments simulating paired-end ChIP-seq datasets with different SPOT scores (s ranging from 0.01-0.5 for TF data and 0.01-0.75 for HM data). For each simulated dataset we ran ChIPs simreads using the HM and TF model parameters described above with the additional options --paired --region chr21:1-48129895. We benchmarked the peak callers MACS2 (v2.2.6), GEM (v3.4), MUSIC, BCP (peakranger v1.18), and HOMER. When running these peak callers, we applied their default parameter settings except for flags indicating the type of input data (i.e., TF or HM).
To quantitatively evaluate the quality of the predicted peaks, we followed the metrics used in [7] and trimmed each peak to 200bp around its summit. Then, we calculated the number of real peaks that peak callers retrieved as well as the number of predicted peaks that overlap with real peaks.

Supplementary Note
Below we compare how key steps of the ChIP-seq process are modeled by ChIPs in comparison to other recently developed ChIP-seq simulators (ChIPulate [5] and isChIP [6]). These differences are also summarized in Supplementary Table 1.
Fragment lengths

• isChIP: models variable fragment lengths drawn from a log-normal distribution with an optional size-selection step.
• ChIPs: models variable fragment lengths drawn from a gamma distribution. Gamma and log-normal distributions often appear highly similar, and in practice we have not found this difference in fragment length distributions to have a significant impact on simulation results.

Pulldown and binding efficiency
• isChIP: models the probability of loss of selected foreground vs. background fragments compared to generated fragments.
• ChIPulate: models binding efficiencies of each binding site, but does not consider background regions outside of specified binding sites.
• ChIPs: models pulldown efficiency using a Bayesian model based on the SPOT score (s) and the fraction of the genome bound (f), both of which can be inferred from real datasets and peaks generated by a variety of peak-calling algorithms (Supplementary Figure 2).

PCR
• ChIPulate: models the amplification efficiency of each region as well as the number of PCR cycles. Of the three tools, the ChIPulate PCR model is most advanced but also most computationally intensive.
• isChIP: copies each fragment 2^n times, where n is the number of PCR cycles. In practice, because the total number of fragments increases exponentially, this process explodes after several cycles. In our simulation experiments, we found that using even small numbers of PCR cycles (e.g. 10) results in unreliable output.
• ChIPs: models the distribution of the number of duplicate reads. This parameter can be easily inferred from existing data and used to simulate realistic PCR duplicate patterns.

Sequencing
• ChIPulate: does not model sequencing errors.
• isChIP: does not model sequencing errors.
• ChIPs: models base substitution and indel rates based on real Illumina data.

Supplementary Figures
Supplementary Figure 1

Inferring fragment length distributions from single-end reads. Green bars show a histogram of lengths of 10,000 randomly chosen fragments from GM12878 paired-end ChIP-seq experiments. Red lines give the best-fit gamma distribution learned using observed fragment lengths. Blue lines give the fit inferred ignoring pair information using our novel method for learning fragment length distributions from single-end data. ENCODE accessions are given in Supplementary Table 3.

Supplementary Figure 2

Evaluation of pulldown parameter estimation with input TF and HM peaks from different peak callers. Each dot in the plots indicates an individual experiment and shows the comparison between estimated and expected alpha. Alpha is the ratio of pulldown probabilities of bound regions to unbound regions as described in the Supplementary Methods. The expected alpha was used in simulating reads and the estimated alpha was computed using our ChIPs learn module. Blue=BCP; orange=GEM; green=MACS2; red=MUSIC; purple=HOMER. In a. (HM), α estimated based on MUSIC peaks becomes unreliable for simulated datasets with high s (SPOT) scores. This issue can be mitigated by applying the "scale-outliers" option in the learn module, which reduces the effect of outlier peaks. In b. (TF), the estimated alpha based on HOMER peaks is higher than expected. This is because HOMER is more stringent in deciding peak boundaries: the peak length from HOMER (mean=323.72bp) is smaller than the others (mean=746.22bp) and the ground truth (mean=448.80bp), resulting in the f value being underestimated and the α value being overestimated.

Benchmarking of ChIPs against existing simulators. Evaluation of ChIPs against existing methods was performed by comparing the simulation results of each to real ChIP-sequencing datasets covering a range of dataset types (both histone modifications and transcription factors), sequencing settings (single vs. paired-end reads), and data qualities as measured by SPOT scores (s). a. The Pearson correlation between read counts of simulated vs. real (ENCODE) datasets for histone modifications and transcription factors was measured across a range of simulation rounds. isChIP was run with multiple parameter choices (isChIP: low background and no PCR; isChIP w/PCR: low background and 10 PCR cycles; isChIP HighBG: high background and no PCR). For datasets where paired-end data was available (BLCAF1 and IZKF1), we evaluated ChIPs using models learned from paired-end data (dashed lines) and ignoring paired-end information (solid lines). All other datasets are single-end. b. Box plots of the relative number of reads in peaks vs. background regions in simulated vs. real data per 1kb bin. Left (blue) are the peak regions; right (red) are the background regions. The median of the real (ENCODE) data is denoted as a dashed line in each plot. Read counts in each bin were normalized by the total number of reads aligned to chr22 for each dataset.