Peak Finder Metaserver - a novel application for finding peaks in ChIP-seq data
© Kruczyk et al.; licensee BioMed Central Ltd. 2013
Received: 27 March 2013
Accepted: 10 September 2013
Published: 23 September 2013
Finding peaks in ChIP-seq is an important process in biological inference. In some cases, such as positioning nucleosomes with specific histone modifications or finding transcription factor binding specificities, the precision of the detected peak plays a significant role. There are several applications for finding peaks (called peak finders) based on different algorithms (e.g. MACS, Erange and HPeak). Benchmark studies have shown that the existing peak finders identify different peaks for the same dataset and it is not known which one is the most accurate. We present the first meta-server called Peak Finder MetaServer (PFMS) that collects results from several peak finders and produces consensus peaks. Our application accepts three standard ChIP-seq data formats: BED, BAM, and SAM.
Sensitivity and specificity of seven widely used peak finders were examined. For the experiments we used three previously studied Transcription Factors (TF) ChIP-seq datasets and identified three of the selected peak finders that returned results with high specificity and very good sensitivity compared to the remaining four. We also ran PFMS using the three selected peak finders on the same TF datasets and achieved higher specificity and sensitivity than the peak finders individually.
We show that combining outputs from up to seven peak finders yields better results than individual peak finders. In addition, three of the seven peak finders outperform the remaining four, and running PFMS with these three returns even more accurate results. Another added value of PFMS is a separate report of the peaks returned by each of the included peak finders.
KeywordsTranscription factor Peak finder ChIP-seq Metaserver
The aim of peak finding in ChIP-seq analysis is identification of genomic regions with a high density of mapped sequence tags relative to a measured or estimated background. A simple approach to achieving this goal consists of two steps. Firstly, a sequence of mapped tags along the genome is extracted. Secondly, every contiguous sequence of base pairs with more than a predefined threshold number of tags is selected as an enriched region or binding site. However, the experimental noise and inherent complexities of the tags require more sophisticated algorithms. Numerous solutions have been designed following different statistical models and enrichment measures, including window-based models such as Erange , Hidden Markov Model-based methods such as HPeak  and others, for example, FindPeaks . Differences in the characteristics of the algorithms result in identification of different sets of peaks for the same ChIP-seq data set. These algorithmic diversities provide an opportunity to analyze ChIP-seq datasets under different conditions, but the problem of deciding which method is optimal for a given data set remains unsolved . Here, in the spirit of protein structure prediction meta-servers, e.g. , we present the first meta-solution as a method that combines results from peak-finders chosen by the user.
The implemented meta-server collects results from several peak finders and from these extracts the final result. The peak finders currently included in PFMS are: MACS v1.3.7 , CisGenome v2.0 , SISSRs v1.4 , Erange v2.1 , SeqSite v1.0 , FindPeaks v18.104.22.168 , and HPeak v1.1 . The system can be configured to include any combination of these peak finders. PFMS is implemented in a multi-threading manner. The peak finders can be run in parallel, or sequentially depending on user’s hardware.
PFMS supports three input data formats: BED, BAM, and SAM. Internally, BAM and SAM formats are converted to the BED format in the initial step of the analysis in PFMS. The conversion is performed employing BEDTools  and samtools . The application analyzes one chromosome at a time which implies extraction of the tags of the specified chromosome in a given data set. It is achieved by employment of FindPeaks split tool . PFMS repeats the analysis process for all the chromosomes included in the dataset unless specified otherwise. Although the BED format is widely used, some of the included peak finders require the data set to be given in custom formats (e.g. Erange ). Therefore, PFMS internally converts the input data according to the required format of the peak finders.
Seven peak identification methods are currently integrated in PFMS. They are run separately to identify lists of putative peaks. The peaks are collected, normalized and converted to either BED or WIG, as selected by the user. The BED option is more convenient for additional downstream analysis, while WIG is better for visualization purposes since it retains information about the peak shapes.
PFMS selects an integrated list of significant peaks from the combined results by any of the following methods: voting mechanism, minFP or minFN. The first method does not use any normalization, while the other two use normalized scores of peaks obtained from the individual peak finders. The normalization is carried out using one of five different normalization methods described below.
The peaks detected by the individual peak finders are usually scored by various enrichment measures. Due to different ranges of the scores, weighting of the selected peaks may be biased to the peaks ranked with higher scaled scores. To overcome this drawback, the output from each peak finder is normalized; five normalization methods are implemented.
where: normalized_peak_score - score of the peak after Normal Normalization peak_score - original score of the peak - the mean score in the results of the currently normalized peak finder sd - standard deviation of scores in the results of the currently normalized peak finder normal_shift - the mean value of the scores after normalization max_no_of_peaks - maximal number of peaks returned by the peak finders
For example, five peaks with scores (2, 4, 4, 8, 12) were obtained from SISSR, while 10 peaks were obtained from MACS, which was the highest number of peaks from a single peak finder. The mean value was 6 and the standard deviation was 4. After applying Normal Normalization, the SISSR scores were normalized to (20, 25, 25, 35, 45). This normalization type may be used with BED option only.
Naive quantile normalization
where: normalized_peak_score - score of the peak after Naive Quantile Normalization peak_score - original score of the peak quantile n - the n-th quantile of the peak scores from the currently normalized peak finder max_no_of_peaks - maximal number of peaks returned by the peak finders
As an example, SISSR returned five peaks with scores (2, 4, 4, 8, 12), while 10 peaks were obtained from MACS, which was the highest number of peaks from a single peak finder. By using quantile 0.8, the peak scores of SISSR were normalized to (2.5, 5, 5, 10, 15), and rounded up to (3, 5, 5, 10, 15). Naive Quantile Normalization type may be used with the BED and WIG options.
where: normalized_peak_score - score of the peak after Average Normalization peak_score - original score of the peak - mean peak score in the currently normalized peak finder results max_no_of_peaks - maximal number of peaks returned by the peak finders
As an example, SISSR returned five scores (2, 4, 4, 8, 12), while 10 peaks were obtained from MACS, which was the highest number of peaks from a single peak finder. The mean value of SISSR peak scores, which was 6, was used to normalize the peak scores of SISSR to (3.33, 6.66, 6.66, 13.33, 20), after rounding (3, 7, 7, 13, 20). Average Normalization type may be used with the BED and WIG options.
where: Cluster_score i - rank (score) for all peaks in the cluster i no_of_peaks i - the number of peaks in cluster i max_no_of_peaks - maximal number of peaks returned by the peak finders no_of_peaks_from_PF - number of peaks returned by the currently normalizing peak finder
For example, five peaks with scores (2, 4, 4, 8, 12) were obtained from SISSR, while MACS produced 10 peaks, which was the highest number of peaks obtained from a single peak finder. After applying the rank normalization method the SISSR peak scores were normalized to 1, 4, 4, 7, 9. Rank Normalization type may only be used with BED option.
Top rank normalization
where: Cluster_score i - rank (score) for all peaks in the cluster i no_of_peaks i - the number of peaks in cluster i max_no_of_peaks - maximal number of peaks returned by the peak finders
For example, five peaks with scores (12, 8, 4, 4, 2) were obtained from SISSR, while MACS produced 10 peaks, which was the highest number of peaks obtained from a single peak finder. After applying Top Rank Normalization method the SISSR peak-scores were normalized to 10, 9, 7.5, 7.5, 6, and rounded up to 10, 9, 8, 8, 6. Top Rank Normalization may be used with BED option only.
Prior to peak-selection, overlapping regions of the peaks obtained from each peak finder are aggregated and weighted by summing up the normalized scores of the regions. Apart from the aggregated score, the number of votes for each region is kept. The number of votes is defined as the number of peak finders that called the region. Finally, the consensus peaks are selected based on one of the following methods.
In-degree Centrality Voting Mechanism
minFP peak selection method
Attempts to minimize the number of false positive peaks, at the cost of losing some true positive peaks. In this method the user defines min_rank, which is the minimum number of votes that a region has to receive in order to be identified as a putative peak. We calculate MaxP, the highest score of a peak having fewer than min_rank votes. Each peak that has a score higher than MaxP is reported as a putative binding site.
minFN peak selection method
Attempts to minimize the number of false negative peaks, at the cost of including some false positive peaks. min_rank is defined as in minFP. MinP is the lowest score of the peak having min_rank or more votes. Each peak that has a score higher than MinP is reported as a putative binding site.
With the WIG option, peaks are divided into steps, each characterized by a score. The number of scores and the step size determine the width of the peak. All step sizes are unified to the smallest value among all WIG files returned by individual peak finders. All identified peaks are collected and the overlapping regions integrated. The integrated regions are weighted with the normalized score of the overlapping peaks. Optionally, the highest weighted peaks may be selected from the list of the integrated regions by setting a cut-off value. The goal is to select the regions that are called by the majority of the peak finders and that have high scores. This leads to a set of enriched regions that are easy to visualize.
Results and discussion
In order to clearly compare the performance of single peak finders and various configurations of PFMS three different measures were applied. Firstly, we calculated Euclidian distance of the points on the Figure 3 to point (0,1) for each of the investigated TFs. Secondly, for each configuration, we calculated Average Euclidean Distance for the three TFs results (see Additional file 1: Table S1). However, in the first measure the largest impact on the overall output have the results for the most 'difficult’ TF. A ranked-based approach was applied to compensate for this factor. We ordered increasingly the Euclidean distances separately for each TF. For each configuration we summed up the ranking position from all three TFs obtaining Combined Ranking Score. The third measure implemented in order to compare the performance of the peak finders and various configurations of PFMS was the Average Normalized Distance. To calculate this measure we applied studentization on the distances for each of the TF datasets. Then, for each peak finder and PFMS configuration we calculated mean value of the measure for the three datasets. For all three measures, the lower the score, the better configuration is.
In our validation analysis, the peak finders were ran with the default settings. However, tuning settings of the peak finders for the given datasets may improve the results. Configuration of the individual peak finders may be easily modified within PFMS (for detailed information see http://bioinf.icm.uu.se/~pfms/). CisGenome, MACS and SISSR were found to perform much better than the remaining four peak finders. Not surprisingly, PFMS performed best with only the three peak finders included. The top three choices for transcription factor dataset analysis utilize only these three peak finders with the min_rank parameter set to 2. We recommend either voting or minFP with Rank Normalization or Top Rank Normalization as these three configurations outperform each of the single peak finders as well as any other configuration of PFMS. Depending on the quality measure, one of the three configurations was the optimal one. However using any of the three quality measures the above mentioned configurations make the top three choices. As predicted, the minFP option returns results with only few False Positives but quite often misses True Positives. Apparently, this leads to a very high specificity and can be useful for applications such as selecting the strongest putative TF target genes for biological validation . We recommend using the minFP option with CisGenome, MACS, SISSR and optionally with SeqSite.
In contrast, minFN has very high sensitivity, i.e. it returns most of the True Positives at the cost of including a considerable number of False Positives. This option also performs best with the four peak finders. The choice of normalization type did not seem to be crucial for the quality of the results when using minFN or minFP. Nevertheless, we do not recommend using the Normal Normalization unless one is certain that the peak scores obtained from all peak finders have a bell-shaped distribution. Otherwise, it is much safer to use Naive Quantile Normalization (e.g. quantile 0.75), Average or Rank Normalization. These methods do not require any assumptions about the distribution of the scores.
The configurations described in this section proved to be the best for transcription factor ChIP-seq datasets. Other types of ChIP-seq data such as histone modifications were not tested. Further investigation needs to be carried out and more validated datasets have to be provided to reveal the optimal settings for both PFMS itself as well as the individual peak finders. As an example, histone modifications ChIP-seq datasets are likely to have varying peak widths and shapes depending on the pattern of the modification (e.g. single or consecutive nucleosomes) and the density of the chromatin. Therefore, different approaches and options might be better for different cases.
PFMS can be used as a single interface for analyzing ChIP-seq datasets employing several peak finders simultaneously since users may choose a set of peak finders amongst the ones currently integrated in this application. In addition to the list of putative peaks identified by PFMS, the results of each peak finder may be stored in the output directory of PFMS.
We present Peak Finder Metaserver - a novel tool for finding peaks in ChIP-seq data. The tool combines the results from various widely used methods and generates consensus results. We investigated seven peak finders and identified three that perform best for transcription factor ChIP-seq datasets, i.e. CisGenome, MACS and SISSR. Applying only these three peak finders and setting the voting peak selection method and the minrank parameter to 2. To the best of our knowledge this is the best method of finding peaks in transcription factor ChIP-seq datasets. Meta-Server approach proved to be successful and PFMS with the above mentioned settings generates results of a better quality than any of the individual peak finders. Different configurations of our tool can be optimal for different types of analyses, but identification of optimal settings requires other validated datasets.
Availability and requirements
Project name: Peak Finder MetaServer
Project homepage: http://bioinf.icm.uu.se/~pfms/
Operating system(s): Linux, MacOS
Programming language: python
Other requirements: Python 2.6 or higher, GCC compiler, Perl, JRE 1.6
Any restrictions to use by non-academics: none
Chromatin immunoprecipitation sequencing
Peak Finder Metaserver
We want to thank an anonymous referee for a critical and insightful review and inspiration to create measures to compare our tool with other peak finders.
Studies supported by Uppsala University, and Foundation for Polish Science, International PhD Projects (MPD) program (MK), and a grant N301 239536 from the Polish Ministry of Science and Higher Education (JK).
- Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B: Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat Methods. 2008, 5 (7): 621-628. 10.1038/nmeth.1226.View ArticlePubMedGoogle Scholar
- Qin ZS, Yu J, Shen J, Maher CA, Hu M, Kalyana-Sundaram S, Yu J, Chinnaiyan AM: HPeak: an HMM-based algorithm for defining read-enriched regions in ChIP-Seq data. BMC Bioinformatics. 2010, 11: 369-10.1186/1471-2105-11-369.PubMed CentralView ArticlePubMedGoogle Scholar
- Fejes AP, Robertson G, Bilenky M, Varhol R, Bainbridge M, Jones SJ: FindPeaks 3.1: a tool for identifying areas of enrichment from massively parallel short-read sequencing technology. Bioinformatics. 2008, 24 (15): 1729-1730. 10.1093/bioinformatics/btn305.PubMed CentralView ArticlePubMedGoogle Scholar
- Pepke S, Wold B, Mortazavi A: Computation for ChIP-seq and RNA-seq studies. Nat Methods. 2009, 6: S22-S32. 10.1038/nmeth.1371.PubMed CentralView ArticlePubMedGoogle Scholar
- Bujnicki JM: Protein-structure prediction by recombination of fragments. Chembiochem. 2005, 7: 19-27.View ArticleGoogle Scholar
- Zhang Y, Liu T, Meyer CA, Eeckhoute J, Johnson DS, Bernstein BE, Nussbaum C, Myers RM, Brown M, Li W, et al: Model-based analysis of ChIP-Seq (MACS). Genome Biol. 2008, 9 (9): R137-10.1186/gb-2008-9-9-r137.PubMed CentralView ArticlePubMedGoogle Scholar
- Ji H, Jiang H, Ma W, Johnson DS, Myers RM, Wong WH: An integrated software system for analyzing ChIP-chip and ChIP-seq data. Nat Biotechnol. 2008, 26 (11): 1293-1300. 10.1038/nbt.1505.PubMed CentralView ArticlePubMedGoogle Scholar
- Jothi R, Cuddapah S, Barski A, Cui K, Zhao K: Genome-wide identification of in vivo protein-DNA binding sites from ChIP-Seq data. Nucleic Acids Res. 2008, 36 (16): 5221-5231. 10.1093/nar/gkn488.PubMed CentralView ArticlePubMedGoogle Scholar
- Wang X, Zhang X: Pinpointing transcription factor binding sites from ChIP-seq data with SeqSite. BMC Syst Biol. 2011, 5 (Suppl 2): S3-10.1186/1752-0509-5-S2-S3.View ArticleGoogle Scholar
- Quinlan AR, Hall IM: BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics. 2010, 26 (6): 841-842. 10.1093/bioinformatics/btq033.PubMed CentralView ArticlePubMedGoogle Scholar
- Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R, et al: The sequence alignment/map format and SAMtools. Bioinformatics. 2009, 25 (16): 2078-2079. 10.1093/bioinformatics/btp352.PubMed CentralView ArticlePubMedGoogle Scholar
- Rye MB, Sætrom P, Drabløs F: A manually curated ChIP-seq benchmark demonstrates room for improvement in current peak-finder programs. Nucleic Acids Res. 2011, 39 (4): e25-e25. 10.1093/nar/gkq1187.PubMed CentralView ArticlePubMedGoogle Scholar
- Polman JAE, Welten JE, Bosch DS, de Jonge RT, Balog J, van der Maarel SM, de Kloet ER, Datson NA: A genome-wide signature of glucocorticoid receptor binding in neuronal PC12 cells. BMC Neurosci. 2012, 13: 118-10.1186/1471-2202-13-118.PubMed CentralView ArticlePubMedGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.