The implemented meta-server collects results from several peak finders and from these extracts the final result. The peak finders currently included in PFMS are: MACS v1.3.7 [6], CisGenome v2.0 [7], SISSRs v1.4 [8], Erange v2.1 [1], SeqSite v1.0 [9], FindPeaks v3.1.9.2 [3], and HPeak v1.1 [2]. The system can be configured to include any combination of these peak finders. PFMS is implemented in a multi-threading manner. The peak finders can be run in parallel, or sequentially depending on user’s hardware.
Input
PFMS supports three input data formats: BED, BAM, and SAM. Internally, BAM and SAM formats are converted to the BED format in the initial step of the analysis in PFMS. The conversion is performed employing BEDTools [10] and samtools [11]. The application analyzes one chromosome at a time which implies extraction of the tags of the specified chromosome in a given data set. It is achieved by employment of FindPeaks split tool [3]. PFMS repeats the analysis process for all the chromosomes included in the dataset unless specified otherwise. Although the BED format is widely used, some of the included peak finders require the data set to be given in custom formats (e.g. Erange [1]). Therefore, PFMS internally converts the input data according to the required format of the peak finders.
Data processing
Seven peak identification methods are currently integrated in PFMS. They are run separately to identify lists of putative peaks. The peaks are collected, normalized and converted to either BED or WIG, as selected by the user. The BED option is more convenient for additional downstream analysis, while WIG is better for visualization purposes since it retains information about the peak shapes.
BED option
PFMS selects an integrated list of significant peaks from the combined results by any of the following methods: voting mechanism, minFP or minFN. The first method does not use any normalization, while the other two use normalized scores of peaks obtained from the individual peak finders. The normalization is carried out using one of five different normalization methods described below.
WIG option
WIG files store information about the shape of the peaks. Firstly, peaks are divided into steps and any operation on the file is in fact done at the level of steps, not the whole peaks. Therefore PFMS normalizes the peak scores using the Average or Naive Quantile method as described below. Ranking the steps or normalizing them using Normal method does not make sense from the statistical point of view, thus these two methods have not been implemented for WIG option. Finally, PFMS integrates the overlapping peaks and sums up their scores. The results are easy to visualize using genome browsers, such as UCSC Genome Browser either in WIG or BED format. An overview of the implemented procedure is shown in Figure 1.
Peak-score normalization
The peaks detected by the individual peak finders are usually scored by various enrichment measures. Due to different ranges of the scores, weighting of the selected peaks may be biased to the peaks ranked with higher scaled scores. To overcome this drawback, the output from each peak finder is normalized; five normalization methods are implemented.
Normal normalization
The program calculates the mean value and the standard deviation of the peaks scores identified by the selected peak finders. The peak scores are then transformed to have mean value of the normal_shift parameter and unit standard deviation. The remaining negative peak scores are set to 0, while they are kept in the down-stream analysis for voting. Any default value for the normal_shift parameter is set to 3 (1). The negative peak scores are considered as outliers since they are more than 3 (or the given normal_shift value) standard deviations from the mean value and can be set to 0 without any significant impact on the analysis. The fraction of such peaks for normal_shift of 3 is around 0.13%, provided the scores in the file have normal distribution. However, in real datasets, peak scores rarely exhibit normal distribution, and so few if any peak scores are set to 0. The scores are then multiplied by the maximum number of peaks taken from all peak finders and rounded up.
(1)
where: normalized_peak_score - score of the peak after Normal Normalization peak_score - original score of the peak - the mean score in the results of the currently normalized peak finder sd - standard deviation of scores in the results of the currently normalized peak finder normal_shift - the mean value of the scores after normalization max_no_of_peaks - maximal number of peaks returned by the peak finders
For example, five peaks with scores (2, 4, 4, 8, 12) were obtained from SISSR, while 10 peaks were obtained from MACS, which was the highest number of peaks from a single peak finder. The mean value was 6 and the standard deviation was 4. After applying Normal Normalization, the SISSR scores were normalized to (20, 25, 25, 35, 45). This normalization type may be used with BED option only.
Naive quantile normalization
The user defines which quantile is to be used for normalization (e.g. quantile 0.75.) The peak scores are sorted within the files with original scores from selected peak finders and the specified quantile is selected. Then, each peak score from the currently normalized peak finder is divided by the value of the selected quantile and, as in the previous method, multiplied by the maximum number of peaks in the results of the peak finders and rounded up (2).
(2)
where: normalized_peak_score - score of the peak after Naive Quantile Normalization peak_score - original score of the peak quantile
n
- the n-th quantile of the peak scores from the currently normalized peak finder max_no_of_peaks - maximal number of peaks returned by the peak finders
As an example, SISSR returned five peaks with scores (2, 4, 4, 8, 12), while 10 peaks were obtained from MACS, which was the highest number of peaks from a single peak finder. By using quantile 0.8, the peak scores of SISSR were normalized to (2.5, 5, 5, 10, 15), and rounded up to (3, 5, 5, 10, 15). Naive Quantile Normalization type may be used with the BED and WIG options.
Average normalization
This normalization is very similar to the Naive Quantile Normalization. The only difference is that instead of selecting a certain quantile, the mean value of all the scores is used. Average Normalization is performed according to 3.
(3)
where: normalized_peak_score - score of the peak after Average Normalization peak_score - original score of the peak - mean peak score in the currently normalized peak finder results max_no_of_peaks - maximal number of peaks returned by the peak finders
As an example, SISSR returned five scores (2, 4, 4, 8, 12), while 10 peaks were obtained from MACS, which was the highest number of peaks from a single peak finder. The mean value of SISSR peak scores, which was 6, was used to normalize the peak scores of SISSR to (3.33, 6.66, 6.66, 13.33, 20), after rounding (3, 7, 7, 13, 20). Average Normalization type may be used with the BED and WIG options.
Rank normalization
After sorting the lists of the peaks detected by the peak finders, the peaks are clustered using their scores, so that all peaks with the same score are in the same cluster. Next, each cluster gets a rank computed according to the following equation (4).
(4)
where: Cluster_score
i
- rank (score) for all peaks in the cluster i no_of_peaks
i
- the number of peaks in cluster i max_no_of_peaks - maximal number of peaks returned by the peak finders no_of_peaks_from_PF - number of peaks returned by the currently normalizing peak finder
For example, five peaks with scores (2, 4, 4, 8, 12) were obtained from SISSR, while MACS produced 10 peaks, which was the highest number of peaks obtained from a single peak finder. After applying the rank normalization method the SISSR peak scores were normalized to 1, 4, 4, 7, 9. Rank Normalization type may only be used with BED option.
Top rank normalization
In contrast to the previous normalization methods, Top Rank Normalization assumes that the methods that return fewer peaks also return a smaller fraction of False Positives. In this method, similarly to Rank Normalization, after sorting the lists of the peaks detected by the peak finders, the peaks are clustered using their scores, so that all peaks with the same score are in the same cluster. The clusters are sorted starting with the highest-scoring cluster and ending with the lowest-scoring one. Next, each cluster gets a rank computed according to the following equation (5).
(5)
where: Cluster_score
i
- rank (score) for all peaks in the cluster i no_of_peaks
i
- the number of peaks in cluster i max_no_of_peaks - maximal number of peaks returned by the peak finders
For example, five peaks with scores (12, 8, 4, 4, 2) were obtained from SISSR, while MACS produced 10 peaks, which was the highest number of peaks obtained from a single peak finder. After applying Top Rank Normalization method the SISSR peak-scores were normalized to 10, 9, 7.5, 7.5, 6, and rounded up to 10, 9, 8, 8, 6. Top Rank Normalization may be used with BED option only.
Result generation
BED option
Prior to peak-selection, overlapping regions of the peaks obtained from each peak finder are aggregated and weighted by summing up the normalized scores of the regions. Apart from the aggregated score, the number of votes for each region is kept. The number of votes is defined as the number of peak finders that called the region. Finally, the consensus peaks are selected based on one of the following methods.
In-degree Centrality Voting Mechanism
Is our simplest mechanism of peak selection. All the aggregated regions are considered as candidate sites. A putative binding site is reported if the number of votes for a region is higher than or equal to a predefined threshold (min_rank). Note, that in this method the scores generated by the peak finders are not utilized. For example, given a set of regions 1A: [0 - 7], 1B: [9 - 15], 2: [5 - 25], and 3: [10 - 20] generated by PF 1, PF1, PF2 and PF3, respectively, with the threshold value 2, the regions [5 - 7] and [9 - 20] are reported as binding sites. If the threshold was set to 3, the output region would be [10 - 15] (Figure 2).
minFP peak selection method
Attempts to minimize the number of false positive peaks, at the cost of losing some true positive peaks. In this method the user defines min_rank, which is the minimum number of votes that a region has to receive in order to be identified as a putative peak. We calculate MaxP, the highest score of a peak having fewer than min_rank votes. Each peak that has a score higher than MaxP is reported as a putative binding site.
minFN peak selection method
Attempts to minimize the number of false negative peaks, at the cost of including some false positive peaks. min_rank is defined as in minFP. MinP is the lowest score of the peak having min_rank or more votes. Each peak that has a score higher than MinP is reported as a putative binding site.
WIG option
With the WIG option, peaks are divided into steps, each characterized by a score. The number of scores and the step size determine the width of the peak. All step sizes are unified to the smallest value among all WIG files returned by individual peak finders. All identified peaks are collected and the overlapping regions integrated. The integrated regions are weighted with the normalized score of the overlapping peaks. Optionally, the highest weighted peaks may be selected from the list of the integrated regions by setting a cut-off value. The goal is to select the regions that are called by the majority of the peak finders and that have high scores. This leads to a set of enriched regions that are easy to visualize.