The implemented meta-server collects results from several peak finders and from these extracts the final result. The peak finders currently included in PFMS are: MACS v1.3.7 [6], CisGenome v2.0 [7], SISSRs v1.4 [8], Erange v2.1 [1], SeqSite v1.0 [9], FindPeaks v3.1.9.2 [3], and HPeak v1.1 [2]. The system can be configured to include any combination of these peak finders. PFMS is implemented in a multi-threading manner. The peak finders can be run in parallel, or sequentially depending on user’s hardware.

### Input

PFMS supports three input data formats: BED, BAM, and SAM. Internally, BAM and SAM formats are converted to the BED format in the initial step of the analysis in PFMS. The conversion is performed employing BEDTools [10] and samtools [11]. The application analyzes one chromosome at a time which implies extraction of the tags of the specified chromosome in a given data set. It is achieved by employment of FindPeaks split tool [3]. PFMS repeats the analysis process for all the chromosomes included in the dataset unless specified otherwise. Although the BED format is widely used, some of the included peak finders require the data set to be given in custom formats (e.g. Erange [1]). Therefore, PFMS internally converts the input data according to the required format of the peak finders.

### Data processing

Seven peak identification methods are currently integrated in PFMS. They are run separately to identify lists of putative peaks. The peaks are collected, normalized and converted to either BED or WIG, as selected by the user. The BED option is more convenient for additional downstream analysis, while WIG is better for visualization purposes since it retains information about the peak shapes.

#### BED option

PFMS selects an integrated list of significant peaks from the combined results by any of the following methods: voting mechanism, minFP or minFN. The first method does not use any normalization, while the other two use normalized scores of peaks obtained from the individual peak finders. The normalization is carried out using one of five different normalization methods described below.

#### WIG option

WIG files store information about the shape of the peaks. Firstly, peaks are divided into steps and any operation on the file is in fact done at the level of steps, not the whole peaks. Therefore PFMS normalizes the peak scores using the Average or Naive Quantile method as described below. Ranking the steps or normalizing them using Normal method does not make sense from the statistical point of view, thus these two methods have not been implemented for WIG option. Finally, PFMS integrates the overlapping peaks and sums up their scores. The results are easy to visualize using genome browsers, such as UCSC Genome Browser either in WIG or BED format. An overview of the implemented procedure is shown in Figure 1.

### Peak-score normalization

The peaks detected by the individual peak finders are usually scored by various enrichment measures. Due to different ranges of the scores, weighting of the selected peaks may be biased to the peaks ranked with higher scaled scores. To overcome this drawback, the output from each peak finder is normalized; five normalization methods are implemented.

#### Normal normalization

The program calculates the mean value and the standard deviation of the peaks scores identified by the selected peak finders. The peak scores are then transformed to have mean value of the *normal_shift* parameter and unit standard deviation. The remaining negative peak scores are set to 0, while they are kept in the down-stream analysis for voting. Any default value for the *normal_shift* parameter is set to 3 (1). The negative peak scores are considered as outliers since they are more than 3 (or the given *normal_shift* value) standard deviations from the mean value and can be set to 0 without any significant impact on the analysis. The fraction of such peaks for *normal_shift* of 3 is around 0.13%, provided the scores in the file have normal distribution. However, in real datasets, peak scores rarely exhibit normal distribution, and so few if any peak scores are set to 0. The scores are then multiplied by the maximum number of peaks taken from all peak finders and rounded up.

\begin{array}{ll}\hfill \mathit{\text{normalized\_peak\_score}}=& \left[\frac{\mathit{\text{peak\_score}}-\overline{\mathit{\text{peak\_score}}}}{\mathit{\text{sd}}}\right.\\ \left(\right)close="]">+\mathit{\text{normal\_shift}}\phantom{\rule{0.3em}{0ex}}& \ast \mathit{\text{max\_no\_of\_peaks}}\end{array}\n

(1)

where: **normalized_peak_score** **-** score of the peak after Normal Normalization **peak_score** **-** original score of the peak \overline{\mathit{\text{peak\_score}}}**-** the mean score in the results of the currently normalized peak finder **sd** **-** standard deviation of scores in the results of the currently normalized peak finder **normal_shift** **-** the mean value of the scores after normalization **max_no_of_peaks** **-** maximal number of peaks returned by the peak finders

For example, five peaks with scores (2, 4, 4, 8, 12) were obtained from SISSR, while 10 peaks were obtained from MACS, which was the highest number of peaks from a single peak finder. The mean value was 6 and the standard deviation was 4. After applying Normal Normalization, the SISSR scores were normalized to (20, 25, 25, 35, 45). This normalization type may be used with BED option only.

#### Naive quantile normalization

The user defines which quantile is to be used for normalization (e.g. quantile 0.75.) The peak scores are sorted within the files with original scores from selected peak finders and the specified quantile is selected. Then, each peak score from the currently normalized peak finder is divided by the value of the selected quantile and, as in the previous method, multiplied by the maximum number of peaks in the results of the peak finders and rounded up (2).

\begin{array}{ll}\hfill \mathit{\text{normalized\_peak\_score}}=& \frac{\mathit{\text{peak\_score}}}{\mathit{\text{quantil}}{e}_{n}}\\ \ast \mathit{\text{max\_number\_of\_peaks}}\end{array}

(2)

where: **normalized_peak_score -** score of the peak after Naive Quantile Normalization **peak_score** **-** original score of the peak **quantile**_{
n
}**-** the n-th quantile of the peak scores from the currently normalized peak finder **max_no_of_peaks** **-** maximal number of peaks returned by the peak finders

As an example, SISSR returned five peaks with scores (2, 4, 4, 8, 12), while 10 peaks were obtained from MACS, which was the highest number of peaks from a single peak finder. By using quantile 0.8, the peak scores of SISSR were normalized to (2.5, 5, 5, 10, 15), and rounded up to (3, 5, 5, 10, 15). Naive Quantile Normalization type may be used with the BED and WIG options.

#### Average normalization

This normalization is very similar to the Naive Quantile Normalization. The only difference is that instead of selecting a certain quantile, the mean value of all the scores is used. Average Normalization is performed according to 3.

\begin{array}{ll}\hfill \mathit{\text{normalized\_peak\_score}}=& \frac{\mathit{\text{peak\_score}}}{\overline{\mathit{\text{peak\_score}}}}\\ \ast \mathit{\text{max\_number\_of\_peaks}}\end{array}

(3)

where: **normalized_peak_score -** score of the peak after Average Normalization **peak_score -** original score of the peak \overline{\mathit{\text{peak\_score}}}**-** mean peak score in the currently normalized peak finder results **max_no_of_peaks** **-** maximal number of peaks returned by the peak finders

As an example, SISSR returned five scores (2, 4, 4, 8, 12), while 10 peaks were obtained from MACS, which was the highest number of peaks from a single peak finder. The mean value of SISSR peak scores, which was 6, was used to normalize the peak scores of SISSR to (3.33, 6.66, 6.66, 13.33, 20), after rounding (3, 7, 7, 13, 20). Average Normalization type may be used with the BED and WIG options.

#### Rank normalization

After sorting the lists of the peaks detected by the peak finders, the peaks are clustered using their scores, so that all peaks with the same score are in the same cluster. Next, each cluster gets a rank computed according to the following equation (4).

\begin{array}{ll}\hfill \mathit{\text{Cluster\_scor}}{e}_{i}=& \left[\frac{\mathit{\text{no\_of\_peak}}{s}_{i}}{2}+\sum _{k=0}^{i-1}\mathit{\text{no\_of\_peak}}{s}_{k}\right]\\ \ast \frac{\mathit{\text{max\_no\_of\_peaks}}}{\mathit{\text{no\_of\_peaks\_from\_PF}}}\end{array}

(4)

where: **Cluster_score**_{
i
} **-** rank (score) for all peaks in the cluster *i* **no_of_peaks**_{
i
} **-** the number of peaks in cluster *i* **max_no_of_peaks -** maximal number of peaks returned by the peak finders **no_of_peaks_from_PF -** number of peaks returned by the currently normalizing peak finder

For example, five peaks with scores (2, 4, 4, 8, 12) were obtained from SISSR, while MACS produced 10 peaks, which was the highest number of peaks obtained from a single peak finder. After applying the rank normalization method the SISSR peak scores were normalized to 1, 4, 4, 7, 9. Rank Normalization type may only be used with BED option.

#### Top rank normalization

In contrast to the previous normalization methods, Top Rank Normalization assumes that the methods that return fewer peaks also return a smaller fraction of False Positives. In this method, similarly to Rank Normalization, after sorting the lists of the peaks detected by the peak finders, the peaks are clustered using their scores, so that all peaks with the same score are in the same cluster. The clusters are sorted starting with the highest-scoring cluster and ending with the lowest-scoring one. Next, each cluster gets a rank computed according to the following equation (5).

\begin{array}{ll}\hfill \mathit{\text{Cluster\_scor}}{e}_{i}=& \mathit{\text{max\_no\_of\_peaks}}\\ -\left[\frac{\mathit{\text{no\_of\_peak}}{s}_{i}-1}{2}+\sum _{k=0}^{i-1}\mathit{\text{no\_of\_peak}}{s}_{k}\right]\end{array}

(5)

where: **Cluster_score**_{
i
}**-** rank (score) for all peaks in the cluster *i* **no_of_peaks**_{
i
}**-** the number of peaks in cluster *i* **max_no_of_peaks** **-** maximal number of peaks returned by the peak finders

For example, five peaks with scores (12, 8, 4, 4, 2) were obtained from SISSR, while MACS produced 10 peaks, which was the highest number of peaks obtained from a single peak finder. After applying Top Rank Normalization method the SISSR peak-scores were normalized to 10, 9, 7.5, 7.5, 6, and rounded up to 10, 9, 8, 8, 6. Top Rank Normalization may be used with BED option only.

### Result generation

#### BED option

Prior to peak-selection, overlapping regions of the peaks obtained from each peak finder are aggregated and weighted by summing up the normalized scores of the regions. Apart from the aggregated score, the number of votes for each region is kept. The number of votes is defined as the number of peak finders that called the region. Finally, the consensus peaks are selected based on one of the following methods.

##### In-degree Centrality Voting Mechanism

Is our simplest mechanism of peak selection. All the aggregated regions are considered as candidate sites. A putative binding site is reported if the number of votes for a region is higher than or equal to a predefined threshold (*min_rank*). Note, that in this method the scores generated by the peak finders are not utilized. For example, given a set of regions 1A: [0 - 7], 1B: [9 - 15], 2: [5 - 25], and 3: [10 - 20] generated by PF 1, PF1, PF2 and PF3, respectively, with the threshold value 2, the regions [5 - 7] and [9 - 20] are reported as binding sites. If the threshold was set to 3, the output region would be [10 - 15] (Figure 2).

##### minFP peak selection method

Attempts to minimize the number of false positive peaks, at the cost of losing some true positive peaks. In this method the user defines *min_rank*, which is the minimum number of votes that a region has to receive in order to be identified as a putative peak. We calculate *MaxP*, the highest score of a peak having fewer than *min_rank* votes. Each peak that has a score higher than *MaxP* is reported as a putative binding site.

##### minFN peak selection method

Attempts to minimize the number of false negative peaks, at the cost of including some false positive peaks. *min_rank* is defined as in *minFP*. *MinP* is the lowest score of the peak having *min_rank* or more votes. Each peak that has a score higher than *MinP* is reported as a putative binding site.

#### WIG option

With the WIG option, peaks are divided into steps, each characterized by a score. The number of scores and the step size determine the width of the peak. All step sizes are unified to the smallest value among all WIG files returned by individual peak finders. All identified peaks are collected and the overlapping regions integrated. The integrated regions are weighted with the normalized score of the overlapping peaks. Optionally, the highest weighted peaks may be selected from the list of the integrated regions by setting a cut-off value. The goal is to select the regions that are called by the majority of the peak finders and that have high scores. This leads to a set of enriched regions that are easy to visualize.