Identifying peaks in *-seq data using shape information
© Strino and Lappe. 2016
Published: 6 June 2016
Peak calling is a fundamental step in the analysis of data generated by ChIP-seq or similar techniques to acquire epigenetics information. Current peak callers are often hard to parameterise and may therefore be difficult to use for non-bioinformaticians. In this paper, we present the ChIP-seq analysis tool available in CLC Genomics Workbench and CLC Genomics Server (version 7.5 and up), a user-friendly peak-caller designed to be not specific to a particular *-seq protocol.
We illustrate the advantages of a shape-based approach and describe the algorithmic principles underlying the implementation. Thanks to the generality of the idea and the fact the algorithm is able to learn the peak shape from the data, the implementation requires only minimal user input, while still being applicable to a range of *-seq protocols. Using independently validated benchmark datasets, we compare our implementation to other state-of-the-art algorithms explicitly designed to analyse ChIP-seq data and provide an evaluation in terms of receiver-operator characteristic (ROC) plots. In order to show the applicability of the method to similar *-seq protocols, we also investigate algorithmic performances on DNase-seq data.
The results show that CLC shape-based peak caller ranks well among popular state-of-the-art peak callers while providing flexibility and ease-of-use.
KeywordsPeak calling ChIP-seq DNase-seq Hotelling observer CLC shape-based peak caller
In order to identify functional elements in a genome, a number of experimental high-throughput techniques have been developed for investigating specific interactions between proteins and DNA. These protocols provide us with a deeper understanding of gene-regulatory and epigenetic mechanisms by identifying, for example, Transcription-Factor Binding Sites (TFBS), open chromatin regions or the location of epigenetic marks.
In broad terms, these techniques chemically cross-link proteins to those stretches of DNA they are bound to in vivo. After shearing the DNA, a protein of interest is extracted along with the cross-linked DNA fragments from the cell-lysate using specific antibodies. Following this Chromatin Immuno-Precipitation (ChIP) step, the short stretches of DNA attached to the protein of interest are identified by high-throughput sequencing.
For any targeted protein and a given cell-line or condition, this results in several million reads of raw sequencing data. Usually a control experiment is performed where the immuno-precipitation step is left out or an antibody that is not specifically binding to the target genome is used. For example, the ENCODE Project (ENCyclopedia Of DNA Elements) has produced data on hundreds of regulatory factors (see http://encodeproject.org/) in mouse and human. For more in-depth information we recommend the “ChIP-seq guidelines and practices of the ENCODE and modENCODE consortia” , .
Furthermore, there are already numerous experimental protocols related to ChIP-seq available and new protocols are published all the time. To name a few prominent examples, ChIP-exo  is a derivative of ChIP-seq where exonucleases are used to identify the genomic location of DNA-protein binding-sites with higher resolution. DNase-seq (DNase I hypersensitive site sequencing , ), ATAC-seq Assay for Transposase-Accessible Chromatin with highthroughput sequencing ) and FAIRE-seq (Formaldehyde-Assisted Isolation of Regulatory Elements sequencing ) are used to identify accessible regions in the genome, and MNase-seq (Micrococcal Nuclease sequencing ) is used to identify nucleosome positioning. Although each experimental technique uses different procedures for fragmentation and enrichment , the computational processing in terms of mapping the sequencing data and analysing the resulting signal in genomic context is similar to processing ChIP-seq data. Hence it is unsurprsing to see pipelines developed for ChIP-seq analysis routinely being applied to data produced with other protocols. Therefore, we initially discuss the analysis of ChIP-seq data and later investigate how algorithms developed for ChIP-seq perform on DNase-seq data.
ChIP-seq experiments are also increasingly used to investigate histone modifications. In contrast to transcription factors, most histone marks are of variable length and can span across entire gene bodies. Although the experimental procedures are similar, the resulting data needs to be treated accordingly. Some existing approaches such as HOMER stitch together narrow peaks to avoid the computational cost of finding regions of variable length. Since histone marks tend to be associated with genes, we opted for the use of existing annotations to classify genes according to peak shape as a practical trade-off between computational complexity and biological sensitivity.
State of the art
The initial step for all downstream computational analyses of ChIP-seq data starts by mapping the reads to a reference genome. Obviously, the quality of the read mapping has an impact on the downstream analysis results. However, details of the mapping process are beyond the scope of this paper and we will assume that an accurate read mapping is provided. For the performance comparison, all peak callers used the same read mappings as input.
Many different approaches to peak calling have been developed. The density of the ChIP signal can be analysed directly (Findpeaks ) or compared to a control signal (CCAT , CisGenome , Erange , PeakSeq ). Signal processing approaches including Gaussian Kernel Density Estimation (FSeq , QuEST ), Hotelling filters (DFilter ) and wavelets  have also been applied to this kind of signal. Many statistical approaches such as Poisson distributions (SISSRS , HOMER , MACS ), negative binomial distributions (ZINBA ), rank-statistics (W-ChIPeaks ), and Cramér-von Mises test (Qeseq ) have been used. Other approaches include clustering approach (SICER ), Hidden Markov Models (ChIPDiff , RSEG ), tree shape statistics (TPIC ), shape recognition (Triform ) and probabilistic models (SignalSpider ).
From a computational viewpoint, the main lessons learned from the current generation of peak calling algorithms can be summarised as follows: The success of peak calling depends on how well the statistical model of the input signal can be fitted to the data under consideration. In this context, parameterising a peak caller can be seen as tweaking its intrinsic model to improve the fit to the data. However, this requires in-depth knowledge of the underlying algorithm and statistical model and a good grasp of how the behaviour is affected by the parameters. Therefore the fine-tuning of parameters remains a “black art” to most biologists who want to analyse the results of their ChIP-seq experiments. Since parameter optimisation is a hard and time-consuming task, it is recommended  to focus on improvements in experimental design in order to obtain better input data rather than attempting to optimise the downstream bioinformatics pipeline.
In recent literature, it has been observed that current peak callers may miss peaks that are clearly visible to the human eye. More precisely the peaks exhibit distinct shapes which act as visual cues. However, as stated by Rye et al. , “peak shape information is not fully exploited in the evaluated programs”. The gap between peaks apparent to expert users and what is recognised algorithmically has been observed time and again in recent literature .
One approach to improve peak calling is to take more of the observed specific characteristics of existing ChIP-seq datasets into account and encode them directly into the algorithm (see e.g. ). However, hard-coding more specific models into the heuristic is not a sustainable path for future developments in the long run, as this procedure may be overfitting certain kinds of datasets. Specially given the increasing variety of ChIP-seq related experimental protocols, this approach would ultimately lead to a similar variety of specialised heuristics. This variety and specialisation will make it increasingly difficult to update, test, and release algorithms while leaving users confused as to which tool and version is optimally suited to their data at hand.
Nevertheless, novel peak-shape recognition algorithms have been shown to outperform existing heuristics, identifying peaks that were missed previously , . Generally speaking, in contrast to a hard-coded internal model these approaches “learn” the peak-shape from the underlying data. In the “learning-phase” an initial set of positive examples is identified, i.e. regions that unambiguously contain peaks. This phase is shared with other methodologies such as MACS and DFilter. From this initial set, a computational representation of the specific peak-shape is constructed. This representation is then applied to the entire dataset to perform the automated peak calling, based on a suitable statistical framework.
Aims for a next-generation peak caller
Reflecting on the recent developments in the field combined with the practical lessons learned from a number of different current peak calling algorithms and the potential advances offered by the aforementioned shape-based approaches, we formulate the main requirements for a new peak calling toolset as follows:
Generality In order to keep up with the steadily rising number of experimental protocols that require peak calling for data-analysis, the algorithmic engine has to be general enough to be applicable across datasets from many different *-seq technologies (i.e. ChIP-seq, DNase-seq, etc.). The toolset should be swiftly adaptable to new datasets as they become available without expensive recoding efforts.
Specificity For achieving optimal results, the specific characteristics of the peaks need to be recognised by the algorithm. The optimisation and parameterisation for the task at hand should not sacrifice the generality of the underlying algorithmic implementation.
Robustness Rather than inventing ad-hoc scoring schemes, the algorithms need to be built on a mathematically and statistically well-founded framework. In particular, we employ methodologies from digital signal processing and machine learning, which have been extensively studied and are deeply understood.
Simplicity Despite the algorithmic and statistical complexity of the data-analysis task, the implementation needs to be suitable for a general audience. This translates into minimising or even eliminating the need for parameterisation and automating the most common tasks. At the same time, the algorithm needs to be transparent about its results and intuitive to use, such that advanced users can adopt the tools easily to their needs.
Some of the aims formulated above may seem to be contradictory or even mutually exclusive to each other at first sight. However, peak calling constitutes a special case of signal detection algorithms that “can be solved by adapting ‘uniform’ and ‘formally optimal’ techniques from the signal processing literature” . In addition to being general such that signals of arbitrary shape can be processed, it is based upon a well studied framework from signal processing theory. The specific shape of the signal is learned from a number of “positive” and “negative” regions (or noise) where the signal is absent or is the result of a sequencing artefact. The resulting shape of the signal minus the noise is encoded in a vector (the so-called Hotelling-observer, named after the mathematical statistician Harold Hotelling ), which is then evaluated against the data-stream. This resulting filter contains the information needed for peak detection and can easily be transferred and applied to other datasets as well. This approach lends itself to visual parameterisation by example such that advanced users could define the set of regions from which the filter is constructed, making it transparent as to what pattern the algorithm is detecting. Examples of approaches along this rationale are described in , , . At the same time, the underlying implementation remains independent of the peak-shape it is detecting - analogous to text-search algorithms being independent of the text pattern or regular expressions.
In order to build specific peak-shape filters without extensive manual annotation of positive and negative regions, we take into account that the vast majority of peak callers hardly disagree about top-scoring peaks . The differences in performance become apparent only for less obvious peaks; it is in this “grey zone” where the fit between the data and the intrinsic statistical model of the algorithms decides about their relative performance. This observation suggests a strategy for boot-strapping the shape-based approach: There are sufficient positive regions that can be safely and unambiguously identified by any of the currently available methods in a first pass. A more specific model of the peak-shape is then inferred from these clear-cut examples, allowing the algorithm to tune itself automatically to the data at hand. A second peak-detection step is performed, resulting in a much more sensitive peak-detection overall.
Results and discussion
Signal detection using a Hotelling-filter
The CLC shape-based peak caller makes use of both the characteristic peak shape and enriched read coverage to identify peaks in *-seq data. Next, we will outline the individual steps of the entire ChIP-seq peak calling pipeline, starting from quality control and normalisation of the data, describing the fundamentals of using a filter, learning a characteristic peak shape from highly enriched regions and calling peak regions including boundary refinement. Finally, we describe how these steps are working together to result in a highly automated peak detection pipeline that provides near optimal results without the need for extensive parameterisation by the user.
Quality control of ChIP-seq data
The cross-correlation function typically has a maximum when the value of the strand-shift is close to the length of the DNA fragment being sequenced. This is indicated in green in Fig. 4. This peak is a characteristic feature of a ChIP-seq experiment. One can expect a pronounced peak around the fragment length because the frame shift between reads mapping to the forward and to the reverse strand near a typical transcription factor binding site (Fig. 2) is on average equal to the fragment length , , . Therefore, the (relative) height of this peak can be considered a proxy for the quality of the ChIP-seq experiment. Finally, the location of this peak can be used to estimate the average length of the DNA fragments after the fragmentation step (e.g. sonication or MNase digestion).
The CLC shape-based peak caller analyses the genomic coverage of the reads. For each read mapping, the 5’ position of the reads mapping to the forward strand and the 3’ position of the reads mapping to the reverse strand are extracted. For each genomic position x, we define f(x) as the number of reads mapping to the forward strand where x is the 5’ position and r(x) as the number of reads mapping to the reverse strand where x is the 3’ position. A quantile standardisation is then applied to f(x) and r(x) such that the normalised coverage functions f ′ and r ′ follow a standard normal distribution, i.e. .
Discovering obvious peaks
where ⋆ denotes the cross-correlation operator. The cross-correlation between a function and a filter can be described as follows: For each genomic position x, we extract the genomic coverage profile of a window centred at x. We multiply this profile by the peak shape filter and we sum the result. The resulting number indicates how well the shape of the filter is matched. The score will reach a maximum at the centre of a peak. Peaks are then identified as the regions whose centres are the genomic positions with highest score and whose size is the size of the filter.
Similarly, the set of negative examples is identified by running a Gaussian filter. If control data is available, the negative examples are identified as regions where the genomic coverage in the control dataset is higher than the one in the ChIP dataset. However, if there is no information to build a negative profile from, the negative profile is estimated from the sequencing noise.
Learning the peak shape
where d i is the Mahalanobis distance between the region i and its reference group, Φ −1 indicates the quantile of the standard normal distribution, m.a.d. indicates the median absolute deviation, and the term is a robust estimate of the standard deviation under the assumption of normal distribution .
Once the positive and negative regions have been identified, the CLC shape-based peak caller learns a filter that matches the average peak shape, which we term Peak Shape Filter. The filter implemented is called Hotelling Observer  and was chosen because it is the matched filter that maximises the Area Under the Curve of the Receiver Operator Characteristic (AUC-ROC), one of the most widely used measures for algorithmic performance.
Even though the shape of the Hotelling Observer is typically similar to the average profiles (Fig. 3), it is in fact modelling the shape that is maximally discriminative between positive and negative example and is therefore more similar to the difference between positive and negative examples.
Peak shape score
The Peak Shape Filter is applied to the experimental data and a score is calculated at each genomic position (Fig. 7). The score is obtained by extracting the genomic coverage profile of a window centred at the genomic position and then comparing this profile to the Peak Shape Filter. The result of this comparison defines the Peak Shape Score. The Peak Shape Score is standardised and follows a standard normal distribution, so a p-value for each genomic position can be calculated as p-value=Φ(−Peak Shape Score of the peak centre), where Φ is the standard normal cumulative distribution function.
The CLC shape-based peak caller
The CLC shape-based peak caller implements all the steps previously described in a single and easy-to-use algorithm. It is available as part of the CLC Genomics Workbench from version 7.5 and up. The input to the algorithm is mapped reads for ChIP-seq and genomic controls. The only parameter is a p-value threshold. We recommend a value of 0.05 for a more conservative peak-calling result, while a a more permissive value of 0.1 identifies also less pronounced peaks. The results of the algorithm are:
QC Report The QC report contains metrics about the quality of the ChIP-seq experiment. It lists the number of mapped reads, the normalised strand coefficient, and the relative strand correlation for each mapping. Furthermore, the QC report shows the mean read length, the inferred fragment length, and the window size used to model the signal shape. In case the input contains paired-end reads, the report will also contain the empirical fragment length distribution.
Peak Shape Filter the Hotelling Observer filter that was learned by the CLC shape-based peak caller.
Peak Shape Score The peak shape score value for every genomic position.
Peaks the list of all called peaks.
There is a number of ChIP-seq peak calling packages available, each with slightly different implementation-dependent strengths and weaknesses. See for example the extensive comparison conducted by Wilbanks and Facciotti . For clarity, in this paper we limit the comparative analysis of the CLC shape-based peak caller to three of the most popular state-of-the-art peak callers: the seqpeak tool included in the CisGenome software collection , the findPeaks tool included in the HOMER software collection  (http://homer.salk.edu/homer/ngs/index.html), and the MACS software , , . These algorithms consistently rank among the top performing implementations  and will serve as reference points.
In this section, we present benchmark results from calling peaks in experiments targeted at identifying transcription factor binding sites (TFBS) from ChIP-seq data and at identifying accessible genomic regions from DNase-seq data.
ChIP-seq benchmark datasets often provide control data together with data from specific experiments (see  for guidelines on how to construct control samples). We will refer to the experiment samples as ChIP sample and to the control as control sample. In this paper, we use the data published by  as our gold-standard of truth. This dataset is based on expert curation of several hundred regions, each manually classified as either positive, negative, or ambiguous. The classification is done for three different ChIP-seq experiments, namely for the transcription factors MAX, NRSF, and SRF. In contrast to synthetic “spike-in” data generated by some authors, the use of manually annotated real-world data has the advantage of a blind experiment in the sense that the gold-standard classification is not produced by the very same persons developing the peak calling algorithm. The ChIP-seq datasets are:
MAX Published by Michael Snyder’s lab at Yale University and generated from cell-line K562 targeting Myc-Associated factor X.
NRSF Published by the Myers Lab at the HudsonAlpha Institute for Biotechnology and generated from cell-line Gm12878 targeting the Neural Restrictive Silencer Factor (NRSF or REST) transcription factor.
SRF Published by the Myers Lab at the HudsonAlpha Institute for Biotechnology and generated from cell-line Gm12878 targeting the Serum Response Factor (SRF) transcription factor.
To show the generality of our method to similar sequencing data, we also investigate the performances of the peak callers on DNase-seq data. The signal produced by DNase-seq data is similar to the one produced by ChIP-seq data, so algorithms developed for ChIP-seq are commonly used to analyse DNase-seq data . However, there are two main experimental protocols for DNase-seq, namely the “end-capture”  and the “double-hit”  protocols. Since the signal produced by the double-hit protocol  is very similar to the signal produced by a ChIP-seq experiment and peak callers are typically run with the same parameters used for the ChIP-seq data. On the other hand, the analysis of DNase-seq data obtained using the end-capture protocol requires specialised parameters in HOMER and MACS, as suggested in their user manuals (see Methods). Those parameter choices reflect the fact that the frame shift between reads mapping to the forward and reverse strand is not equal to the average fragment length as in ChIP-seq data, but is typically close to zero. We note that CisGenome was developed exclusively for analysing ChIP-seq data, so we used the default parameters for DNase-seq data. In this paper, we investigate two “end-capture” DNase-seq datasets:
K562 Published by the Duke University and generated from cell-line K562 using the “end-capture” protocol.
Gm12878 Published by the Duke University and generated from cell-line Gm12878 using the “end-capture” protocol.
The performances of the peak callers in these two datasets were computed using benchmark datasets of genomic regions independently validated using microarray data (see Methods).
Reads in benchmark datasets used
Experiment Rep. 1
Experiment Rep. 2
Experiment Rep. 3
Experiment Rep. 4
Experiment Rep. 5
Control Rep. 1
Control Rep. 2
Number of positive and negative peaks in each of the gold-standard datasets
Calculating performance metrics
Running CLC shape-based peak caller with different input datasets
Area under the ROC curve values for different input settings
Rep. 1 and Rep. 2
Rep. 1 and Rep. 2 a
Rep. 2 a
Rep. 1 and Rep. 2
The results in Table 3 show small differences between the performances obtained by using different choices of input, suggesting that the performances of the CLC shape-based peak caller do not degrade significantly when fewer data are available and even when no control data are available. As expected, running the algorithm using all available data consistently gave top performances, indicating that there is no need to perform pre- or post-processing steps when analysing datasets where replicates are available.
We ran all peak callers (CLC shape-based peak caller, CisGenome, HOMER, and MACS) on the five datasets (Table 1). All tools were run initially with their respective default parameters, as this is how the software is used mostly. However, since all the peak callers by default do not output ambiguous peaks, many ROC curves would plateau and result in an unfairly low AUC-ROC value. Therefore, we relaxed parameters related to filtering of non-significant peaks, while leaving the other parameters unchanged. For CisGenome, we relaxed the cutoff for defining peak boundaries from 3 to 2.5; for HOMER, we relaxed the fold enrichment over input tag threshold from 4.0 to 2.0, the fold enrichment over local tag count threshold from 4.0 to 2.0, and removed the filtering step based on expected unique tag positions; for MACS we relaxed the q-value threshold from the default value of 0.05 to 0.25; and for the CLC shape-based peak caller, we relaxed the p-value threshold from 0.05 to 0.25 (see Methods). In all cases, the resulting AUC-ROC values obtained with these relaxed parameters were greater or equal to the AUC-ROC obtained using default parameters. We note that it was not feasible to remove the filtering steps altogether or to further relax the thresholds, because the resulting AUC-ROC values for CisGenome, MACS, and HOMER degraded in at least one dataset.
Area under the ROC curve values
CLC Peak Shape Score
CLC Peak Caller
The first consideration is that HOMER makes its first mistake after reaching a true positive rate of 0.67, significantly later than the other algorithms. However, the performances of HOMER decline for more ambiguous peaks, resulting in a smaller AUC-ROC than MACS and the CLC shape-based peak caller. On the other hand, MACS makes more mistakes in the beginning, but it is able to identify nearly all peaks, resulting in a higher AUC-ROC value. The CLC shape-based peak caller behaves similarly to HOMER at low false positive rate but is able to call more peaks. We note that in this dataset there is a difference in the ROC curves of the CLC Peak Shape Score and the CLC shape-based peak caller. This is due to the fact that many regions of the gold standard are situated near the centre of a peak. Therefore, in this context, the performance value depends significantly on the ability to identify the correct peak boundaries. For example, a negative region close to a peak region is shown in Fig. 11. In this example, the peak caller correctly identifies that there is no peak in the negative region, while the CLC Peak Shape Score approach assigns a moderate score to the negative region because it is in proximity of a peak. Conversely, a situation where a positive region is situated near a strong peak may be missed by the peak caller, which calls only the main peak and may not extend the boundaries enough. This makes the two ROC curves different, making the peak caller more appropriate for small false positive rates, whereas the Peak Shape Score approach assigns a score to every genomic position. The difference between CisGenome and CisGenome Refined is even more pronounced, as the peaks called by the default CisGenome are too large and often include nearby regions annotated as negatives sites. On the other hand, using refined peak boundaries drastically improves the performances of the algorithm for this dataset, although it misses some peaks of lower quality.
The performances of all the algorithms are very good in this dataset and all algorithms make their first mistake after a true positive rate of 0.7. The main difference between the CLC shape-based peak caller and the other two algorithms is that, although it makes a few mistakes earlier, the peak caller is able to call nearly all peaks, resulting in a higher overall AUC-ROC value. In this dataset, the performances of the CLC shape-based peak caller and the Peak Shape Score are equal, because there is no ambiguity regarding peak boundaries. Therefore, the two ROC curves completely coincide and only the curve from CLC Peak Caller (blue) is visible. The performances of the two CisGenome variants are quite similar, but CisGenome Refined misses some peaks that the default CisGenome correctly identifies.
Similarly to the NRSF dataset, most positive peaks are called by all algorithms and only a few mistakes are made, resulting in good performances for all the algorithms. MACS performs better than the other algorithms for small false positive rates, resulting in the highest AUC-ROC value. In this dataset, the difference between the peak caller and the Peak Shape Score is very small and the curves are very close to each other. The main difference is that the peak caller does not identify all peaks and plateaus slightly before reaching the top of the plot. On the other hand, the Peak Shape Score gives a score to every region in the gold standard, so it is always able to reach a true positive rate of 1.
The ROC curves look similar to the ones obtained for the ChIP-seq data, in particular we note that the AUC-ROC values are high and that all algorithms are able to identify at least 80 % of the positive peaks making only a few mistakes.
Table 4 summarises the results by the area under the ROC-curve values. MACS and the CLC shape-based peak caller have the highest values, while HOMER is often penalised because it misses several peaks and CisGenome does not often identify the correct peak boundaries. Similarly to the MAX dataset, in the GM12878 dataset, MACS performs better than the other algorithms for false positive rates between 0.1 and 0.3, while the CLC shape-based peak caller performs marginally better with low false positive rates and is able to identify more positives and thus results in a higher overall AUC-ROC. On the other hand, the ROC curve of CLC shape-based peak caller in the K562 dataset shows better performances than the other algorithms consistently for all false positive rates.
Even though it is hard to directly compare the results of the Peak Shape Score with more traditional peak calling algorithms, we observe that there are clear advantages in having a score with single nucleotide resolution, especially for calling low-quality peaks. The results for the DNase-seq show similar trends to the results of the ChIP-seq analysis.
Broad peak ChIP-seq data
To further show the generality of the approach, we next applied the methodology to histone mark ChIP-seq data, which present broad peaks with variable length. For classifying these broad-peak patterns, we adapted the peak-shape approach to use annotated gene regions as additional input. Rather than considering coverage shapes in a fixed-sized window around each genomic position, shapes in predefined input regions of variable length are computed by scaling to a unit-window.
The analysis was performed using the same learning procedure, with only few differences. First, genomic coverage was normalised in the same way as the normal algorithm. Instead of considering positive and negative reads as separate inputs, the normalised read coverages were aligned and summed as g ′(x)=f ′(x+μ/2)+r ′(x−μ/2), where f ′(x) is the normalised coverage of reads mapping to the forward strand, r ′(x) is the normalised coverage of reads mapping in the reverse strand and μ is the fragment length mean. This step was necessary since the displacement between forward and reverse reads is a global property of the experiment and does not depend on the variable gene length. The shapes of all genes transcribed in the reverse strand were reversed.
The initial discovery of obvious peaks was done similarly to normal algorithm, the only difference being that instead of computing the Peak Shape Score for each genomic position by deconvolution (Eq. 1), we calculated a single score for each gene region by computing the dot product of the reshaped profile and the filter. The shape learning phase and p-value computation were done identically to the standard algorithm, returning a score and a p-value for each gene.
Area under the ROC curve values for histone mark ChIP-seq data
The performances of the method are consistent among the four cell lines, indicating that the method is robust. We note that the difference in predictive performance between the different datasets is most likely due to the fact that some histone marks (e.g. H3K36me3 ) correlate more strongly with active transcription than the other marks.
In this paper, we discussed the state-of-the-art and current trends in the field of peak-detection algorithms that lead to the development of the new CLC shape-based peak caller. We described our implementation of a general and statistically well founded algorithmic engine, applicable to a wide range of different datasets. This flexibility is combined with specificity by automatically learning the characteristics of the signal present in the data.
The systematic performance evaluation of the CLC shape-based peak caller on a published manually curated reference dataset shows that it compares favourably to popular current algorithms. It is important to note that the existing peak callers are readily tuned to detect narrow peaks at transcription factor binding-sites and hence are very well suited to the benchmark data. Hence, it is reassuring to see the new CLC shape-based peak caller performs on the same level and sometimes even better than CisGenome, HOMER, and MACS.
Besides the theoretical advantages of the peak-shape approach and the positive results obtained on the benchmark data, in practice it is equally important that the CLC shape-based peak caller wraps the underlying algorithm and statistics into an automated pipeline, which is simple to use for non-specialists. At the same time, we provide the option of further optimisation by manually delineating positive and negative examples - a process that is much easier understood and visualised than abstract parameters. Moreover, it is of great importance to make sure that the input data has sufficient quality and consequently that the results can be trusted. Therefore, potential problems are highlighted by detailed quality reports.
To our knowledge, the CLC shape-based peak caller is unique in its combination of flexibility and accuracy through the ability of building optimised filters for different analysis tasks and datasets. This enables sustained development and maintenance of the tool set in exploratory research, since the adaptation to different filters does not require changes to the underlying code base. At the same time, existing filters are transferable and can be applied to new datasets as they become available in production settings, so there is no need to re-learn and optimise the filters from scratch if the same peak-shape is to be detected.
The CLC shape-based peak caller represents an important building block within a larger ecosystem of NGS data analysis software. Hence considerations for further developments have to take into account the interplay with other related components. From a user centric view, major improvements will be gained by seamless interoperability with both upstream and downstream analysis steps. Since peak detection is now handled by a flexible and generally applicable algorithmic engine, the focus shifts from algorithmic issues to data integration, downstream analyses, and visualisation capabilities.
The typical analysis downstream of peak finding or classification are motif finding for Transcription-factor ChIP-seq data, the annotation of peaks with nearby genes for functional analysis such as gene set enrichment and the correlation with other omics datasets such as gene expression data.
A great deal of interoperability is already taken care of by the integration of the CLC shape-based peak caller into the track-based framework of CLC Genomics Workbench. By taking its inputs from tracks and producing outputs in the same format, it can be freely combined with other track-based tools. Therefore it is easily integrated into larger analysis pipelines and workflows while benefiting from the ongoing improvements to the visualisations and operations available for track-based data.
A major advantage of the CLC shape-based peak caller is that it is not exclusively designed to detect signals in transcription factor ChIP-seq datasets. Herein, we have shown that it could be directly applied to the DNase-seq data without having to change parameters and still yielding comparable performances. In principle, the methodology is applicable to detect signals in data from a wide range of different sequencing protocols, including FAIRE-seq, broad-peaks from ChIP-seq of histone-modifications or other epigenetic marks such as DNA methylation.
The manually curated peak annotation by  are obtained via http://tare.medisin.ntnu.no/chipseqbenchmark/downloads/ChIPSeq_files_in_bed_format/.
The corresponding original data from ENCODE (release 1) are available via the following links: http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeSydhTfbs/ and http://hgdownload-test.cse.ucsc.edu/goldenPath/hg18/encodeDCC/wgEncodeHudsonalphaChipSeq/release1/.
End-capture  DNase-seq data for the cell lines K562 and GM12878 used in this study were produced by the ENCODE consortium . The mapped.bam files “end-capture”  DNase-seq and validations were obtained from http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeOpenChromDnase/.
Validation was performed by simultaneously generating microarray-based data from the same material used for DNase-seq. The microarray used was a Nimblegen tiling array that covers the 1 % of the genome that was the focus of the pilot encode project.
CisGenome version 2.0 was downloaded from http://www.biostat.jhsph.edu/~hji/cisgenome/index_files/download.htm. The seqpeak tool was run with parameters “-c 2.5”. The columns “start” and “end” were chosen to define the peak regions and the columns “left_peakboundary” and “right_peakboundary” were chosen to define the refined boundaries. In both cases, the “rank” column was used to build the ROC curve.
HOMER version 4.7 was downloaded from http://homer.salk.edu/homer/download.html. The findPeaks tool was run with parameters “-style factor -F 2 -L 2 -C 0” for ChIP-seq and with the parameters “-style dnase -F 2 -L 2 -C 0” for the “end-capture” DNase-seq data. The column “findPeaks Score” was used to build the ROC curve. We note that removing the filtering options -F -L -C resulted in dramatically reduced performances, so the threshold for -F and -L was relaxed, instead.
MACS version 2.1.0 was obtained from https://github.com/taoliu/MACS/. The software was run with the parameter “–qvalue 0.25” for ChIP-seq and with the parameters “–qvalue 0.25 –nomodel –shift -100 –extsize 200” for the “end-capture” DNase-seq data The column “q-value” was used to build the ROC curve.
The CLC shape-based peak caller version 1.0 was run with the parameter “– p -value 0.25”. The Score Regions tools from Advanced Peak Shape Tools Plugin version 1.0 beta 3 was used to assign Peak Shape Scores to gold standard regions.
Broad peak ChIP-seq data
Histone modification ChIP-seq data for four cell lines GM12878, HepG2, Huvec and Nhek were collected from the Sequence Read Archive. The human genome hg19-GRCH37 was used as reference. Paired-end RNA-seq data from Encode/Caltech  were downloaded from SRA (GM12878: SRX159821, HepG2: SRX159823, Huvec: SRX159825 and Nhek: SRX159827) and analysed using the RNA-seq Analysis tool of the CLC Genomics Workbench 8.0 the using the parameters “Count paired reads as two = Yes”, “Expression value = RPKM” and “Calculate RPKM for genes without transcripts = Yes”. Overlapping genes were merged and the resulting region was counted once. A gene was considered to be expressed if its RPKM was >2 in all RNA-seq replicates and a gene to not be expressed if its RPKM was <2 in all replicates. This procedure resulted in 6478, 6311, 7025 and 6136 positive regions and 17346, 17894, 17786 and 17583 negative regions for GM12878, HepG2, Huvec and Nhek, respectively.
ChIP-seq data for the four cell lines  was downloaded from SRA (study ID: SRP005344) and was mapped to the human genome using the Map Reads to Reference tool of the CLC Genomics Workbench 8.0 using the option “Non-specific match handling = Ignore”. For each histone modification mark, all ChIP-seq replicates from the same cell line were used as input to the peak-shape analysis. The resulting Peak Shape Score was used to compute AUC-ROC values.
Publication costs for this article were funded by QIAGEN.
This article has been published as part of BMC Bioinformatics Volume 17 Supplement 5, 2016: Selected articles from Statistical Methods for Omics Data Integration and Analysis 2014. The full contents of the supplement are available online at http://bmcbioinformatics.biomedcentral.com/articles/supplements/volume-17-supplement-5.
This work was partially funded by the STATegra project, EU FP7 grant number 30600. We gratefully acknowledge Jens Dalgaard Nielsen, Morten Værum, Dan Amlund Thomsen, and Rasmus Haubro Rohde for support with the implementation. We also thank the members of the STATegra and 4DCellFate consortia for providing datasets and insightful feedback.
- Landt SG, Marinov GK, Kundaje A, Kheradpour P, Pauli F, Batzoglou S, Bernstein BE, Bickel P, Brown JB, Cayting P, Chen Y, DeSalvo G, Epstein C, Fisher-Aylor KI, Euskirchen G, Gerstein M, Gertz J, Hartemink AJ, Hoffman MM, Iyer VR, Jung YL, Karmakar S, Kellis M, Kharchenko PV, Li Q, Liu T, Liu XS, Ma L, Milosavljevic A, Myers RM, Park PJ, Pazin MJ, Perry MD, Raha D, Reddy TE, Rozowsky J, Shoresh N, Sidow A, Slattery M, Stamatoyannopoulos JA, Tolstorukov MY, White KP, Xi S, Farnham PJ, Lieb JD, Wold BJ, Snyder M: ChIP-seq guidelines and practices of the ENCODE and modENCODE consortia. Genome Res. 2012, 22 (9): 1813-31.View ArticlePubMedPubMed CentralGoogle Scholar
- Marinov GK, Kundaje A, Park PJ, Wold BJ: Large-scale quality analysis of published ChIP-seq data. G3 (Bethesda). 2014, 4 (2): 209-3.View ArticleGoogle Scholar
- Rhee HS, Pugh BF: Comprehensive genome-wide protein-DNA interactions detected at single-nucleotide resolution. Cell. 2011, 147 (6): 1408-19.View ArticlePubMedPubMed CentralGoogle Scholar
- Sabo PJ, Kuehn MS, Thurman R, Johnson BE, Johnson EM, Cao H, Yu M, Rosenzweig E, Goldy J, Haydock A, Weaver M, Shafer A, Lee K, Neri F, Humbert R, Singer MA, Richmond TA, Dorschner MO, McArthur M, Hawrylycz M, Green RD, Navas PA, Noble WS, Stamatoyannopoulos JA: Genome-scale mapping of DNase I sensitivity in vivo using tiling DNA microarrays. Nat Methods. 2006, 3 (7): 511-8.View ArticlePubMedGoogle Scholar
- Song L, Crawford GE: DNase-seq: a high-resolution technique for mapping active gene regulatory elements across the genome from mammalian cells. Cold Spring Harb Protoc. 2010, 2010 (2): 5384-View ArticleGoogle Scholar
- Buenrostro JD, Giresi PG, Zaba LC, Chang HY, Greenleaf WJ: Transposition of native chromatin for fast and sensitive epigenomic profiling of open chromatin, DNA-binding proteins and nucleosome position. Nat Methods. 2013, 10 (12): 1213-8.View ArticlePubMedPubMed CentralGoogle Scholar
- Giresi PG, Kim J, McDaniell RM, Iyer VR, Lieb JD: FAIRE (Formaldehyde-Assisted Isolation of Regulatory Elements) isolates active regulatory elements from human chromatin. Genome Res. 2007, 17 (6): 877-5.View ArticlePubMedPubMed CentralGoogle Scholar
- Jiang C, Pugh BF: Nucleosome positioning and gene regulation: advances through genomics. Nat Rev Genet. 2009, 10 (3): 161-72.View ArticlePubMedPubMed CentralGoogle Scholar
- Meyer CA, Liu XS: Identifying and mitigating bias in next-generation sequencing methods for chromatin biology. Nat Rev Genet. 2014, 15 (11): 709-21.View ArticlePubMedPubMed CentralGoogle Scholar
- Fejes AP, Robertson G, Bilenky M, Varhol R, Bainbridge M, Jones SJM: FindPeaks 3.1: a tool for identifying areas of enrichment from massively parallel short-read sequencing technology. Bioinformatics. 2008, 24 (15): 1729-30.View ArticlePubMedPubMed CentralGoogle Scholar
- Xu H, Handoko L, Wei X, Ye C, Sheng J, Wei CL, Lin F, Sung WK: A signal-noise model for significance analysis of ChIP-seq with negative control. Bioinformatics. 2010, 26 (9): 1199-204.View ArticlePubMedGoogle Scholar
- Ji H, Jiang H, Ma W, Johnson DS, Myers RM, Wong WH: An integrated software system for analyzing ChIP-chip and ChIP-seq data. Nat Biotechnol. 2008, 26 (11): 1293-1300.View ArticlePubMedPubMed CentralGoogle Scholar
- Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B: Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat Methods. 2008, 5 (7): 621-8.View ArticlePubMedGoogle Scholar
- Rozowsky J, Euskirchen G, Auerbach RK, Zhang ZD, Gibson T, Bjornson R, Carriero N, Snyder M, Gerstein MB: PeakSeq enables systematic scoring of ChIP-seq experiments relative to controls. Nat Biotechnol. 2009, 27 (1): 66-75.View ArticlePubMedPubMed CentralGoogle Scholar
- Boyle AP, Guinney J, Crawford GE, Furey TS: F-Seq: a feature density estimator for high-throughput sequence tags. Bioinformatics. 2008, 24 (21): 2537-8.View ArticlePubMedPubMed CentralGoogle Scholar
- Valouev A, Johnson DS, Sundquist A, Medina C, Anton E, Batzoglou S, Myers RM, Sidow A: Genome-wide analysis of transcription factor binding sites based on ChIP-Seq data. Nat Methods. 2008, 5 (9): 829-34.View ArticlePubMedPubMed CentralGoogle Scholar
- Kumar V, Muratani M, Rayan NA, Kraus P, Lufkin T, Ng HH, Prabhakar S: Uniform, optimal signal processing of mapped deep-sequencing data. Nat Biotechnol. 2013, 31 (7): 615-22.View ArticlePubMedGoogle Scholar
- Heng-Yi Wu, Jie Zhang KH. Peak detection on ChIP-Seq data using wavelet transformation. In: Bioinformatics and Biomedicine Workshops (BIBMW), 2010 IEEE International Conference On: 2010. p. 555–60.Google Scholar
- Jothi R, Cuddapah S, Barski A, Cui K, Zhao K: Genome-wide identification of in vivo protein-DNA binding sites from ChIP-Seq data. Nucleic Acids Res. 2008, 36 (16): 5221-31.View ArticlePubMedPubMed CentralGoogle Scholar
- Heinz S, Benner C, Spann N, Bertolino E, Lin YC, Laslo P, Cheng JX, Murre C, Singh H, Glass CK: Simple combinations of lineage-determining transcription factors prime cis-regulatory elements required for macrophage and B cell identities. Mol cell. 2010, 38 (4): 576-89.View ArticlePubMedPubMed CentralGoogle Scholar
- Zhang Y, Liu T, Meyer CA, Eeckhoute J, Johnson DS, Bernstein BE, Nussbaum C, Myers RM, Brown M, Li W, Liu XS: Model-based Analysis of ChIP-Seq (MACS). Genome Biol. 2008, 9 (9): 137-View ArticleGoogle Scholar
- Rashid NU, Giresi PG, Ibrahim JG, Sun W, Lieb JD: ZINBA integrates local covariates with DNA-seq data to identify broad and narrow regions of enrichment, even within amplified genomic regions. Genome Biol. 2011, 12 (7): 67-View ArticleGoogle Scholar
- Lan X, Bonneville R, Apostolos J, Wu W, Jin VX: W-ChIPeaks: a comprehensive web application tool for processing ChIP-chip and ChIP-seq data. Bioinformatics. 2011, 27 (3): 428-30.View ArticlePubMedGoogle Scholar
- Micsinai M, Parisi F, Strino F, Asp P, Dynlacht BD, Kluger Y: Picking ChIP-seq peak detectors for analyzing chromatin modification experiments. Nucleic Acids Res. 2012, 40 (9): 70-View ArticleGoogle Scholar
- Zang C, Schones DE, Zeng C, Cui K, Zhao K, Peng W: A clustering approach for identification of enriched domains from histone modification ChIP-Seq data. Bioinformatics. 2009, 25 (15): 1952-8.View ArticlePubMedPubMed CentralGoogle Scholar
- Xu H, Wei CL, Lin F, Sung WK: An HMM approach to genome-wide identification of differential histone modification sites from ChIP-seq data. Bioinformatics. 2008, 24 (20): 2344-9.View ArticlePubMedGoogle Scholar
- Song Q, Smith AD: Identifying dispersed epigenomic domains from ChIP-Seq data. Bioinformatics. 2011, 27 (6): 870-1.View ArticlePubMedPubMed CentralGoogle Scholar
- Hower V, Evans SN, Pachter L: Shape-based peak identification for ChIP-Seq. BMC Bioinformatics. 2011, 12: 15-View ArticlePubMedPubMed CentralGoogle Scholar
- Kornacker K, Rye MB, Håndstad T, Drabløs F: The Triform algorithm: improved sensitivity and specificity in ChIP-Seq peak finding. BMC Bioinformatics. 2012, 13: 176-View ArticlePubMedPubMed CentralGoogle Scholar
- Wong KC, Li Y, Peng C, Zhang Z: SignalSpider: probabilistic pattern discovery on multiple normalized ChIP-Seq signal profiles. Bioinformatics. 2015, 31 (1): 17-24.View ArticlePubMedGoogle Scholar
- Rye MB, Sætrom P, Drabløs F: A manually curated ChIP-seq benchmark demonstrates room for improvement in current peak-finder programs. Nucleic Acids Res. 2011, 39 (4): 25-View ArticleGoogle Scholar
- Heydarian M, Romeo Luperchio T, Cutler J, Mitchell CJ, Kim MS, Pandey A, Soliner-Webb B, Reddy K: Prediction of gene activity in early B cell development based on an integrative multi-omics analysis. J Proteomics Bioinform. 2014, 7 (2): 050-063.View ArticleGoogle Scholar
- Mendoza-Parra MA, Nowicka M, Van Gool W, Gronemeyer H: Characterising ChIP-seq binding patterns by model-based peak shape deconvolution. BMC Genomics. 2013, 14: 834-View ArticlePubMedPubMed CentralGoogle Scholar
- Hotelling H: The generalization of Student’s ratio. Ann Math Stat. 1931, 2 (3): 360-78.View ArticleGoogle Scholar
- Stanton KP, Parisi F, Strino F, Rabin N, Asp P, Kluger Y: Arpeggio: harmonic compression of ChIP-seq data reveals protein-chromatin interaction signatures. Nucleic Acids Res. 2013, 41 (16): 161-View ArticleGoogle Scholar
- Wilbanks EG, Facciotti MT: Evaluation of algorithm performance in ChIP-seq peak detection. PLoS One. 2010, 5 (7): 11471-View ArticleGoogle Scholar
- Kharchenko PV, Tolstorukov MY, Park PJ: Design and analysis of ChIP-seq experiments for DNA-binding proteins. Nat Biotechnol. 2008, 26 (12): 1351-9.View ArticlePubMedPubMed CentralGoogle Scholar
- Mahalanobis PC: On the generalised distance in statistics. Proc Nat Inst Sci India. 1936, 2 (1): 49-55.Google Scholar
- Huber P: Robust Statistics. Wiley Series in Probability and Statistics. 1981, Wiley, New York, NY, USAGoogle Scholar
- Feng J, Liu T, Zhang Y: Using MACS to identify peaks from ChIP-Seq data. Curr Protoc Bioinformatics. 2011, 34 (2): 2-14.Google Scholar
- Feng J, Liu T, Qin B, Zhang Y, Liu XS: Identifying ChIP-seq enrichment using MACS. Nat Protoc. 2012, 7 (9): 1728-40.View ArticlePubMedGoogle Scholar
- Koohy H, Down TA, Spivakov M, Hubbard T: A comparison of peak callers used for DNase-seq data. PLoS ONE. 2014, 9 (5): 96303-View ArticleGoogle Scholar
- Ernst J, Kheradpour P, Mikkelsen TS, Shoresh N, Ward LD, Epstein CB, Zhang X, Wang L, Issner R, Coyne M, Ku M, Durham T, Kellis M, Bernstein BE: Mapping and analysis of chromatin state dynamics in nine human cell types. Nature. 2011, 473 (7345): 43-9.View ArticlePubMedPubMed CentralGoogle Scholar
- Bing Li, Michael Carey JLW: The role of chromatin during transcription. Cell. 2007, 128 (4): 707-19. 10.1016/j.cell.2007.01.015.View ArticleGoogle Scholar
- Birney E, Stamatoyannopoulos JA, Dutta A, Guigó R, Gingeras TR, Margulies EH, Weng Z, Snyder M, Dermitzakis ET, Thurman RE, Kuehn MS, Taylor CM, Neph S, Koch CM, Asthana S, Malhotra A, Adzhubei I, Greenbaum JA, Andrews RM, Flicek P, Boyle PJ, Cao H, Carter NP, Clelland GK, Davis S, Day N, Dhami P, Dillon SC, Dorschner MO, Fiegler H, Giresi PG, Goldy J, Hawrylycz M, Haydock A, Humbert R, James KD, Johnson BE, Johnson EM, Frum TT, Rosenzweig ER, Karnani N, Lee K, Lefebvre GC, Navas PA, Neri F, Parker SCJ, Sabo PJ, Sandstrom R, Shafer A, Vetrie D, Weaver M, Wilcox S, Yu M, Collins FS, Dekker J, Lieb JD, Tullius TD, Crawford GE, Sunyaev S, Noble WS, Dunham I, Denoeud F, Reymond A, Kapranov P, Rozowsky J, Zheng D, Castelo R, Frankish A, Harrow J, Ghosh S, Sandelin A, Hofacker IL, Baertsch R, Keefe D, Dike S, Cheng J, Hirsch HA, Sekinger EA, Lagarde J, Abril JF, Shahab A, Flamm C, Fried C, Hackermüller J, Hertel J, Lindemeyer M, Missal K, Tanzer A, Washietl S, Korbel J, Emanuelsson O, Pedersen JS, Holroyd N, Taylor R, Swarbreck D, Matthews N, Dickson MC, Thomas DJ, Weirauch MT, Gilbert J, Drenkow J, Bell I, Zhao X, Srinivasan KG, Sung WK, Ooi HS, Chiu KP, Foissac S, Alioto T, Brent M, Pachter L, Tress ML, Valencia A, Choo SW, Choo CY, Ucla C, Manzano C, Wyss C, Cheung E, Clark TG, Brown JB, Ganesh M, Patel S, Tammana H, Chrast J, Henrichsen CN, Kai C, Kawai J, Nagalakshmi U, Wu J, Lian Z, Lian J, Newburger P, Zhang X, Bickel P, Mattick JS, Carninci P, Hayashizaki Y, Weissman S, Hubbard T, Myers RM, Rogers J, Stadler PF, Lowe TM, Wei CL, Ruan Y, Struhl K, Gerstein M, Antonarakis SE, Fu Y, Green ED, Karaöz U, Siepel A, Taylor J, Liefer LA, Wetterstrand KA, Good PJ, Feingold EA, Guyer MS, Cooper GM, Asimenos G, Dewey CN, Hou M, Nikolaev S, Montoya-Burgos JI, Löytynoja A, Whelan S, Pardi F, Massingham T, Huang H, Zhang NR, Holmes I, Mullikin JC, Ureta-Vidal A, Paten B, Seringhaus M, Church D, Rosenbloom K, Kent WJ, Stone EA, ENCODE Project Consortium, NISC Comparative Sequencing Program, Baylor College of Medicine Human Genome Sequencing Center, Washington University Genome Sequencing Center, Broad Institute, Children’s Hospital Oakland Research Institute, Batzoglou S, Goldman N, Hardison RC, Haussler D, Miller W, Sidow A, Trinklein ND, Zhang ZD, Barrera L, Stuart R, King DC, Ameur A, Enroth S, Bieda MC, Kim J, Bhinge AA, Jiang N, Liu J, Yao F, Vega VB, Lee CWH, Ng P, Shahab A, Yang A, Moqtaderi Z, Zhu Z, Xu X, Squazzo S, Oberley MJ, Inman D, Singer MA, Richmond TA, Munn KJ, Rada-Iglesias A, Wallerman O, Komorowski J, Fowler JC, Couttet P, Bruce AW, Dovey OM, Ellis PD, Langford CF, Nix DA, Euskirchen G, Hartman S, Urban AE, Kraus P, Van Calcar S, Heintzman N, Kim TH, Wang K, Qu C, Hon G, Luna R, Glass CK, Rosenfeld MG, Aldred SF, Cooper S: Identification and analysis of functional elements in 1 % of the human genome by the ENCODE pilot project. Nature. 2007, 447 (7146): 799-816. View ArticlePubMedGoogle Scholar
- Li J, Moazed D, Gygi SP: Association of the histone methyltransferase Set2 with RNA polymerase II plays a role in transcription elongation. J Biol Chem. 2002, 277 (51): 49383-8.View ArticlePubMedGoogle Scholar
This article is published under license to BioMed Central Ltd. Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License(http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.