An integrated ChIP-seq analysis platform with customizable workflows
© Giannopoulou and Elemento; licensee BioMed Central Ltd. 2011
Received: 21 February 2011
Accepted: 7 July 2011
Published: 7 July 2011
Skip to main content
© Giannopoulou and Elemento; licensee BioMed Central Ltd. 2011
Received: 21 February 2011
Accepted: 7 July 2011
Published: 7 July 2011
Chromatin immunoprecipitation followed by next generation sequencing (ChIP-seq), enables unbiased and genome-wide mapping of protein-DNA interactions and epigenetic marks. The first step in ChIP-seq data analysis involves the identification of peaks (i.e., genomic locations with high density of mapped sequence reads). The next step consists of interpreting the biological meaning of the peaks through their association with known genes, pathways, regulatory elements, and integration with other experiments. Although several programs have been published for the analysis of ChIP-seq data, they often focus on the peak detection step and are usually not well suited for thorough, integrative analysis of the detected peaks.
To address the peak interpretation challenge, we have developed ChIPseeqer, an integrative, comprehensive, fast and user-friendly computational framework for in-depth analysis of ChIP-seq datasets. The novelty of our approach is the capability to combine several computational tools in order to create easily customized workflows that can be adapted to the user's needs and objectives. In this paper, we describe the main components of the ChIPseeqer framework, and also demonstrate the utility and diversity of the analyses offered, by analyzing a published ChIP-seq dataset.
ChIPseeqer facilitates ChIP-seq data analysis by offering a flexible and powerful set of computational tools that can be used in combination with one another. The framework is freely available as a user-friendly GUI application, but all programs are also executable from the command line, thus providing flexibility and automatability for advanced users.
The use of chromatin immunoprecipitation in combination with high-throughput sequencing (ChIP-seq) has enabled the study of genome-wide mapping of protein-DNA interaction and epigenetic marks. By sequencing millions of immunoprecipitated DNA fragments in a single experiment, ChIP-seq outperforms the array-based ChIP-chip (Chromatin Immunoprecipitation followed by DNA microarray hybridization) technology in terms of quality, specificity, and coverage [1–3], and has the potential to greatly improve our understanding of the mechanisms underlying transcriptional regulation [4–8]. Many peak detection methodologies and software tools have been developed for the analysis of ChIP-seq data since the introduction of the technology [2, 3, 9, 10]. Although peak detection is important for the analysis of ChIP-seq data, it is only the first step. Additional computational tools are needed to help interpret the genome-wide transcription factor binding and histone mark enrichment patterns revealed by peak detection procedures. We have developed ChIPseeqer, a comprehensive computational framework that enables broad, but also in-depth, analysis of ChIP-seq data. The framework includes: (1) gene-level annotation of peaks, (2) pathways enrichment analysis, (3) regulatory element analysis, using either a de novo approach, known or user-defined motifs, (4) nongenic peak annotation (repeats, CpG islands, duplications, published ChIP-seq datasets), (5) conservation analysis, (6) clustering analysis, (7) visualization tools, (8) integration and comparison across different ChIP-seq experiments. These components share a common architecture: they take as input a set of ChIP-seq peaks, perform the defined analysis, and output one or more sets of peaks that can be used by any of the other tools provided. Thus, our framework was designed to offer the flexibility required to perform complex analyses by defining custom workflows. In principle, ChIPseeqer can use peaks generated by any peak-calling program as initial input, assuming these peaks are in a simple and standard format (i.e., chromosome, start position, end position). For convenience, ChIPseeqer also includes its own fast and accurate peak finding algorithm, previously evaluated and compared with other algorithms in Qin et al. . Details and additional validation of our peak detection algorithm are provided in Additional file 1. By analyzing a published ChIP-seq dataset, we show how the modular nature of the framework can help the users create easily computational pipelines and address sophisticated biological questions.
Although numerous approaches already exist for the analysis of ChIP-seq data, many of them focus on the peak detection process and provide few or no tools for the interpretation of ChIP-seq peaks [5–8, 12–19]. Several frameworks for interpreting ChIPseq datasets have nonetheless been developed, all of which have strengths and limitations. For example, the Cistrome project (unpublished, ) integrates several analysis tools, such as gene annotation, motif and pathways analysis, and peak conservation. However, users must upload their ChIP-seq data to the Cistrome server, which will be increasingly time-consuming and less practical as the size of ChIP-seq datasets increases. Moreover, feeding the results of an analysis directly into another tool within the same framework can be difficult for some users because each tool supports different input formats. Galaxy [21–25], which is used as Cistrome's backend, is a powerful framework but possibly too general for the analysis of ChIP-seq data. In particular, Galaxy does not allow control over several important parameters in ChIP-seq data analysis, such as the maximum or minimum distance between the peak and its closest gene in the peak-gene association task . CisGenome  also supports tools for the integrated analysis of ChIP-seq data, such as annotation of peaks with their neighbor genes, conservation analysis, and motifs discovery. However, CisGenome does not allow the correlation of peaks and their target genes with Gene Ontology (GO) terms and pathways that could shed light on the biological processes and pathways controlled by the transcription factors (TFs) or histone modifications assayed by ChIP-seq. The GPAT program  provides systematic annotation of genomic positions in general. It uses gene annotation from different public databases, such as RefSeq and Ensembl, and provides access to the expression status of the corresponding genes from existing transcriptomic databases, or user-generated expression datasets. The limitation of GPAT is the lack of other tools for the analysis of ChIP-seq data, apart from genomic annotation. Thus, GPAT users cannot perform motif discovery, pathways enrichment, and other analyses that are useful in the ChIP-seq context. EpiChIP  offers gene-based enrichment analysis of ChIP-seq datasets. In particular, EpiChIP looks for enrichment of the ChIP-seq reads over the control sample in specific regions of the genes, such as the 5'- or 3'- end, exons or introns. This approach has the advantage of identifying directly the genes that are enriched in the TF or the histone modification of the reference dataset. However, the program lacks in providing further annotation of the enriched regions. The seqMINER platform  aims at integrating and comparing different ChIP-seq datasets in terms of read density. The algorithm first estimates for a set of genomic regions (i.e., reference dataset) the read density of multiple ChIP-seq datasets. Clustering and visualization methods are then provided to show groups of regions with similar binding features. Although this approach is useful to integrate ChIP-seq datasets, it focuses on the comparison of read density profiles and does not integrate other sources of information, such as the motifs and pathways enrichment or the level of conservation. HOMER  provides a suite of programs originally developed for motif discovery, and later for ChIP-seq peak detection. Although it includes tools for gene annotation, clustering, and visualization of the peaks (e.g., histograms, heatmaps), it does not support conservation analysis and can only run from the command-line. BEDTools  is a UNIX-based collection of utilities that allow common operations on genomic features in general (e.g., find overlaps between two files with genomic intervals, extract FASTA sequences from genomic intervals). Although the BEDTools are designed to provide fast solutions to basic operations on large data volumes produced by DNA sequencing, they do not offer computational tools for the functional interpretation of ChIP-seq peaks (e.g., motifs and pathways analysis). Their command-line nature also demands extra effort and computer skills from users. CEAS  is a stand-alone extension of a web application previously developed for ChIP-chip data , but is also offered through the Cistrome framework. The tool provides basic annotation tools for ChIP-seq data, such as the estimation of peaks distribution across the genome, identification of genes associated with peaks by proximity, and more. However, one drawback of CEAS, when using it through the Cistrome application, is that it produces graphical representation of the results and does not output lists of peaks that belong to specific genomic categories (e.g., promoters, introns). PeakAnalyzer  can subdivide ChIP-seq peaks that have multiple sites of enrichment into smaller peaks; this procedure may facilitate more detailed analysis of individual subpeaks. It also offers annotation functions, and can locate the nearest downstream genes and transcription start sites for each peak. It can also determine overlapping peaks between different datasets. Although PeakAnalyzer allows the user to perform gene annotation, motifs analysis, annotation with functional elements, and comparisons across datasets, it does not support the functional interpretation of the ChIPseq results through their association with pathways. On the other hand, GREAT  is a web application that supports the analysis of functional significance of ChIP-seq peaks using 20 different information sources (e.g., Gene Ontology, PANTHER pathway, Pathway Commons, InterPro). Importantly, the tool integrates not only proximal but also distal binding events to obtain a gene-based p-value for enrichment . However, GREAT does not offer an automated way to retrieve lists of genes and peaks associated with a specific pathway, Gene Ontology term, or motif. This feature (provided in our framework) would enable users to perform further and more targeted analysis on subsets of the initial peaks, which were found to be functionally significant. In contrast, ChIPseeqer has many advantages compared to these programs. First, it offers a variety of tools that cover not only basic gene annotation, but also a wide range of computational analyses, including motif analysis, pathways enrichment, estimation of conservation, read density analysis and more. Second, ChIPseeqer allows the comparison of multiple datasets based not only on read density profiles, but also on peak binding overlap, and integration with other ChIP-seq datasets. Third, the framework provides a straightforward and effortless connection between the tools; no data format transformation is needed to combine the tools and perform a comprehensive and sophisticated data analysis. Fourth, the framework allows the users to control all parameters of the analysis, such as the minimum distance away from transcripts, the upstream distance from transcription start site (TSS), the database annotation and more. In addition, ChIPseeqer runs locally on the user's computer enabling the analysis of very large datasets. Finally, it provides a user-friendly graphical interface that can be used effortlessly even by non-expert users.
ChIPseeqer is available as a set of standalone command line tools. For advanced users, command line tools provide great flexibility and automatability. For less advanced users, we have made these tools available via a graphical user interface (GUI), developed using the multi-platform QT framework . The bundle (i.e., command line tools and GUI) has been tested on Linux and Mac OS X. Detailed installation instructions and documentation for all tools included in the framework are also available online . Our implementation is available as free software, released under the GNU General Public License (GPL) v3 .
Many computations performed in ChIPseeqer involve assessing overlaps between hundreds or thousands of genomic regions (i.e., peaks, transcripts/genes, gene parts), and therefore, efficient algorithms are needed to quickly determine and characterize these overlaps. In ChIPseeqer, fast comparison on genomic intervals is performed using interval trees [39, 40], ordered tree structures that store and index intervals with fast querying and processing times, and ensure efficient searching of all indexed intervals that overlap with any given interval or point. An interval tree is an augmented binary search tree: each node contains an interval and also stores the maximum endpoint of the subtree rooted at the particular node. Apart from the insert and delete operations that characterize the binary search trees, interval trees also support a query operation that allows searching the tree for overlaps with a given interval. The first step of the algorithm sets the root of the tree as current node. The second step checks if the given interval overlaps with the current node; if not, it compares the low endpoint of the given interval with the maximum value stored at the left child of the current node. If the low endpoint of the interval is lower than the maximum value stored, then the current node is set to the left child; otherwise it is set to the right child. Then the algorithm goes back to the first step and repeats the same procedure for the new current node until it finds an overlap of the given interval with the interval stored in the current node, or until the whole tree is explored. Of note, we are using a modified implementation of the original algorithm , so as not to stop at the first overlapping interval but find all intervals that overlap with the given one. Moreover, we use a randomization procedure that takes into account the natural clumping of features  to assess the statistical significance of the observed number of overlapping peaks between two peak files. This procedure consists of generating many "random" lists of peaks maintaining peak sizes and number of peaks as well as the chromosomal and genomic distribution of the peaks in the first peak file. The latter means that each random list of peaks maintains the same fraction of peaks in promoter, exonic, intronic, downstream, and intergenic regions as the original peak file. Then, for each random peak list, the number of overlapping peaks with the second peak file (kept unchanged) is calculated. A p-value is determined by counting the number of times the random overlap is equal to or greater than the originally observed number of overlapping peaks. Additionally, the z-score is estimated, representing the distance (in number of standard deviations) betweem the observed number of overlapping peaks and the average number of overlapping peaks expected by chance.
One of the advantages of the framework is the support of different formats that are well-established in deep sequencing experiments, such as SAM, BAM, eland, extended eland, bed and export. ChIPseeqer also provides gene-based annotation from multiple sources and databases, such as RefSeq, Ensembl, UCSC Genes, and AceView. Finally, four species are currently supported, namely Homo sapiens, Mus musculus, Drosophila melanogaster, and Saccharomyces cerevisiae. Support for additional species can be added to the framework as described in Additional file 1 and in our online documentation.
The main tools of the ChIPseeqer framework.
Offers basic quality control tools.
Splits read files (e.g., bed, eland) into one read file per chromosome.
Peak detection algorithm.
Creates a promoters-based annotation of the detected peaks (i.e., gene name-description, peaks)
Finds the peaks distribution in the genome (e.g., exons/introns/intergenic) and creates lists of these peaks.
Creates a UCSC Genome Browser track for the detected peaks.
Creates a UCSC Genome Browser track for the reads density.
Finds the peaks that overlap with repeating elements, CpG islands and segmental duplicates.
Runs FIRE for the detected peaks, in order to perform an unsupervised motif discovery.
Runs MyScanACE for the detected peaks, in order to look for specific motifs (Jaspar, Bulyk PBM databases).
Runs PAGE for the genes associated with the detected peaks, in order to perform pathways analysis.
Looks for genes (and their corresponding peaks) that are associated to a specific pathway (e.g., apoptosis, GO:0060742).
Estimates the conservation scores for the detected peaks and for random intervals to allow comparison.
Creates a reads density matrix for a window around the TSS or the TES of the genes, or for any interval selected.
Estimates the avg/max reads number for every input peak, across multiple ChIP-seq datasets and creates a peak-based reads matrix.
Clusters a matrix (e.g., k-means, hierarchical, SOMs) and visualizes the clustering.
Compares two lists of peaks and finds the overlapping peaks and the peaks that are unique in each list.
Compares two lists of genes and finds the common genes and the genes that are unique in each list.
Estimates the Jaccard similarity coefficient for a set of peak files. The larger the coefficient, the more similarity you have between two peak files
Creates gene-based matrices (one for promoters, one for exons, etc) for many peak files. Summarizes the number of peaks that fall in specific gene parts, across many different peak files (TFs).
Finds peaks that are away from known genes.
Finds the closest gene(s) for each peak.
Estimates the avg/max reads number for every peak, for a ChIP-seq dataset and creates a peak-based read matrix.
Extracts the peaks that have a specific FIRE motif (can be applied after running FIRE).
Creates the input file for iPAGE from a list of genes.
(1) running the peak detection algorithm for a TF (e.g., ETS) ChIP-seq dataset,
(2) finding the peaks that have a specific motif (e.g., the ELK1 motif) using the ChIPseeqerMotifMatch module,
(3) identifying the peaks that bind at the promoters of known RefSeq genes using ChIPseeqerAnnotate, and
(4) performing pathways analysis on these genes with ChIPseeqeriPAGE, in order to find biological processes in which the given TF is likely involved. Another workflow (see Figure 1B) identifies putative enhancers based on TF and histone modifications ChIP-seq data. In that case, the intergenic peaks are first detected using ChIPseeqerAnnotate, and then the peaks that also overlap with enhancer marks  are reported using CompareIntervals. The corresponding subset of peaks represents putative enhancers; to discover informative regulatory elements within these peaks, unsupervised de novo motif analysis can be performed using ChIPseeqerFIRE. Finally, the ChIPseeqerCons module can be used to compare the conservation between putative enhancers and random genomic regions, in order to determine enhancers that are most likely to be functional.
To illustrate the power and flexibility of ChIPseeqer, we analyzed a published ETS1 ChIP-seq , performed in Jurkat T cells. ETS1 is an oncogene  and member of the ETS family of eukaryotic transcription factors. It is preferentially expressed at high levels in B and T cells, and plays a critical role in T cell activation . Recent studies based on chromatin immunoprecipitation have shown ETS1 binding events in both promoters and enhancers in Jurkat T cells [43, 46].
Pathways analysis between the ETS1 distal and promoter peaks.
T/B cell related
Leukocyte differentiation, GO:0002521
p < 0.001
p < 1
Lymphocyte activation, GO:0046649
p < 0.001
p < 1
p < 0.001
p < 1
Hemopoietic or lymphoid organ development, GO:0048534
p < 0.001
p < 1
Immune response, GO:0006955
p < 0.01
p < 1
Immune system development, GO:0002520
p < 0.01
p < 1
B cell proliferation, GO:0042100
p < 0.01
p < 1
B cell activation, GO:0042113
p < 0.001
p < 1
Biopolymer catabolic process, GO:0043285
p < 1
p < 1e-29
RNA splicing, GO:0008380
p < 1
p < 1e-50
DNA metabolic process, GO:0006259
p < 1
p < 1e-29
T/B cell related
p < 1e-05
p < 1
p < 0.001
p < 1
p < 0.01
p < 1
p < 0.01
p < 1
p < 1e-08
p < 1
p < 1e-04
p < 1
p < 1
p < 1e-06
p < 1
p < 1e-05
List of the 39 genes with both promoter and distal ETS1 peaks.
# Gene ID Gene Description
AKAP11 A-kinase anchor protein 11 2
AKR1A1 alcohol dehydrogenase 3
ATP5O ATP synthase subunit O, mitochondrial precursor 4
C1orf109 hypothetical protein LOC54955 5
C2orf29 hypothetical protein LOC55571 6
C9orf123 transmembrane protein C9orf123 7
CDK9 cell division protein kinase 9 8
CHSY1 chondroitin sulfate synthase 1 9
CKAP2L cytoskeleton-associated protein 2-like 10
CLINT1 clathrin interactor 1 11
DUSP2 dual specificity protein phosphatase 2 12
DUSP6 dual specificity protein phosphatase 6 isoform 13
HSPC157 hypothetical LOC29092 14
KIAA0427 CBP80/20-dependent translation initiation factor 15
LDHA L-lactate dehydrogenase A chain isoform 5 16
LOC100188949 hypothetical LOC100188949 17
LOC285456 hypothetical LOC285456 18
LSM14B protein LSM14 homolog B 19
MAX protein max isoform a 20
MRPS18A 28S ribosomal protein S18a, mitochondrial 21
MTF2 metal-response element-binding transcription 22
NAIF1 nuclear apoptosis-inducing factor 1 23
NDUFA10 NADH dehydrogenase [ubiquinone] 1 alpha 24
POMP proteasome maturation protein 25
PSMA6 proteasome subunit alpha type-6 26
RBM16 putative RNA-binding protein 16 27
RBM38 RNA-binding protein 38 isoform a 28
RPN1 dolichyl-diphosphooligosaccharide--protein 29
SEPHS2 selenide, water dikinase 2 30
SIRPG signal-regulatory protein gamma isoform 1 31
SPRED2 sprouty-related, EVH1 domain-containing protein 32
TFRC transferrin receptor protein 1 33
TMEM18 transmembrane protein 18 34
TRIP13 thyroid receptor-interacting protein 13 isoform 35
TXN2 thioredoxin, mitochondrial precursor 36
UBE2D2 ubiquitin-conjugating enzyme E2 D2 isoform 1 37
ZFAT zinc finger protein ZFAT isoform 1 38
ZNF212 zinc finger protein 212 39
ZNF683 zinc finger protein 683
There is a large occupancy of ETS1 peaks at the promoters but also at intergenic regions.
ETS1 regions are bound by multiple motifs, either from the ETS-domain or non-ETSrelated (e.g., HLF, RUNX1).
Specific pathways are preferentially related to genes with ETS1 binding in promoters or intergenic regions.
It is straightforward to characterize ETS1 binding to genomic regions with enhancers signatures (e.g., H3K4me1+/H3K4me3-).
It is possible to determine a list of genes that may be regulated by an enhancer-bound
TF, through a chromatin-looping event. Although these analyses could have been performed by combining several published tools or custom scripts, in ChIPseeqer this is fast (see Performance Evaluation section in Additional file 1) and straightforward and does not requiring any programming knowledge. Thus, by creating custom workflows that combine powerful computational programs, ChIPseeqer facilitates the comprehensive and in-depth analysis of ChIP-seq datasets.
This section demonstrates the diversity and versatility of the framework by presenting basic ChIPseeqer modules in further detail. A more exhaustive description of these tools is available online .
This module associates a set of ChIP-seq peaks, given as input, with the closest genes in the genome, and determines where each peak is located in these genes. In particular, ChIPseeqerAnnotate classifies input peaks in categories, (i.e., promoter, distal, intergenic, intronic, exonic and downstream). This module generates lists of peaks found in each of these classes of genomic regions. This analysis is controlled by default or user-defined parameters controlled by the user, such as the promoter window around the TSS and the annotation database. For example, promoters are defined by default as 4kb-long regions, around the TSS but not extended further than the downstream extremity of the genes. Moreover, ChIPseeqerAnnotate reports peaks overlapping with the first intron, since it has been reported that some first introns play a vital role in transcriptional control and splicing . The tool also determines peaks not overlapping with any gene part, but found to be at least at a user-defined distance away from known genes (default is set to 2 kb away). We call these peaks distal or intergenic, because they occur in intergenic regions (known to contain important regulatory elements such as enhancers and insulators). The lists of peaks generated by ChIPseeqerAnnotate can be directly used in other tools within the framework to perform subsequent analyses. This can be useful to focus further analyses on specific classes of peaks (e.g. promoter peaks or intergenic peaks); ChIPseeqerAnnotate lets users extract these subsets of peaks. ChIPseeqerAnnotate has other interesting features. It generates a gene-based matrix that summarizes the number of peaks found in each gene part (e.g., promoters, exons, introns, intergenic) and can help the user identify quickly the peak binding occurring in their gene of interest. We have also developed tools that merge and combine this matrix output, in order to extract, for example, the genes with both promoter and intergenic peaks. The expected fraction of peaks in different genomic categories (e.g., promoters, introns, exons, intergenic) is also provided, based on the fraction of the genome in each of these regions, so that it can be compared to the observed fraction of the input peaks in each category. Finally, several widely used gene annotations are supported, such as the RefSeq, Ensembl, UCSCGenes, and AceView (other databases can also be easily added). This feature enables comparing the peak-gene association results between different databases.
Pathways analysis can help elucidate important biological mechanisms associated with genome-wide binding and histone modification patterns. and can be performed after peaks have been associated with genes, as described in ChIPseeqerAnnotate. We have integrated two pathways analysis modes in the framework that involve: (1) looking for a given pathway of interest within the genes associated with input peaks and (2) looking for any pathways that are enriched in these genes. Pathways annotations are obtained from the Gene Ontology , KEGG database , Biocarta pathways , the SignatureDB online resource , and the Reactome pathways . Importantly, both modes generate lists of peaks associated with genes in the query pathway or in the enriched ones. These peaks can then be used as input to other tools in the framework.
In order to discover highly enriched pathways in the genes associated with input peaks, we use iPAGE , an information-theoretic pathway analysis framework. In iPAGE, sets of genes are used as input, and pathways that are enriched in each set are reported . ChIPseeqeriPAGE, is the module that integrates iPAGE within user-defined workflows. As with the previous module, users choose which input peaks to include in the analysis, based on their association with genes. The program then outputs the lists of peaks associated with specific enriched pathways.
Regulatory element analysis of ChIP-seq peaks can discover the DNA sequence motifs bound by the TF assayed by ChIP-seq, and/or to find sequences bound by its co-factors. We have integrated two regulatory element analysis modes in the framework: (1) analysis based on known motifs or user-defined motif patterns, and (2) de novo motif analysis. Both analyses require extracting DNA sequences under the peaks from the genome reference sequence. Efficient extraction is performed by pre-indexing genomes using the SAMTools C library .
Several software tools support searching for peaks that match a specific motif, but they often have limitations that restrict their usability. For example, in Cistrome  it is not straightforward to look for a specific motif, since available motifs are not shown to the user (only the available motif databases are). In HOMER , only motifs previously detected by the software are available for searching; the integration with public and popular sequence motif databases such as JASPAR that would enlarge the pool of available motifs is limited. ChIPseeqerMotifMatch seeks to overcome some of these limitations. To perform known motif analysis in ChIPseeqerMotifMatch, the user either selects the desired motif from a compiled dataset of ~250 TF binding sites (defined as position-specific weight matrices) from JASPAR  and UniPROBE  databases, or provides a consensus sequence in the form of regular expressions (e.g., TCCAAT, [AT]CG[CT]). In the former case, peak regions are scanned using the Berg and Von Hippel method  and a user-defined affinity threshold , and peaks containing one or more occurrence of the motif are given as output. Additional information such as motif positions within the peak and orientation are also reported. In the latter case, user-specific consensus sequences are used instead of weight matrices, and peak regions are scanned using regular expression matching algorithms from the pcre C library .
De novo motif analysis is performed using FIRE, an information-theoretic methodology for identification and characterization of regulatory elements . In order to search for any informative motifs that are highly enriched within the detected ChIP-seq peaks, background sequences are first created. These background sequences can be extracted either randomly across the entire genome (option "random"), or immediately adjacent to the peak regions (option "adjacent"). They can also be extracted so as to preserve the C+G and CpG content of the input sequences using two different options. The "CGI" option estimates the fraction of the original peaks that overlap with CpG islands, and then produces random background regions that maintain this fraction of CpG islands overlap. Alternatively, using the "1MM" option, the program calculates for each input peak sequence 1st order Markov frequencies and uses these frequencies to generate new random sequences. As shown in Additional File 1 Figure S10, both CGI and 1MM preserve C+G and CpG frequencies of the input peaks. The option that the users should use depends on the question they want to address (e.g., Are there are any DNA motifs enriched in ChIP-seq peaks compared to regions flanking these peaks?, Are there any DNA motifs enriched in ChIP-seq peaks compared to randomly selected genomic regions with similar lengths and nucleotide compositions?). After identification of the motifs that best explain the distinction between peak regions and background sequences, peak lists containing each motif are extracted and can be used as input to other tools in the framework, such as pathway analysis tools.
The ability to integrate the results of a ChIP-seq study with existing and publicly available ChIP-seq is an important. For example, this integration could suggest transcription factor, co-factors or histone modification that should be further explored because of their extensive overlap with a set of ChIP-seq peaks. In the ChIPseeqer framework, the ChIPseeqerNongenicAnnotate provides such capabilities; it can determine the subset of input peaks that overlap with peaks obtained from TF or histone modification ChIP-seq datasets of the ENCODE project [73, 74], as well as the statistical significance of this overlap (ENCODE datasets are subject to the ENCODE data usage policy available at http://genome.ucsc.edu/ENCODE/terms.html). ChIPseeqerNongenicAnnotate can perform additional integrative analyses. For example, extensive literature has shown that TF binding sites and specific histone modifications can be associated with repeated elements  and other nongenic elements, such as CpG islands . Filtering the peaks based on that type of features could reveal interesting groups of peaks that have the potential to alter and impact gene expression (e.g., possible promoters, retroelements that impact transcriptional networks  and more). Using the ChIPseeqerNongenicAnnotate module and track-based data from the UCSC Genome Browser, users can quickly and easily determine which of their input ChIP-seq peaks overlap with: (1) known repeated sequences (identified by RepeatMasker ), (2) CpG islands and (3) segmental duplications. While such comparisons can be performed via the UCSC Table Browser (i.e., use intersection between any two tracks), ChIPseeqerNongenicAnnotate facilitates these analyses and allows their integration with other analyses within the framework.
Cross-species conservation analysis is necessary in order to discover functional genomic elements (e.g., distal regulatory elements) and also to prioritize the most promising genomic elements for experimental validation. For these reasons, we have developed ChIPseeqerCons, a tool that estimates the conservation for a given set of peaks and outputs the peaks whose average conservation score is greater than a user-defined threshold (default is set to 0.5). The most useful aspect of this module, is estimating the conservation level of sequences adjacent to the input peaks, or of randomly selected sequences, thus allowing global assessment of peak conservation.
Another interesting feature of ChIPseeqerCons is producing conservation profiles for regions around the summit of the peaks (default is 2kb-long regions), and for random intervals: the average conservation score is estimated for every n-sized bins of the regions (default is n = 10 nucleotides). By plotting the resulting conservation profiles, we can easily compare the level of conservation between the input peaks and randomly generated genomic regions (see Figure 3). ChIPseeqerCons uses the phastCons  or phyloP  scores (freely available as tracks from the UCSC Genome Browser website), calculated from placental mammalian genomes or primates.
The analysis of read density profiles, when combined with clustering methods, can help identify groups of genes with similar binding profiles in their promoters, or groups of peaks that tend to have similar histone modification or TF binding patterns. In ChIPseeqer, we have developed tools that take as input a set of genomic regions and: (1) calculate the read density profile of the regions (split regions into bins and calculate the average read count within each bin), (2) count the maximum or average number of ChIPseq reads for each genomic region. These tools can also perform RPKM-style read count normalization  prior to read counting, in order to compare multiple experiments with different numbers of short reads.
ChIPseeqerDensityMatrix lets the users explore and analyse the average read density profiles, either for user-defined regions around the TSS or TES of the genes (default is set to 4 kb around the TSS), or around the summit of the given peaks (default is set to 2 kb window centred to the peak summit). For each region, bins of n nucleotides (default is 10 nucleotides) are created and the average number of reads for each bin is counted. For example, if 4 kb regions are extracted around the TSS of genes, the result of this module will be a matrix that contains for each promoter the average number of reads for 400 bins of 10 nucleotides each. Clustering the promoters of the resulting matrix is then performed based on their read density profiles, using either the built-in Self-Organizing Map algorithm , or by interfacing with Cluster 3.0 software [81, 82] (ChIPseeqerCluster). The results can be directly visualized using Postscript/PDF heatmaps produced using our built-in visualization tools or using TreeView  (included in the framework). Lists of genomic regions for each cluster can be exported and then used as input into other tools, in order to answer questions such as "Are there any regulatory elements associated with a given promoter binding pattern?" and "Which pathways discriminate between promoter binding patterns?".
ChIPseeqerReadCountMatrix estimates for each of the input peaks the maximum or average number of reads for multiple ChIP-seq datasets. The result of this module is a matrix that contains, for each of the input peaks, the maximum or average reads count for every ChIP-seq experiment. Similarly, the ChIPseeqerCluster module can be used to cluster the peaks of this matrix based on the reads number across multiple datasets, in order to reveal groups of peaks that share common binding in several TFs or histone modifications. The clusters of peaks can be extracted and used as input into other tools.
As more and more ChIP-seq datasets become publicly available, the need for data integration and comparison is becoming essential. Such integration can reveal how different TFs cooperate to regulate gene expression , as well as the interplay between TF binding and histone modifications [85, 86]. The integration between ChIP-seq datasets can be realized by determining the overlap between sets of peaks. In ChIPseeqer, we have addressed this need by implementing fast interval tree-based tools for comparing ChIP-seq experiments at the peak level (CompareIntervals). These tools can be used to compare sets of peaks, and quickly: (1) identify overlapping peaks, (2) merge sets of peaks, or (3) determine peaks in the first set that do not overlap with any peaks from the second set (i.e. find unique peaks). Moreover, as described in previous section, these tools can assess the significance of the overlap between two sets of peaks using randomization tests that take into account the genomic distribution of peaks. In addition to simply counting how many peaks overlap between two peak lists, we provide tools that can also quantify the extent to which two sets of peaks overlap by estimating the pairwise Jaccard similarity coefficient between pairs of ChIP-seq datasets (ChIPseeqerComputeJaccardIndex). The Jaccard index is estimated as the number of peaks that overlap between two peak files, divided by the union of the two files. The larger the coefficient, the more similar two datasets are in terms of overlapping peaks (see Figure 4D). Such comparisons can also be performed at the gene level; we have developed similar tools for gene-based comparisons (CompareGenes) that can be easily used on the genes-based output of ChIPseeqerAnnotate. Finally, the annotation of peaks against a collection of ENCODE ChIP-seq datasets (ChIPseeqerNongenicAnnotate), as well as the read density analysis across multiple datasets (ChIPseeqerReadCountMatrix), both described in previous paragraphs, were also developed in the context of integration and comparison of multiple ChIP-seq experiments.
Visualization is tightly integrated to all modules of the framework in order to facilitate ChIP-seq data exploration and summarize the results of each analysis. ChIPseeqer includes tools for creating UCSC Genome Browser tracks representing peak location and genome-wide read densities. It also includes tools for drawing pie charts summarizing the genomic distribution of the peaks, creating motif and pathway enrichment tables and conservation plots (see Figure 1, Figure 4C). The output of clustering read density profiles can be visualized either using heatmaps, or 2D Kohonen maps . Finally, we provide tools (ChIPseeqerPlotAverageReadDensityInGenes, ChIPseeqerPlotAveragePeaksNumberInGenes) for the visualization of reads density and peaks number in gene parts (e.g., promoters, exons, introns). The description of these tools and examples of the visualization they provide can be found at the ChIPseeqer tutorial .
The ChIPseeqer framework can considerably facilitate the bioinformatics analysis of ChIP-seq data by providing an integrated suite of computational tools that are fast, easy to use (no programming experience required), and can be combined with each other. The variety of tools and the flexibility offered by their parameters (Figure 4A) makes it possible to address most biological questions that are often raised when analyzing ChIPseq datasets. Notably, as demonstrated before, ChIPseeqer users can create personalized workflows in order to perform specific but sophisticated analysis, often requiring integration of multiple datasets (Figure 4D).
ChIPseeqer is a continuously developing project and we are actively working on implementing several additional components. For example, more species will be supported soon (e.g., C. elegans, zebrafish, chicken, rat), as well as more visualization components. Moreover, facilitating the integration of ChIP-seq data with gene expression data, obtained from microarray or RNA-sequencing (RNA-seq) experiments, also represents an important avenue for future improvement of the framework.
In order to fill the gap between the identification of ChIP-seq peaks and the biological interpretation of the data, we have developed ChIPseeqer, a comprehensive computational framework that can be adapted to a user's needs and to the hypotheses of a ChIP-seq study. We showed that using the ChIPseeqer framework we can perform sophisticated analyses of ChIP-seq datasets (e.g., compare and integrate peak/gene lists), explore the data from multiple perspectives (e.g., conservation, motifs occurrence, pathways enrichment), and address specific biological questions, such as "How do promoter peaks differ from distal peaks?", "Are there genes with both promoter and enhancer peaks?". We believe that this framework will be of great assistance to investigators who wish to perform high-level analysis of genome-wide ChIP-seq datasets, but do not yet possess advanced computer programming skills.
ChIPseeqer is freely available and can be downloaded at http://physiology.med.cornell.edu/faculty/elemento/lab/CS_files/ChIPseeqer-2.0.tar.gz The system requirements, instructions on how to install and run the software, and a detailed tutorial are also provided at http://physiology.med.cornell.edu/faculty/elemento/lab/chipseq.shtml The ETS1 and CBPdatasets used in this paper can be found at the GEO (GSE17954), and the H3K4me1 and H3K4me3 datasets are available at http://dir.nhlbi.nih.gov/papers/lmi/epigenomes/hgtcell.aspx
Chromatin Immunoprecipitation followed by DNA microarray hybridization
Chromatin Immunoprecipitation followed by sequencing
Graphical user interface
Transcription end site
Transcription start site
The authors would like to thank all members of Melnick lab (Weill Cornell Medical College) for their suggestions and ideas during the development of the framework and the Elemento lab members (Institute of Computational Biomedicine, Weill Cornell Medical College) for fruitful discussions. In addition, they wish to acknowledge the three anonymous reviewers, as well as Vasileios P. Kemerlis, for their suggestions in the revision of this manuscript. This work was supported by the CAREER grant from National Science Foundation (DB1054964), as well as by startup funds from the Institute for Computational Biomedicine, Weill Cornell Medical College.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.