An integrated tool for effective visualization of multiple set intersections
As the number of generated data sets grows, visualizing sets and their intersections becomes increasingly challenging, and there is a strong need for an integrated tool to compute and visualize intersections effectively. To address this challenge, we have developed Intervene, which is composed of three modules, accessible through the subcommands venn, upset, and pairwise. Intervene accepts two types of input files: genomic regions in BED, GFF, or VCF format and gene/name lists in plain text format. A detailed sketch of Intervene’s command line interface and web application utility, with the types of inputs, is provided in Fig. 1.
Intervene lets the user choose figure colors, label text, size, resolution, and type to produce publication-quality figures. To read the help for any module, the user can type intervene <subcommand> --help on the command line. Furthermore, Intervene produces results as text files, which can be easily imported into the web application for interactive visualization and customization of plots (see “An interactive web application” section).
Venn diagrams module
Venn diagrams are the classical approach to show intersections between sets. There are several web-based applications and R packages available to visualize intersections of up to six list sets in classical Venn, Euler, or Edwards’ diagrams [11,12,13,14,15,16]. However, very few tools are available to visualize genomic region intersections using classical Venn diagrams [5, 6].
Intervene provides up to six-way classical Venn diagrams for gene lists or genomic region sets. The associated web interface can also be used to compute the intersection of multiple gene sets and visualize it using different flavors of weighted and unweighted Venn and Euler diagrams. These types include classical Venn diagrams (up to five sets), Chow-Ruskey (up to five sets), Edwards’ diagrams (up to five sets), and Battle (up to nine sets).
As an example, one might be interested in calculating the number of overlapping ChIP-seq (chromatin immunoprecipitation followed by sequencing) peaks between different types of histone modification marks (H3K27ac, H3K4me3, and H3K27me3) in human embryonic stem cells (hESC) [17] (Fig. 2a; can be generated with the command intervene venn --test).
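To make the notion of "overlapping peaks" concrete, the following minimal Python sketch counts how many peaks of one set overlap at least one peak of another. It is an illustration only and is not Intervene's implementation: the function name count_overlapping, the toy coordinates, and the choice of half-open BED-style intervals with a one-base minimum overlap are all assumptions made for this example.

```python
from collections import defaultdict


def count_overlapping(peaks_a, peaks_b):
    """Count peaks in peaks_a that overlap at least one peak in peaks_b.

    Peaks are (chrom, start, end) tuples with half-open coordinates,
    as in BED files. Any overlap of one base or more counts (assumption).
    """
    # Group the second peak set by chromosome for faster lookup.
    by_chrom = defaultdict(list)
    for chrom, start, end in peaks_b:
        by_chrom[chrom].append((start, end))

    count = 0
    for chrom, start, end in peaks_a:
        # Two half-open intervals overlap if each starts before the other ends.
        if any(start < b_end and b_start < end
               for b_start, b_end in by_chrom.get(chrom, ())):
            count += 1
    return count


# Hypothetical toy peaks; real input would be parsed from BED files.
h3k27ac = [("chr1", 100, 200), ("chr1", 500, 600), ("chr2", 50, 150)]
h3k4me3 = [("chr1", 150, 250), ("chr2", 150, 300)]
print(count_overlapping(h3k27ac, h3k4me3))
```

The linear scan per peak keeps the sketch short; for genome-scale peak sets a sorted sweep or an interval tree would be the more practical choice.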
UpSet plots module
When the number of sets exceeds four, Venn diagrams become difficult to read and interpret. An alternative and more effective approach is to use UpSet plots to visualize the intersections. An R package with a ShinyApp (https://gehlenborglab.shinyapps.io/upsetr/) and an interactive web-based tool (http://vcg.github.io/upset) are available to visualize multiple list sets. However, to our knowledge, no tool is available to draw UpSet plots for genomic region set intersections. Intervene’s upset subcommand can be used to visualize the intersection of multiple genomic region sets using UpSet plots.
As an example, we show the intersections of ChIP-seq peaks for histone modifications (H3K27ac, H3K4me3, H3K27me3, and H3K4me2) in hESC using an UpSet plot, where intersections were ranked by frequency (Fig. 2b; can be generated with the command intervene upset --test). This plot is easier to understand than the four-way Venn diagram (Additional file 1).
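The quantity shown by the bars of an UpSet plot can be sketched for plain name lists as follows. This minimal Python example assigns each item to the exact combination of sets that contains it and ranks the combinations by frequency. It is a simplified illustration for list sets, not Intervene's code; the function name exclusive_intersections and the toy gene identifiers are hypothetical.

```python
from collections import Counter


def exclusive_intersections(sets):
    """Return (set-name combination, size) pairs, largest first.

    sets maps a set name to a Python set of items. Each item is counted
    once, under the exact combination of sets that contain it.
    """
    names = sorted(sets)
    counts = Counter()
    for item in set().union(*sets.values()):
        combo = tuple(name for name in names if item in sets[name])
        counts[combo] += 1
    # Rank by decreasing frequency; break ties by combination name.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical toy gene lists.
gene_sets = {
    "A": {"g1", "g2", "g3"},
    "B": {"g2", "g3", "g4"},
    "C": {"g3", "g5", "g6"},
}
for combo, size in exclusive_intersections(gene_sets):
    print("&".join(combo), size)
```

Because every item falls into exactly one combination, the sizes sum to the size of the union, which is a quick sanity check on any such table.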
Pairwise intersection heat maps module
With an increasing number of data sets, visualizing all possible intersections becomes infeasible with Venn diagrams or UpSet plots. One possibility is to compute pairwise intersections and plot the associated metrics as a clustered heat map. Intervene’s pairwise module provides several metrics to assess intersections, including number of overlaps, fraction of overlap, Jaccard statistics, Fisher’s exact test, and distribution of relative distances. Moreover, the user can choose from different styles of heat maps and clustering approaches.
As an example, we obtained the genomic regions of super enhancers in 24 mouse cell types and tissues from dbSUPER [18] and computed the pairwise intersections in terms of Jaccard statistics (Fig. 2c). The triangular heat map shows the pairwise Jaccard index, which ranges from 0 to 1, where 0 means no overlap and 1 means full overlap. The bar plot shows the number of regions in each cell type or tissue. This plot can be generated using the command intervene pairwise --test.
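For genomic regions, a Jaccard index is commonly defined on base pairs: the length of the intersection divided by the length of the union. The sketch below computes this for two interval sets on a single chromosome. It is a minimal illustration under stated assumptions (half-open coordinates, one chromosome, hypothetical function names and toy intervals) and does not reproduce Intervene's own computation.

```python
def merge(intervals):
    """Merge overlapping or touching half-open (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def covered(intervals):
    """Total number of bases covered by non-overlapping intervals."""
    return sum(end - start for start, end in intervals)


def intersection_length(a, b):
    """Bases shared by two sorted, non-overlapping interval lists."""
    i = j = total = 0
    while i < len(a) and j < len(b):
        lo = max(a[i][0], b[j][0])
        hi = min(a[i][1], b[j][1])
        if lo < hi:
            total += hi - lo
        # Advance whichever interval ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return total


def jaccard(a, b):
    """Base-pair Jaccard index: 0 means no overlap, 1 means full overlap."""
    a, b = merge(a), merge(b)
    inter = intersection_length(a, b)
    union = covered(a) + covered(b) - inter
    return inter / union if union else 0.0


# Hypothetical toy regions on one chromosome.
tissue_1 = [(0, 10), (20, 30)]
tissue_2 = [(5, 15), (20, 30)]
print(jaccard(tissue_1, tissue_2))
```

Applying jaccard to every pair of data sets yields the symmetric matrix that a triangular heat map displays.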
An interactive web application
Intervene comes with a companion web application to further explore and filter the results interactively. Intersections between large data sets can be computed locally using Intervene’s command line interface, and the output files can then be uploaded to the ShinyApp for further exploration and customization of the figures (Fig. 1).
The ShinyApp web interface takes four types of inputs: (i) a text/csv file where each column represents a set, (ii) a binary representation of intersections, (iii) a pairwise matrix of intersections, and (iv) a matrix of overlap counts. The web application provides several easy and intuitive customization options for responsive adjustments of the figures (Figs. 1 and 3). Users can change colors, fonts, and plot sizes, change labels, and select and deselect specific sets. These customized and publication-ready figures can be downloaded in PDF, SVG, TIFF, and PNG formats. The pairwise module also provides three types of correlation coefficients and hierarchical clustering with eight clustering methods and four distance measurement methods. It further provides interactive features to explore data values, either by hovering the mouse cursor over each heat map cell or by using a searchable and sortable data table. The data table can be downloaded as a CSV file and interactive heat maps can be downloaded as HTML. The Shiny-based web application is freely available at https://asntech.shinyapps.io/intervene.
Case study: highlighting co-binding factors in the MCF-7 cell line
Transcription factors (TFs) are key proteins regulating transcription through their cooperative binding to the DNA [19, 20]. To highlight Intervene’s capabilities, we used the command-line tool and its ShinyApp companion to predict and visualize cooperative interactions between TFs at cis-regulatory regions in the MCF-7 breast cancer cell line. Specifically, we considered (i) TF binding regions derived from uniformly processed TF ChIP-seq experiments compiled in the ReMap database [21] and (ii) promoter and enhancer regions predicted by chromHMM [22] from histone modifications and regulatory factors ChIP-seq [23]. The pairwise module of Intervene was used to compute the fraction of overlap between all pairs of ChIP-seq data sets and regulatory regions. The output matrix was provided to the ShinyApp to compute Spearman correlations of the computed values and to generate the corresponding clustered heat map (default parameters; Fig. 4). The largest cluster (green cluster) was composed of the three key cooperative TFs involved in oestrogen-positive breast cancers: ESR1, FOXA1, and GATA3. They were clustered with enhancer regions where they have been shown to interact [24]. The cluster also highlights potential TF cooperators: ARNT, AHR, GREB1, and TLE3. Promoter regions were found in the second largest cluster (red cluster), along with CTCF, STAG1, and RAD21, which are known to orchestrate chromatin architecture in human cells [25]. The last cluster was composed mainly of TFAP2C data sets. Taken together, Intervene visually highlighted the cooperation of different sets of TFs at MCF-7 promoters and enhancers, in agreement with the literature.
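The correlation step of this analysis can be illustrated with a short Python sketch that computes the Spearman correlation between two rows of a fraction-of-overlap matrix, i.e. the Pearson correlation of their ranks. This is a minimal sketch and not the ShinyApp's implementation; the function names, the handling of constant rows, and the toy overlap fractions are assumptions for illustration and are not values from this study.

```python
from math import sqrt


def ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; returns NaN if either vector is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return float("nan")
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(ranks(x), ranks(y))


# Hypothetical overlap fractions of two TFs against four region sets.
tf_1 = [0.80, 0.10, 0.55, 0.30]
tf_2 = [0.70, 0.05, 0.60, 0.20]
print(spearman(tf_1, tf_2))
```

Rows with similar overlap profiles obtain correlations close to 1, so a hierarchical clustering of the resulting correlation matrix places them in the same cluster.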