Zipper plot: visualizing transcriptional activity of genomic regions

Background Reconstructing transcript models from RNA-sequencing (RNA-seq) data and establishing these as independent transcriptional units can be a challenging task. Current state-of-the-art tools for long non-coding RNA (lncRNA) annotation are mainly based on evolutionary constraints, which may result in false negatives due to the overall limited conservation of lncRNAs. Results To tackle this problem we have developed the Zipper plot, a novel visualization and analysis method that enables users to simultaneously interrogate thousands of human putative transcription start sites (TSSs) in relation to various features that are indicative for transcriptional activity. These include publicly available CAGE-sequencing, ChIP-sequencing and DNase-sequencing datasets. Our method only requires three tab-separated fields (chromosome, genomic coordinate of the TSS and strand) as input and generates a report that includes a detailed summary table, a Zipper plot and several statistics derived from this plot. Conclusion Using the Zipper plot, we found evidence of transcription for a set of well-characterized lncRNAs and observed that fewer mono-exonic lncRNAs have CAGE peaks overlapping with their TSSs compared to multi-exonic lncRNAs. Using publicly available RNA-seq data, we found more than one hundred cases where junction reads connected protein-coding gene exons with a downstream mono-exonic lncRNA, revealing the need for a careful evaluation of lncRNA 5′-boundaries. Our method is implemented using the statistical programming language R and is freely available as a webtool. Electronic supplementary material The online version of this article (doi:10.1186/s12859-017-1651-7) contains supplementary material, which is available to authorized users.

package retrieves the "start" and "end" genomic coordinates of the closest ChIPseq/DNase-seq/CAGE-seq peak as depicted in this figure, with the "start" always being the part of the peak closest to the TSS: "A" represents a case where the closest peak overlaps with the TSS whereas "B" and "C" are cases where it does not. In "B", the closest peak is the one on the left hand side, since a distance of 1kb is smaller than 5kb. In "C", the closest peak is the one on the right hand side, since a distance of 2kb is smaller than 4kb.
Finally, peaks are ranked based on the distance from the TSS to the "start" of the closest peak and a Zipper plot is generated as described in the main manuscript.
Rationale for calculating both positive and negative distances between closest peaks and TSSs We have investigated this hypothesis as an argument to support the need of calculating AUZleft (negative distances between TSSs and closest peak) and AUZright (positive distances between TSSs and closest peak).
We used all lncRNAs (mono and multi-exonic) from Lncipedia 3.1 and retrieved 11,989 lncRNAs (unique TSSs) with a CAGE peak overlapping with the TSS. Next, we retrieved the closest H3K4me3 and DNase peaks (narrowPeak; across all sample types) for those and generated the following Zipper plots: From these plots is clear that the DNase marks (left) were mostly found a few nucleotides upstream the TSS (=negative distances in the plot; OX axis) whereas the H3K4me3 Finally, we plotted the distribution of distances between these two marks and found that for 9,177 lncRNAs (76.54%) the H3K4me3 mark was found downstream the DNase mark (=distances greater than 0): This result is a useful example where differences between positive and negative distances are clear. Importantly, the width of the grid for each Zipper plot is determined by the (closest) peak furthest away from the TSS among all retrieved ones. Looking at the image below, AUZ CASE_1 seems bigger than AUZ CASE_2 while it should be smaller (marks in CASE_1 are closer to TSS than marks in CASE_2): This issue is due to the difference in grid areas: to compare two AUZ values directly, both need to come from a grid with the same width. This can be achieved by re-scaling the AUZ from the case (or cases, if we are comparing more than 2 Zipper plots) with the Small AUZ values, areas in the plot and how AUZ_window is calculated (Fig 3D) As explained in the previous section, the width of the grid for each Zipper plot is determined by the (closest) peak furthest away from the TSS among all retrieved ones.
Regarding the Zipper plot for the 21,000 mono-exonic lncRNAs, we found that there are few cases where the closest CAGE-seq peak is several Mb away (x-axis) from the TSS.
If we also take into account that we are studying thousands of lncRNAs simultaneously (y-axis), this results in a grid area of the order of 10 9 . Since 76.24% of the mono-exonic lncRNAs have a CAGE peak within 50kb from the TSS (very small value compared to 1Mb), this resulted in very small AUZ_global values.
The "actual areas in the plot" correspond to the AUZ_window values. By default, the Zipper plot is visualized in a +/-5kb window. In the case of the plot for the 21,000 monoexonic lncRNAs, the method virtually sets to 5kb (or other value if the user changes the default window size) all those distances that are located more than 5kb away from the TSS. Therefore, the y-axis extends down to 21,000.
AUZ_window values correspond to the AUZ value that users can "visually" infer from the plot visualized in the pre-defined (5kb) window.