Below, we overview the core functionality of CAGEfightR with examples using previously published 5′-end data. Names of the main CAGEfightR functions for each analysis are indicated in Fig. 1, 2, 3, 4, 5. Genome-browser figures (Figs. 1b-c, 2d, 3b-c, 4c, 5d) are otherwise unedited output from R/CAGEfightR.
Analysis of 5′-end tags
CAGEfightR can import CTSSs from BigWig files and quantify their expression levels across all samples. The CTSSs can then be normalized to Tags-Per-Million (TPM) and summed across samples to yield a global or pooled CTSS signal (Fig. 1a, top). In case of a large number and/or low quality samples, CAGEfightR offers various strategies for calculating more robust pooled CTSSs signals, chiefly by filtering CTSSs only observed in a single or few samples. The pooled CTSS signal can be visualized in genome-browser style along the genome (Fig. 1b-c, 2d, 3c, 4c, 5d).
Analysis of tag clusters
Pooled CTSSs can be used to identify clusters of closely spaced CTSSs on the same strand, referred to as unidirectional clusters, or conventionally Tag Clusters (TCs) in most CAGE papers. CAGEfightR identifies TCs using a slice-reduce approach: First, CTSSs with pooled CTSS signal below a chosen threshold are discarded (slice) and surviving CTSSs on the same strand are then merged into clusters (reduce) (Fig. 1a-b). CAGEfightR includes a host of functions for analyzing such TCs, including filtering on expression, hierarchical annotation of TCs with transcript models and analysis of TC shapes (see below).
TCs reflect the fact that when RNA-polymerase associates with the DNA, it rarely initiates from a single bp, but rather from an array of nearby bps. Such arrays are expected to produce nearly identical RNAs that are subject to the same regulatory cues, as they will share the same promoter sequence and genomic neighborhood. While genes are transcribed from multiple different CTSSs, these CTSSs are grouped in a single or multiple TCs corresponding to the major RNAs (or transcripts/isoforms in RNA-Seq terminology) produced from the gene. Because of this, many studies use a simplification in which such TCs are referred to as TSSs for genes, even though technically TCs group several nearby bp-accurate TSS / CTSSs. To avoid confusion on terminology, and remain consistent with previous CAGE literature, we will primarily use the term `TC` to describe unidirectional clusters.
As an applied example, we analyzed three HeLa CAGE libraries from Andersson et al [47]. Using CAGEfightR, we calculated the pooled CTSS signal across libraries, and from that defined 22,760 TCs with > 1 TPM in at least two samples. As these TCs are defined de novo, it is useful to see how they relate to known transcript models. We therefore annotated the TCs according to University of California Santa Cruz (UCSC) transcripts using CAGEfightR’s hierarchical scheme (Fig. 2a). The hierarchical scheme accounts for the existence of multiple transcripts or isoform of genes, i.e. an annotated promoter of one transcript can be in the 5′-UTR of another transcript from the same gene. CAGEfightR can assign 9 different annotation categories (custom categories can be supplied by the user), based on the most likely association with known transcripts, e.g. a TC is more likely to correspond to an annotated promoter rather than a novel intragenic promoter.
Using this method, we plotted the number of TCs falling into the different annotation categories and their expression distributions (Fig. 2b). Most TCs candidates were found at annotated promoters and were generally highly expressed. However, a substantial number of novel (not overlapping annotated promoters) TCs were identified, in particular in the promoter-proximal region and 5′-UTR, which on average had lower expression than those overlapping annotated promoters.
In vertebrates, the distribution of CTSSs within TCs is related to cell specificity and DNA sequence properties: sharp CTSS distributions have an overrepresentation of TATA-boxes and are more cell- or tissue-specific, while broad CTSS distributions are GC-rich and more ubiquitously expressed [36]. Classification is often based on the width of the CTSS distribution, expressed as the interquartile/interquantile range (IQR), as this measures the width of the bulk of CTSSs within a TC without being affected by a few straggler CTSSs potentially greatly extending the width of the TC. We used CAGEfightR to calculate the genomic region covering the 5–95% IQR of total CTSS expression for each TC in the HeLa set (Similar results were obtained with tighter IQR intervals). This showed a clear bimodal distribution corresponding to sharp and broad CTSS distributions (Fig. 2c-d). Investigation of promoter regions using sequence logos also confirmed that sharp, but not broad distributions had a stronger TATA box (Fig. 2e).
Analysis of enhancer candidates
Active enhancers are characterized by bidirectional transcription initiation of eRNAs [48]. In 5′-end data, this manifests as bidirectional clusters (BCs) of CTSSs, which can be used to systematically identify enhancer candidates [21]. Similarly to above, CAGEfightR uses a slice-reduce approach to identify bidirectional clusters (BCs, as opposed to the previously discussed unidirectional TCs) to predict enhancers (Fig. 1a, c). First, the upstream and downstream pooled CTSSs are quantified for every genomic position. Second, the Bhattacharyya coefficient [49] is used to quantify the departure of the observed pooled CTSS signal from perfect bidirectionality, producing a bidirectionality or balance score for each bp (Additional file 1: Figure S1A). Third, locations with a balance score above a given threshold are identified, and nearby sites are merged into discrete BCs. This slice-reduce approach is conceptually similar to the original enhancer prediction method by Andersson et al. [21], but does not need an input seed of TCs used to find bidirectional pairs, and gives similar results while being more scalable (Additional file 1: Figure S1B-C). As BCs can be found at other genomic regions than active enhancers (e.g. bidirectional gene promoters), BCs in or near known exons can be filtered away to obtain a final set of enhancer candidates [21].
As an applied example, we used the same CAGE HeLa data set as above and identified BCs outside of exonic regions and more than 1 kbp upstream of annotated promoters (based on UCSC transcript models) as enhancer candidates. This resulted in a total of 6384 enhancer candidates (3780 intronic and 2604 intergenic).
As an initial validation step, we investigated whether enhancer candidates had the expected chromatin patterns compared to TCs, by overlapping with DNase I hypersensitive sites sequencing (DNase-Seq), H3K27ac, H3K4me3 and H3K4me1 ChIP-Seq signals from the same cell type. As expected, we observed high DNase sensitivity at enhancer midpoints and TC peaks, and higher levels of H3K27ac at TCs, compared to enhancer candidates. The ratio of H3K4me3 to H3K4me1 is often used to predict enhancers from ChIP-Seq signals, and consistent with this we observed low average H3K4me3 and high H3K4me1 signals around predicted enhancer candidates, and the opposite patterns around TCs (Fig. 3a) [50, 51].
Spatial prediction of enhancer-TSS links and enhancer clusters
An outstanding challenge in enhancer analysis is to link enhancers with target gene(s). Chromatin conformation capture data [52] is only available for a small set of cells, motivating computational prediction methods. A simple but popular linkage method is based on co-expression (correlation of expression) between enhancer candidates and genes across samples, assuming that true enhancer-gene pairs are co-expressed. This has been used for e.g. DNase and CAGE data, and serves as a reasonable hypothesis generator for physical interactions when the distance between enhancer and promoter is limited [21, 25, 53, 54]. CAGEfightR implements this approach by calculating the correlation of expression between enhancer candidates and TCs, with the option of supplying custom functions for calculating correlations in addition to the ones included in base R (Pearson, Spearman, Kendall).
As an example, we applied CAGEfightR to 50 CAGE samples obtained from colonic biopsies from ulcerative colitis and control subjects from Boyd et al. [25]. The reason for not using the HeLa set above was that correlations are more reliable if calculated across many samples. Specifically, we calculated a robust pooled CTSS signal (discarded CTSSs observed only in a single library), then selected TCs having > 1 TPM (and > 10% of total gene expression if they were assigned to genes, see below) in > 6 samples, and predicted enhancer candidates > = 1 CAGE tag in > 6 samples. This resulted in 31,480 TCs and 10,000 enhancer candidates. Using CAGEfightR, we computed the correlation (Kendall’s tau) between pairs of TCs and enhancer candidates within 10 kbp of each other, resulting in 19,271 positively correlated links between TCs and enhancers, where 978 were significant at FDR < 0.05. CAGEfightR supports the visualization of multiple enhancer candidate-TC links across a genomic region. As an example, Fig. 3b shows predicted enhancer-TC links around the TNFRSF1A gene, whose most dominant TC is most highly correlated with an intronic enhancer candidate in the neighboring gene.
Previously, large regions having enhancer-associated chromatin features were identified as drivers of central biological processes. Such regions are often referred to as “super”—or “stretch” enhancers [55]. Using CAGE data to predict enhancers, we have shown that many such chromatin-defined regions can be characterized as a group of bidirectionally transcribed loci, or a cluster of enhancer candidates [21, 25]. It follows that such larger regions can be predicted based on CAGE data, as genomic stretches with many enhancer candidates, and CAGEfightR implements methods for doing this. As an applied example, we used the enhancer candidates predicted based on the ulcerative colitis set above to identify enhancer clusters, defined as enhancers situated less than 12.5 kbp [25] from each other, resulting in 624 stretches with 4–24 enhancers per stretch. CAGEfightR can additionally calculate the average pairwise correlation between enhancer candidates in the stretch to reveal if they show concordant direction of change, indicative of joint activity, and provides methods to visualize these correlations (Fig. 3c).
Analysis in terms of known gene models
Although TCs can be identified de novo, it is useful to be able to analyze their expression across known gene models. Examples include the ability to compare 5′-end expression with RNA-Seq expression on gene level [56], or to link 5′-end expression estimates with gene-centric databases, such as Gene Ontology (GO) terms [57] or pathway/interaction annotation (Kyoto Encyclopedia of Genes and Genomes (KEGG) [58], STRING [51], etc.). CAGEfightR includes functions for annotating TCs to known genes and summarizing their expression within genes to obtain a gene-level expression matrix (Fig. 1a, bottom). This gene-level expression matrix can readily be used with other Bioconductor packages for gene-level analysis (e.g. limma [44], edgeR [33], DESeq2 [35]).
Another key use of gene models in relation to 5′-end methods is the analysis of alternative TSS or alternative promoter usage, which is a key contributor in generating transcript diversity (multiple different transcripts/isoforms from genes). This can be done by identifying genes harbouring several TCs on the same strand, with each TC giving rise to distinct RNAs. In this way, TCs can be seen as TSS candidates for the different transcripts/isoforms produced by a gene, phrasing the analysis in a similar way to alternative splicing or transcript usage for RNA-Seq. To be clear, this is different from analyzing changes in the distribution of CTSSs within a TC (see above), as different TCs in a gene will be widely spaced, have different promoter sequences and genomic neighbourhoods and produce different truncations of RNA, with potentially different regulation and function.
In addition to identifying such alternative TCs within gene models, CAGEfightR offers the option of filtering TCs within genes based on their contribution to overall gene expression: As 5′-end methods can detect very lowly expressed TCs, CAGEfightR can remove alternative TCs making up less than e.g. 10% of total gene expression in a given number of samples to focus only on the most highly expressed RNAs from a gene. This filtering approach is also useful when combining CAGEfightR with popular tools for differential transcript usage such as limma [44], edgeR [33], DEXSeq [45] and DRIMSeq [46], to investigate whether a given TC within a gene is preferentially used under certain conditions.
To illustrate these features in CAGEfightR, we used the ulcerative colitis set, and assigned TCs to genes using UCSC gene models and determined how much each such TCs contributed to overall gene expression. Without any composition filtering, 40% of all genes used more than one TC, falling to 23% when only considering TCs contributing more than 10% of total gene expression in more than 6 samples (Fig. 4a). The majority of discarded TCs were found in protein-coding, intronic and 3′-UTR regions (Fig. 4b), but several interesting cases remained, for example in the SLC16A5 gene, where we identified a highly expressed novel intronic TC (Fig. 4c).
Example of PRO-cap data analysis using CAGEfightR
While conceived as a tool for analyzing CAGE data, CAGEfightR can analyze any 5′-end data similar to CAGE, including nascent RNA 5′-end methods (Table 1). This is highly relevant since nascent 5′-end methods may be more sensitive in terms of enhancer detection, and hence are often used specifically for this purpose. To illustrate the usefulness of CAGEfightR for analyzing such data, we applied CAGEfightR to 59 lymphoblastoid cell line Precision Nuclear Run-on Sequencing for RNA Polymerase II Start Sites (PRO-Cap) libraries from Katla et al [59]. Similarly to the analysis of ulcerative colitis CAGE data above, robust pooled CTSSs (CTSSs observed in > 2 samples) were used to identify TCs (> 1 TPM in > 5 samples) and enhancer candidates (> 0 tags in > 5 samples), and annotated these using UCSC transcript models. Compared to the CAGE libraries above, a larger number of antisense TCs, intergenic TCs and enhancer candidates were detected (Fig. 5a). This is expected as these RNAs are subject to nuclear degradation and thus are more difficult to detect with steady-state RNA methods. PRO-Cap TCs showed the expected pyrimidine-purine di-nucleotide at positions − 1 + 1 (Fig. 5b), and enhancer candidates exhibited the characteristic H3K4me3 to H3K4me1 ratio (Fig. 5c). Figure 5d shows an example of an enhancer candidate detected by PRO-Cap.