CAGEfightR: Cap Analysis of Gene Expression (CAGE) in R/Bioconductor

We developed the CAGEfightR R/Bioconductor package for analyzing CAGE data. CAGEfightR allows for fast and memory-efficient identification of transcription start sites (TSSs) and predicted enhancers. Downstream analyses, including annotation, quantification, visualization and TSS shape statistics, are implemented in easy-to-use functions. The package is freely available at https://bioconductor.org/packages/CAGEfightR


Introduction
Transcription start sites (TSSs) are central to gene regulation research. Cap Analysis of Gene Expression (CAGE) is a popular platform for genome-wide identification of TSSs, based on sequencing the first 20-30 bp of capped full-length RNAs, called CAGE tags [1]. Mapped to a reference genome, CAGE tags identify the location and measure the expression level of TSSs independent of reference transcript models [2]. CAGE tags can also be used to predict active enhancers based on bidirectional transcription initiation of enhancer RNAs (eRNAs) [3].
Currently available tools for analyzing CAGE data (e.g. [4][5][6]) were developed for older versions of the CAGE protocol and/or are stand-alone tools. Here, we present the CAGEfightR R/Bioconductor package, which makes Bioconductor packages developed for RNA-Seq, ChIP-Seq and microarray analysis available for CAGE analysis. Compared to existing Bioconductor CAGE packages (e.g. CAGEr [7], TSRchitect (http://doi.org/10.18129/B9.bioc.TSRchitect)), CAGEfightR uniquely offers enhancer prediction and novel methods for robust tag clustering, annotation and visualization, all implemented using standard Bioconductor classes.

Methods
CAGEfightR functionality is illustrated using CAGE data from mouse lung tissue [8] and HeLa cells [9]; see Supplementary material.

Results
The input to a CAGEfightR analysis is a set of BigWig files that count the occurrences of 5' ends of CAGE tags at individual genomic base pairs, referred to as CAGE TSSs (CTSSs) [2].
CAGEfightR uses sparse matrices to store and manipulate large CTSS datasets with little memory; it can analyze tens of samples on a normal laptop and hundreds of samples on a typical server. CAGEfightR analyzes CAGE data at three levels: CTSS-, cluster- and gene-level (Fig. 1A), described below.
CAGEfightR provides efficient functions for importing multiple CTSS files and for quantifying and normalizing CTSS counts to Tags-Per-Million (TPM). The TPM values can be summed across libraries to yield a pooled CTSS signal. Nearby CTSSs are commonly grouped into clusters for downstream analyses: CAGEfightR can find unidirectional tag clusters (TCs) for gene TSS identification and bidirectional clusters (BCs) for enhancer prediction.
TCs are found by searching for CTSSs with a signal above a chosen threshold, and then merging nearby CTSSs on the same strand into clusters (Fig. 1A-B). The threshold can be tuned to avoid excessive merging of TSSs caused by many singleton CTSSs, and/or individual CTSSs can be discarded prior to clustering if they are not detected in a certain number of samples.
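The TPM normalization and pooling described above can be sketched in a few lines. This is a minimal illustration in Python, not the package's implementation: the function names `tags_per_million` and `pooled_signal` and the toy libraries are hypothetical, and a dictionary holding only non-zero positions stands in for the sparse matrices used by the package.

```python
from collections import defaultdict

def tags_per_million(counts):
    """Scale one library's CTSS counts so that they sum to one million."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {site: 1e6 * n / total for site, n in counts.items()}

def pooled_signal(libraries):
    """Sum per-library TPM values site by site to obtain a pooled CTSS signal."""
    pooled = defaultdict(float)
    for counts in libraries:
        for site, tpm in tags_per_million(counts).items():
            pooled[site] += tpm
    return dict(pooled)

# Hypothetical toy data: keys are (chromosome, position, strand),
# values are raw tag counts; positions with zero tags are simply absent.
lib_a = {("chr1", 100, "+"): 3, ("chr1", 101, "+"): 1}
lib_b = {("chr1", 100, "+"): 1, ("chr1", 250, "-"): 1}
pooled = pooled_signal([lib_a, lib_b])
```

Normalizing each library before summing means that a deeply sequenced library does not dominate the pooled signal.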
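The thresholding and merging steps can be illustrated as follows. This is a simplified sketch assuming a pooled signal keyed by (chromosome, position, strand); the function name `cluster_unidirectional` and the parameters `threshold` and `merge_distance` are hypothetical and are not taken from the package.

```python
def cluster_unidirectional(pooled, threshold=1.0, merge_distance=20):
    """Keep CTSSs above `threshold`, then merge same-strand CTSSs that lie
    within `merge_distance` bp of the previous retained CTSS."""
    # Sort by chromosome, strand, position so that mergeable sites are adjacent.
    kept = sorted(
        (chrom, strand, pos, score)
        for (chrom, pos, strand), score in pooled.items()
        if score > threshold
    )
    clusters = []
    for chrom, strand, pos, score in kept:
        last = clusters[-1] if clusters else None
        if (last is not None
                and last["chrom"] == chrom
                and last["strand"] == strand
                and pos - last["end"] <= merge_distance):
            # Extend the current cluster and accumulate its pooled signal.
            last["end"] = pos
            last["score"] += score
        else:
            clusters.append({"chrom": chrom, "strand": strand,
                             "start": pos, "end": pos, "score": score})
    return clusters

# Hypothetical pooled signal (arbitrary units).
pooled = {
    ("chr1", 100, "+"): 5.0,
    ("chr1", 105, "+"): 2.0,
    ("chr1", 110, "-"): 4.0,   # opposite strand: never merged with "+" sites
    ("chr1", 120, "+"): 0.5,   # below threshold: discarded before merging
    ("chr1", 300, "+"): 3.0,   # too far from the first cluster
}
tcs = cluster_unidirectional(pooled)
```

Raising `threshold` removes weak singleton CTSSs that would otherwise bridge two distinct TSSs into a single cluster.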
Unlike the approach in [3], which starts from an initial set of TCs, CAGEfightR scans the entire genome for BCs directly: upstream and downstream pooled CTSS signals are quantified for every genomic position. Next, the Bhattacharyya coefficient [10] is used to quantify the departure of the observed CTSS signal from perfect bidirectionality (Fig. S1A). Sites with a balance score above a threshold are identified, and nearby sites are merged into BCs (Fig. 1C), which can be used for enhancer prediction. CAGEfightR-predicted enhancers show enrichment for DNase hypersensitive sites and chromatin states in HeLa cells similar to that of the enhancers in [3] (Fig. S1B).
A set of TCs and/or BCs (Fig. 1A) can be quantified across all samples to yield a cluster-level expression matrix, which can be used as input to other Bioconductor expression analysis tools. It is often useful to annotate CAGE clusters in relation to known transcript models. Because of transcript complexity, clusters may simultaneously overlap annotated TSSs, exons and introns of different transcripts.
CAGEfightR resolves such conflicting annotations hierarchically, i.e. overlap with an annotated TSS is prioritized over overlap with a 5' UTR, etc.
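A balance score of this kind can be sketched as the Bhattacharyya coefficient between the observed distribution of signal around a site and an ideal bidirectional site. The four-component layout below (minus-strand upstream, plus-strand downstream, and the two opposite combinations) and the ideal distribution are assumptions made for illustration; the exact formulation used by the package is described in Fig. S1A and may differ.

```python
from math import sqrt

def bhattacharyya_coefficient(p, q):
    """Bhattacharyya coefficient of two discrete distributions (1 = identical)."""
    return sum(sqrt(a * b) for a, b in zip(p, q))

def balance_score(minus_upstream, plus_downstream,
                  plus_upstream=0.0, minus_downstream=0.0):
    """Similarity between the observed signal around a site and an assumed
    ideal bidirectional site (half minus-upstream, half plus-downstream)."""
    observed = [minus_upstream, plus_downstream, plus_upstream, minus_downstream]
    total = sum(observed)
    if total == 0:
        return 0.0  # no signal around this position
    observed = [x / total for x in observed]
    ideal = [0.5, 0.5, 0.0, 0.0]  # assumption: perfect divergent transcription
    return bhattacharyya_coefficient(observed, ideal)

balanced = balance_score(10.0, 10.0)               # equal divergent signal
one_sided = balance_score(20.0, 0.0)               # signal on one arm only
convergent = balance_score(0.0, 0.0, 5.0, 5.0)     # signal pointing inward
```

Under these assumptions the score is 1 for a perfectly balanced divergent site and decreases as the signal becomes one-sided or points towards the site, so thresholding the score retains only positions that resemble bidirectional initiation.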
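Hierarchical resolution amounts to walking a fixed priority list and returning the first category that a cluster overlaps. In the sketch below, only the ordering of annotated TSS before 5' UTR comes from the text; the remaining categories, their order, the fallback label and the function name `resolve_annotation` are hypothetical.

```python
# Only "annotated TSS" before "5' UTR" is taken from the text; the other
# categories and their relative order are illustrative assumptions.
PRIORITY = ["annotated TSS", "5' UTR", "exon", "intron"]

def resolve_annotation(overlaps, priority=PRIORITY, default="unannotated"):
    """Return the highest-priority category among those a cluster overlaps."""
    for category in priority:
        if category in overlaps:
            return category
    return default  # hypothetical label for clusters with no overlap

# A cluster overlapping an intron of one transcript and the TSS of another
# is assigned to the TSS category.
label = resolve_annotation({"intron", "annotated TSS"})
```

Each cluster therefore receives exactly one label, regardless of how many transcripts it overlaps.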

CAGEfightR includes functions for calculating statistics that distinguish broad from sharp TSS distributions (the interquartile range and TC position entropy (Fig. S1D)), and a framework for implementing custom, user-supplied shape statistics.
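Both shape statistics can be computed from the per-position pooled signal within a TC. The sketch below assumes that the signal is given as a list with one value per bp; the function names `shape_iqr` and `shape_entropy` and the exact quantile convention are illustrative choices rather than the package's definitions.

```python
from math import log2

def shape_entropy(signal):
    """Shannon entropy (bits) of the positional distribution of signal in a TC."""
    total = sum(signal)
    if total <= 0:
        raise ValueError("TC has no signal")
    probs = [x / total for x in signal if x > 0]
    return -sum(p * log2(p) for p in probs)

def shape_iqr(signal, lower=0.25, upper=0.75):
    """Distance in bp between the positions at which the cumulative signal
    first reaches the `lower` and `upper` fractions of the total."""
    total = sum(signal)
    if total <= 0:
        raise ValueError("TC has no signal")
    cumulative = 0.0
    lower_pos = upper_pos = None
    for pos, x in enumerate(signal):
        cumulative += x
        if lower_pos is None and cumulative >= lower * total:
            lower_pos = pos
        if cumulative >= upper * total:
            upper_pos = pos
            break
    return upper_pos - lower_pos

sharp = [0, 0, 10, 0, 0]   # all signal at a single position
broad = [1, 1, 1, 1]       # signal spread evenly over four positions
```

A sharp TC has an interquartile range and entropy close to zero, whereas both statistics grow as the signal spreads over more positions. A user-supplied statistic would follow the same pattern: a function that takes the per-position signal of one TC and returns a single number.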
To capitalize on existing data and tools that depend on gene-level expression, it is sometimes useful to measure gene expression using CAGE data. CAGEfightR includes functions for summarizing TC expression within genes to obtain a gene-level expression matrix (Fig. 1A).
Another use of gene models is the study of alternative TSSs and differential TSS usage. CAGEfightR can filter lowly transcribed TCs prior to such analyses by discarding TCs that contribute e.g. <10% of total gene expression.
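Gene-level summarization is a sum of TC expression values within each gene, sample by sample. The following sketch assumes that each TC has already been assigned to at most one gene; the function name `summarize_genes` and the toy identifiers are hypothetical.

```python
def summarize_genes(tc_expression, tc_to_gene):
    """Sum TC expression vectors (one value per sample) within each gene."""
    genes = {}
    for tc, values in tc_expression.items():
        gene = tc_to_gene.get(tc)
        if gene is None:
            continue  # TC not assigned to any gene: excluded from gene totals
        if gene not in genes:
            genes[gene] = list(values)
        else:
            genes[gene] = [a + b for a, b in zip(genes[gene], values)]
    return genes

# Hypothetical counts for three samples.
tc_expression = {
    "tc1": [10, 0, 5],
    "tc2": [2, 3, 1],
    "tc3": [7, 7, 7],
    "tc4": [1, 1, 1],   # intergenic TC without a gene assignment
}
tc_to_gene = {"tc1": "geneA", "tc2": "geneA", "tc3": "geneB"}
gene_expression = summarize_genes(tc_expression, tc_to_gene)
```

The resulting gene-by-sample table has the same layout as a conventional RNA-Seq count matrix and can be passed to tools that expect gene-level input.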
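Filtering by contribution to gene expression can be sketched as follows. This simplified version uses a single pooled expression value per TC, whereas a per-sample criterion is equally conceivable; the function name `filter_minor_tcs` and the parameter `min_fraction` are hypothetical.

```python
def filter_minor_tcs(tc_expression, tc_to_gene, min_fraction=0.10):
    """Keep TCs that contribute at least `min_fraction` of their gene's total."""
    # First pass: total expression per gene.
    gene_totals = {}
    for tc, expr in tc_expression.items():
        gene = tc_to_gene[tc]
        gene_totals[gene] = gene_totals.get(gene, 0.0) + expr
    # Second pass: retain TCs whose share of the gene total is large enough.
    kept = {}
    for tc, expr in tc_expression.items():
        total = gene_totals[tc_to_gene[tc]]
        if total > 0 and expr / total >= min_fraction:
            kept[tc] = expr
    return kept

# Hypothetical pooled expression values.
tc_expression = {"tc1": 95.0, "tc2": 5.0, "tc3": 50.0}
tc_to_gene = {"tc1": "geneA", "tc2": "geneA", "tc3": "geneB"}
kept = filter_minor_tcs(tc_expression, tc_to_gene)
```

Here `tc2` contributes 5% of the expression of its gene and is removed, leaving only TCs that plausibly represent genuine alternative TSSs.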

Conclusion
CAGEfightR is an R/Bioconductor package for the downstream analysis of CAGE data. CAGEfightR is the first single framework that robustly detects, quantifies, annotates and visualizes TSSs and enhancers from CAGE data in a manner that is highly compatible with other Bioconductor packages. The memory-efficient and scalable implementation allows CAGEfightR to be used on datasets ranging from small-scale experiments to consortia-level projects. While developed for CAGE data, CAGEfightR can analyze any similar type of sparse genomic data, e.g. PEAT, RAMPAGE and CapSeq, and, with some modification, also tag-based transcription assays, e.g. PRO-Seq, GRO-Seq and NET-Seq.