xcore: an R package for inference of gene expression regulators

Background Elucidating the Transcription Factors (TFs) that drive the gene expression changes in a given experiment is a common question asked by researchers. The existing methods rely on the predicted Transcription Factor Binding Site (TFBS) to model the changes in the motif activity. Such methods only work for TFs that have a motif and assume the TF binding profile is the same in all cell types. Results Given the wealth of the ChIP-seq data available for a wide range of the TFs in various cell types, we propose that gene expression modeling can be done using ChIP-seq “signatures” directly, effectively skipping the motif finding and TFBS prediction steps. We present xcore, an R package that allows TF activity modeling based on ChIP-seq signatures and the user's gene expression data. We also provide xcoredata a companion data package that provides a collection of preprocessed ChIP-seq signatures. We demonstrate that xcore leads to biologically relevant predictions using transforming growth factor beta induced epithelial-mesenchymal transition time-courses, rinderpest infection time-courses, and embryonic stem cells differentiated to cardiomyocytes time-course profiled with Cap Analysis Gene Expression. Conclusions xcore provides a simple analytical framework for gene expression modeling using linear models that can be easily incorporated into differential expression analysis pipelines. Taking advantage of public ChIP-seq databases, xcore can identify meaningful molecular signatures and relevant ChIP-seq experiments. Supplementary Information The online version contains supplementary material available at 10.1186/s12859-022-05084-0.

ReMap [6] and ChIP-Atlas [7] provide a wealth of uniformly processed ChIP-seq data (genome-wide peaks) for TFs but also other transcriptional regulators including transcriptional coactivators and chromatin-remodeling factors. Currently, only a limited number of tools exist that tap into these databases. Two examples are Lisa [8] identifying the most likely transcriptional regulators in an experiment based on usersupplied gene expression information, and Virtual ChIP-seq program [9] that can predict the binding of individual TF in a cell type of interest based on gene expression information. However, to our knowledge, there are no published methods that take advantage of this data to directly model the activity of transcriptional regulators.
Here, we propose to use the publicly available ChIP-seq data to directly represent the genome-wide occupancy of regulators. We intersected the peaks with promoter regions and used linear ridge regression to infer the regulators associated with observed gene expression changes (Fig. 1A). The advantage of this approach is the direct integration of gene expression profiles with experimental TF binding data. We provide (a) processed and pre-computed, ChIP-seq based molecular signatures (xcoredata), and (b) methodology for activity modeling (xcore). The framework is implemented as an R package (available in Bioconductor) and integrates smoothly with commonly used differential expression workflows like edgeR [10] or DESeq2 [11].

Expression data processing
Xcore takes promoter or gene expression counts matrix as input, the data is then filtered for lowly expressed features, normalized for the library size and transformed into counts per million (CPM) using edgeR [10]. Users need to designate the base-level samples by providing an experiment design matrix. These samples are used as a baseline expression when modeling changes in gene expression. xcore implements promoter-and gene-level analyses, using either promoter or gene expression data. In our experience we found promoter-level analysis to provide better results (Additional file 1: Fig. S1). Cap Analysis Gene Expression (CAGE) data is an input of choice for promoter level analysis. However, xcore can be used with other types of expression data such as microarray or RNAseq data to perform gene-level analysis. Promoter-level analysis based on RNA-seq data is possible in principle but currently not implemented.

Molecular signatures
A second input consists of molecular signatures describing known transcription factors' binding preferences within the promoter's vicinity. We provide sets of precomputed molecular signatures with xcoredata, the accompanying data package. The signatures were obtained by downloading all ChIP-seq data from ReMap2020 [6] and ChIP-Atlas [7] and intersecting it against ± 500 nt window of know promoter regions, defined based on FANTOM5's hg38 annotation [12]. The signatures can also be easily constructed using xcore by providing predicted TFBS or custom ChIP-seq peaks (see

Expression modeling
In xcore we describe the relationship between the expression (Y) and molecular signatures (X) using linear model formulation: where Y is a sample expression level, µ is the basal expression level, β 0 is the intercept, β j is a j-th molecular signature activity and X j is a j-th molecular signature.
Here, we are interested in finding the unknown molecular signatures' activities (β) that describe the effect of molecular signature (X) on expression (Y). By including µ in the above equation we effectively model the change in expression between the basal expression level and the corresponding sample. Models are trained using penalized linear regression. In particular, we use ridge regression [13] as it allows us to take advantage of an existing significance testing methodology [14]. We observed ridge regression to work equally well to lasso and elastic net regression (Additional file 2: Fig. S2C). In practice, to fit our linear models we use the popular R package glmnet [15]. For each sample, that is for each time point and replicate, a separate model is trained using sample change in expression and molecular signatures shared at the experiment scale. In layman's terms, for each sample, we are seeking to find a combination of ChIP-seq based signatures that best explains the observed changes in gene expression. For each model, the ridge regression λ tuning parameter is found separately using the cross-validation technique (CV). By default tenfold CV is used, and λ value giving the smallest mean squared error is selected.
Next, the estimated molecular signatures' activities can be tested for significance. In short, using matrix formulation the ridge regression estimator is defined as where X is our molecular signatures matrix, is a ridge regression tuning parameter, and Y is a vector of our sample's changes in expression. Then, the estimate of β λ standard error is calculated from the following: where ν is the residual effective degrees of freedom. The significance of the individual molecular signatures' activities can be then tested using a test of significance for ridge regression coefficients. For further details, we refer interested readers to [14].
To summarize the results from individual replicates, following the procedure described in [16], the obtained estimates and their standard errors are pooled across the replicates by calculating their weighted mean with variance-defined weights and weighted mean error: Using this result, we calculate a Z-score for each molecular signature and time-point. Finally, molecular signatures are ranked based on their overall Z-score across the timepoints calculated using Stouffer's Z method [17].

Linear regression models comparison
To compare different models, coefficients of determination (R 2 ) were calculated for models trained on individual replicates at selected time points using tenfold cross-validation and pooled across replicates. Additional information on this procedure is provided in Extended Materials and Methods (Additional file 3).

Results
We used xcore to perform gene expression modeling analysis in the context of three CAGE datasets: (a) newly generated transforming growth factor beta (TGFβ) induced epithelial-mesenchymal transition (EMT) experiment performed in A-549 and MDA-231-D cell lines, (b) previously published FANTOM5's rinderpest infection time-course dataset performed in 293SLAM and COBL-a cell lines using native and recombinant rinderpest virus lacking accessory V and C proteins [12], (c) previously published FAN-TOM5's Human H3 embryonic stem cells differentiated to cardiomyocytes time-course dataset [12] and a microarray dataset: previously published TGFβ induced EMT in A-549 cell line (GSE17708) [18]. Detailed information on the procedures used to process the raw CAGE data can be found in Extended Materials and Methods (Additional file 3).

ChIP-seq molecular signatures provides better model performance
We compared the models built using ChIP-seq signatures (ReMap2020 and ChIP-Atlas) vs motif-based signatures (Jaspar and SwissRegulon). The models based on ChIP-seq signatures showed on average higher R 2 values, which reflects the proportion of variance explained by the model and overall "goodness of fit". In particular, modeling expression between 0 and 24 h after TGFβ treatment in our novel MDA-231-D dataset yielded an average R 2 of 0.179 for ChIP-seq signatures and 0.077 for motif signatures. For comparison the randomized version of ReMap2020 molecular signature yielded R 2 close to 0 (Fig. 1B, Additional file 2: Fig. S2B).

Comparison with the state-of-the-art tools
We compared our results with state-of-the-art motif-based gene expression prediction framework ISMARA [1] and Lisa program which predicts the most likely transcriptional regulators from gene expression data based on ChIP-seq and chromatin accessibility data available in Cistrome Data Browser [25]. While ISMARA is conceptually similar and was inspirational to xcore, Lisa takes a different approach. Using a user supplied list of differentially expressed genes, Lisa first selects a subset of relevant experiments describing chromatin state (H3K27ac ChIP-seq or DNase-seq) using lasso regression. Next it identifies the most relevant TF using in-silico deletion technique [8]. To compare with our results, we used both tools on our novel TGFβ induced EMT, rinderpest infection and embryonic stem cells differentiated to cardiomyocytes datasets. We have run ISMARA in RNA-seq mode with a genome version hg38 and no miRNA using raw FASTQ files for our novel TGFβ induced EMT dataset and BAM files available in FANTOM5 study [12] mapped against genome version hg38 for the other datasets. To use Lisa we performed differential expression analysis using edgeR [10] between the most extreme time points in our time-course datasets. Then lists of 100 most significant up-(logFC > 0) and 100 most significant down-regulated (logFC < 0) genes were submitted to Lisa. Next, we compared the results from all tools with a list of related transcriptional regulators. We constructed lists of related transcriptional regulators for each dataset using Gene Ontology term regulation of epithelial to mesenchymal transition (GO:0010717), KEGG pathway Measles (map05162) and Gene Ontology term heart development (GO:0007507) by including only regulators available in the references of all tools. The number of EMT related transcriptional regulators recovered among the topscoring signatures was higher for xcore and Lisa than ISMARA (Table 1). In case of rinderpest infection ( Table 2) Lisa recovered the highest number of related TF in 293SLAM cell line. In the case of COBL-a and COBL-a rinderpest(-C) analyzes xcore found one more TF than ISMARA and Lisa. Finally, for embryonic stem cells differentiated to Table 1 Recovering epithelial to mesenchymal transition transcriptional regulators     (Table 3) Lisa was able to find the highest number of related TF, while xcore and ISMARA found the same number of related TF.

Conclusions
Xcore provides a flexible framework for integrative analysis of gene expression and publicly available TF binding data to unravel putative transcriptional regulators and their activities. Our analyses showed superior results when using ChIP-seq based signatures as compared to motifs-based ones. We attribute this difference to the presence of biotype-specific binding information which might be lost in motifs that describe more general transcription factor binding preferences. Despite high numbers of ChIP-seq signatures and redundancy, our machine learning framework is able to select biologically relevant signatures. In our comparison with motif-based ISMARA and ChIP-seq based Lisa, xcore performed competitively with those tools. Especially, both xcore and Lisa worked exceptionally well at recovering EMT-related transcriptional regulators. However, a comprehensive comparison of xcore with other tools would require further benchmarking efforts. Such efforts are currently hindered by the lack of standard benchmarking datasets for transcriptional regulators' inference problems. In conclusion, xcore is useful for generating testable hypotheses about the data and provides a novel way to connect gene expression data with relevant ChIP-seq experiments.