Expression data processing
Xcore takes promoter or gene expression counts matrix as input, the data is then filtered for lowly expressed features, normalized for the library size and transformed into counts per million (CPM) using edgeR [10]. Users need to designate the base-level samples by providing an experiment design matrix. These samples are used as a baseline expression when modeling changes in gene expression. xcore implements promoter- and gene-level analyses, using either promoter or gene expression data. In our experience we found promoter-level analysis to provide better results (Additional file 1: Fig. S1). Cap Analysis Gene Expression (CAGE) data is an input of choice for promoter level analysis. However, xcore can be used with other types of expression data such as microarray or RNA-seq data to perform gene-level analysis. Promoter-level analysis based on RNA-seq data is possible in principle but currently not implemented.
Molecular signatures
A second input consists of molecular signatures describing known transcription factors’ binding preferences within the promoter's vicinity. We provide sets of precomputed molecular signatures with xcoredata, the accompanying data package. The signatures were obtained by downloading all ChIP-seq data from ReMap2020 [6] and ChIP-Atlas [7] and intersecting it against ± 500 nt window of know promoter regions, defined based on FANTOM5’s hg38 annotation [12]. The signatures can also be easily constructed using xcore by providing predicted TFBS or custom ChIP-seq peaks (see xcore user guide). Detailed information on the molecular signatures construction can be found in Extended Materials and Methods (Additional file 3).
Expression modeling
In xcore we describe the relationship between the expression (Y) and molecular signatures (X) using linear model formulation:
$$Y = \mu + \beta_{0} + \beta_{1} X_{1} + \cdots + \beta_{p} X_{p}$$
where Y is a sample expression level, µ is the basal expression level, β0 is the intercept, βj is a j-th molecular signature activity and Xj is a j-th molecular signature.
Here, we are interested in finding the unknown molecular signatures’ activities (β) that describe the effect of molecular signature (X) on expression (Y). By including µ in the above equation we effectively model the change in expression between the basal expression level and the corresponding sample. Models are trained using penalized linear regression. In particular, we use ridge regression [13] as it allows us to take advantage of an existing significance testing methodology [14]. We observed ridge regression to work equally well to lasso and elastic net regression (Additional file 2: Fig. S2C). In practice, to fit our linear models we use the popular R package glmnet [15]. For each sample, that is for each time point and replicate, a separate model is trained using sample change in expression and molecular signatures shared at the experiment scale. In layman’s terms, for each sample, we are seeking to find a combination of ChIP-seq based signatures that best explains the observed changes in gene expression. For each model, the ridge regression λ tuning parameter is found separately using the cross-validation technique (CV). By default tenfold CV is used, and λ value giving the smallest mean squared error is selected.
Next, the estimated molecular signatures’ activities can be tested for significance. In short, using matrix formulation the ridge regression estimator is defined as
$$\hat{\beta }^{\lambda } = \left( {X^{\prime } X + \lambda I} \right)^{ - 1} X^{\prime } Y$$
where \(X\) is our molecular signatures matrix, \(\lambda\) is a ridge regression tuning parameter, and \(Y\) is a vector of our sample’s changes in expression. Then, the estimate of βλ standard error is calculated from the following:
$$Var\left( {\beta^{\lambda } } \right) = \hat{\sigma }^{2} (X^{\prime } X + \lambda I)^{ - 1} X^{\prime } X(X^{\prime } X + \lambda I)^{ - 1} ,$$
$$\hat{\sigma }^{2} = \frac{{\left( {Y - X\hat{\beta }^{\lambda } } \right)^{\prime } \left( {Y - X\hat{\beta }^{\lambda } } \right)}}{\nu }$$
where ν is the residual effective degrees of freedom. The significance of the individual molecular signatures' activities can be then tested using a test of significance for ridge regression coefficients. For further details, we refer interested readers to [14].
To summarize the results from individual replicates, following the procedure described in [16], the obtained estimates and their standard errors are pooled across the replicates by calculating their weighted mean with variance-defined weights and weighted mean error:
$$\overline{x} = \frac{{\mathop \sum \nolimits_{i = 1}^{n} \frac{{x_{i} }}{{\sigma_{i}^{2} }}}}{{\mathop \sum \nolimits_{i = 1}^{n} \frac{1}{{\sigma_{i}^{2} }}}},\quad \sigma_{{\overline{x}}} = \sqrt {\frac{1}{{\mathop \sum \nolimits_{i = 1}^{n} \frac{1}{{\sigma_{i}^{2} }}}}}$$
Using this result, we calculate a Z-score for each molecular signature and time-point.
Finally, molecular signatures are ranked based on their overall Z-score across the time-points calculated using Stouffer’s Z method [17].
Linear regression models comparison
To compare different models, coefficients of determination (R2) were calculated for models trained on individual replicates at selected time points using tenfold cross-validation and pooled across replicates. Additional information on this procedure is provided in Extended Materials and Methods (Additional file 3).