TIANA: transcription factors cooperativity inference analysis with neural attention

Abstract

Background

Growing evidence suggests that distal regulatory elements are essential for cellular function and states. The sequences within these distal elements, especially motifs for transcription factor binding, provide critical information about the underlying regulatory programs. However, cooperativities between transcription factors that recognize these motifs are nonlinear and multiplexed, rendering traditional modeling methods insufficient to capture the underlying mechanisms. Recent developments in attention mechanisms, which exhibit superior performance in capturing dependencies across input sequences, make them well suited to uncovering and deciphering intricate dependencies between regulatory elements.

Result

We present Transcription factors cooperativity Inference Analysis with Neural Attention (TIANA), a deep learning framework that focuses on interpretability. In this study, we demonstrated that TIANA could discover biologically relevant insights into co-occurring pairs of transcription factor motifs. Compared with existing tools, TIANA showed superior interpretability and robust performance in identifying putative transcription factor cooperativities from co-occurring motifs.

Conclusion

Our results suggest that TIANA can be an effective tool to decipher transcription factor cooperativities from distal sequence data. TIANA can be accessed through: https://github.com/rzzli/TIANA.

Background

In multicellular organisms, identical DNA is present in almost all cells, yet diverse gene expression profiles lead to the development of various cell types and functions. Cis-regulatory elements (CREs) such as enhancers play a pivotal role in driving cell type-specific gene expression by orchestrating a complex interplay of multiple transcription factors (TFs) [1]. Enhancers serve essential functions in modulating gene expression profiles to suit specific cell types and states. These enhancer sequences harbor binding motifs for various TFs, and when the correct combinations of these factors converge and bind to the sequences, they initiate a series of events that result in the activation of enhancers and, subsequently, their target genes [2, 3]. Combinatorial TF binding constitutes an intricate regulatory mechanism for enhancer selection and activation, ensuring that only the appropriate genes are activated within specific cell types and states [2, 4,5,6]. Consequently, only specific sets of enhancers containing binding motifs for the TFs needed for specific cellular states are activated. Notably, combinatorial cooperativities between TFs during enhancer selection and activation are highly dynamic, often contingent upon various environmental signals that alter TF activity and expression.

These cooperativities can result from different modes of DNA binding. For example, some TF pairs, such as GATA1 and TAL1, form complexes that interact in a cooperative manner with composite recognition motifs exhibiting strict spacing relationships [7]. More often, combinations of TFs that exhibit interdependent binding relationships are observed at regulatory elements at which their corresponding DNA recognition motifs exhibit relaxed spacing relationships, such as the macrophage lineage-determining factors PU.1 and C/EBP family members [4, 7]. Most enhancer and promoter elements typically exhibit displacement of nucleosomes over a range of ~ 150 to 300 bp, suggesting that these regions can potentially be occupied by multiple factors relying on fixed and relaxed combinations of motifs. The combinatorial nature of enhancer regulation, involving the convergence of multiple TFs, underscores a unique opportunity to investigate regulatory mechanisms via enhancer sequence analysis.

Over the past decade, many state-of-the-art tools, such as Homer and the MEME suite, have been highly successful in identifying overrepresented TF motifs within regulatory sequences [4, 8]. While these tools provide critical insights into TF motif occurrence in regulatory sequences, they focus on individual motifs and are therefore limited in their ability to infer TF cooperation from CREs.

Recent advancements in deep learning, particularly convolutional neural networks and natural language processing (NLP) techniques, have shown promise in unraveling previously undiscovered regulatory mechanisms. Among them, multi-head attention (MHA) enables a model to simultaneously attend to different segments of the input sequence using multiple attention heads. This facilitates the learning of both short- and long-range dependencies by allowing individual heads to capture local patterns and aggregate information across distant parts of the sequence. Moreover, through the parallel processing of multiple attention heads, MHA enables the learning of diverse representation subspaces. The versatility of MHA makes it an ideal method for studying genomic regulatory mechanisms, given the diverse ways regulatory elements interact. Notably, a recently developed deep learning model, SATORI, used MHA mechanisms to decipher TF cooperativities [9]. While its findings revealed significant cooperativities, several limitations need to be addressed. First, like most deep learning applications for genomic tasks, the initial convolution layer in the SATORI model is designed to learn activation patterns. While this approach can learn de novo DNA motifs, annotating the convolution filters with motif libraries can be ambiguous and challenging [9, 10]. Additionally, NLP researchers have demonstrated that the attention scores produced by attention units, the approach employed in SATORI, exhibit limited interpretability and are susceptible to false cooperativities [11, 12]. Instead, saliency-based techniques such as integrated gradients have been shown to address this shortcoming of attention scores [11].

Here, we present Transcription factors cooperativity Inference Analysis with Neural Attention (TIANA), an MHA-based framework to infer combinatorial TF cooperativities from epigenomic data (ATAC-seq and ChIP-seq). To address the limitations of existing tools, TIANA uses known motif weights to initialize convolution filters, easing the interpretation challenge by allowing convolution filter activations to be directly associated with known TF motifs. In addition, TIANA uses integrated gradients to interpret TF interdependencies from the attention units. We tested TIANA's ability to recover TF co-binding motif pairs from ChIP-seq data, demonstrating that TIANA could identify key co-occurring TF motif pairs. We extended our validation to three lineage-determining TF (LDTF) ChIP-seq datasets and identified LDTF co-occurrences that are consistent with previous literature. Lastly, we applied TIANA to the Kdo2-lipid A (KLA)-activated bone marrow-derived macrophage (BMDM) enhancer dataset, demonstrating that TIANA can recover the highly divergent functions associated with NF-κB factors. These results suggest that TIANA is a highly robust tool that aims to assist in the identification of specific combinations of TFs that specify cell- and signal-specific transcriptional outcomes.

Implementation

Overview of TIANA

Figure 1A shows the overall model architecture and training strategy. TIANA consists of two primary steps. In the first step (denoted the Training Step), it employs a deep learning framework to classify positive sequences (such as ATAC-seq and ChIP-seq peak sequences) from size-matched negative sequences (random genomic sequences). This classification step uses a neural network architecture that consists of one convolutional layer preloaded with known motif information and one self-attention layer. Attention heads allow the model to simultaneously focus on different aspects of the input sequence, enabling it to capture diverse patterns and relationships. In the second step (denoted the Interpretation Step), TIANA infers TF cooperativities using integrated gradients, which computes gradients on the attention heads (denoted attention attributes) of the fully trained model from the previous step. TIANA quantifies these cooperativities by first identifying active motifs in the input sequences; next, it links active motifs using the attention attributes. Lastly, TIANA compares attention attributes between positive and negative sequences to identify motif pairs with significant attention levels, from which TF cooperativities are inferred. A schematic overview of the files required to run TIANA can be found in Figure S1. Briefly, a peak file (e.g., in BED format) from a ChIP-seq/ATAC-seq experiment is required from the user. TIANA provides the motif files as well as the random genomic sequences used during training. The output of TIANA consists of an HTML report of the significance levels of motif pairs, a corresponding CSV file, and a list containing all identified motif pairs and their corresponding attention attributes.

Fig. 1

Overview of the TIANA implementation. A TIANA consists of two main steps. In the first Training Step, TIANA learns to classify one-hot encoded positive sequences from randomly generated negative sequences. Next, in the Interpretation Step, TIANA uses integrated gradients to compute attention attribute matrices from the MHA block, enabling the inference of TF co-occupancy. B In the Interpretation Step, active motifs are identified using motif score thresholds. As shown in the cartoon example in B, for a positive sequence, the TF1 motif ends at 80 bp, while the TF4 motif ends at 150 bp, colored in red. The attention attribute, highlighted in blue, connecting 80 bp and 150 bp is used to associate the two active motifs at these respective locations (shown as TF1 and TF4 linked with a blue edge). For each motif pair, TIANA compares the attention levels between all positive and negative sequences to determine whether the motif pair exhibits significantly higher attention in positive sequences than in random negative sequences

Architecture and training

TIANA was implemented using TensorFlow v2.9.1. The first component of TIANA is a binary classification deep learning framework. TIANA takes training data consisting of positive sequences and negative sequences. Positive sequences are enhancer, open chromatin, or TF binding peak sequences. Negative sequences are noncoding random genomic sequences. Positive and negative sequences were one-hot encoded as A: [1,0,0,0], C: [0,1,0,0], G: [0,0,1,0], T: [0,0,0,1]. All input sequences are 200 bp in size unless otherwise specified, and TIANA can tolerate input sizes up to 1000 bp. For the data used in this study, chr1, chr8, and chr18 were held out as the evaluation testing set; the other chromosomes were used for training. Training performance for the datasets used in this study is listed in Supplementary Table 1.
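
For illustration, a minimal Python sketch of this encoding scheme (the function and dictionary names are ours, not TIANA's API; the handling of ambiguous bases is an assumption):

```python
import numpy as np

# Illustrative sketch of the one-hot scheme described above; not TIANA's code.
BASE_TO_COL = {"A": 0, "C": 1, "G": 2, "T": 3}

def one_hot_encode(seq: str) -> np.ndarray:
    """Encode a DNA sequence as a (len(seq), 4) one-hot matrix."""
    encoded = np.zeros((len(seq), 4), dtype=np.float32)
    for i, base in enumerate(seq.upper()):
        if base in BASE_TO_COL:  # ambiguous bases (e.g., N) remain all zeros
            encoded[i, BASE_TO_COL[base]] = 1.0
    return encoded

x = one_hot_encode("ACGT")
# x[0] -> [1, 0, 0, 0] (A); x[3] -> [0, 0, 0, 1] (T)
```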

The first layer of the TIANA model is a one-dimensional convolution layer that uses filters to identify short nucleotide patterns. The convolution layer is preloaded with known motif position-specific scoring matrices (PSSMs) and set to be fixed. We obtained known motifs from the JASPAR motif library [13, 14]. To reduce motif redundancy, we used a hierarchical clustering function from TBA to group highly similar motifs [15]. The Biopython motifs module was used to convert JASPAR-format motifs to PSSMs, which were loaded into the convolution layer, and to compute the motif score cutoffs. Unless otherwise specified, the convolution layer used in this study consists of 448 filters, comprising 224 motifs and their reverse complements.
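
The following is a sketch of one way to preload PSSMs into a fixed Keras Conv1D layer consistent with this description; zero-padding shorter motifs to a common filter width is our assumption, not a documented detail of TIANA:

```python
import numpy as np
import tensorflow as tf

# Sketch (not TIANA's exact code): initialize a Conv1D layer with known motif
# PSSMs and freeze it. `pssms` is assumed to be a list of (motif_length, 4)
# arrays; shorter motifs are zero-padded to a common filter width.
def build_motif_conv(pssms, filter_width):
    n_filters = len(pssms)  # e.g., 448 = 224 motifs + reverse complements
    kernel = np.zeros((filter_width, 4, n_filters), dtype=np.float32)
    for k, pssm in enumerate(pssms):
        kernel[: pssm.shape[0], :, k] = pssm  # (width, channels, filters)
    conv = tf.keras.layers.Conv1D(
        filters=n_filters,
        kernel_size=filter_width,
        padding="same",
        use_bias=False,
        trainable=False,  # preloaded weights stay fixed during training
    )
    conv.build(input_shape=(None, None, 4))
    conv.set_weights([kernel])
    return conv
```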

We used a vertical 1D max-pooling layer with a pooling size of 2 after the convolution layer to capture the highest activation score for each forward-reverse complement motif pair. By default, TIANA combines the highest forward and reverse complement motif scores at each position. TIANA also provides an optional approach that scores forward and reverse complement motifs separately, enabled by raising the --compute_rc flag when calling tiana.py. The vertical max-pooling layer was followed by a batch normalization layer and a 1D horizontal max-pooling layer with a pool size of 4 to reduce the computational overhead and improve model generalizability [16]. A detailed TIANA model summary can be found in Supplementary Table 2. For the examples used in this study, TIANA was trained for 50 epochs.
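
A minimal sketch of this pooling stack, assuming the filters are ordered so that each forward motif is immediately followed by its reverse complement (layer names are illustrative):

```python
import tensorflow as tf

# Sketch of the pooling stack described above (assumed filter ordering:
# motif_1_fwd, motif_1_rc, motif_2_fwd, motif_2_rc, ...).
def forward_rc_max(activations):
    # activations: (batch, positions, 2 * n_motifs)
    shape = tf.shape(activations)
    paired = tf.reshape(activations, (shape[0], shape[1], -1, 2))
    return tf.reduce_max(paired, axis=-1)  # keep the stronger strand score

batch_norm = tf.keras.layers.BatchNormalization()
horizontal_pool = tf.keras.layers.MaxPool1D(pool_size=4)  # 4x position downsampling
```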

Positional encoding

We used a positional encoding method based on sine and cosine functions to preserve positional information [17]. Suppose that the input from the previous layer is \(X_{\text{input}}\in {R}^{m\times f}\), where \(m\) is the sequence size and \(f\) is the total number of motifs/filters. The positional encoding matrix \(P\in {R}^{m\times f}\) shares the same dimensions as \(X_{\text{input}}\). For each element \({P}_{i,2j}\) or \({P}_{i,2j+1}\), the positional embedding values are computed as:

$${P}_{i,2j}=\sin\left(\frac{i}{{n}^{2j/f}}\right)$$
$${P}_{i,2j+1}=\cos\left(\frac{i}{{n}^{2j/f}}\right)$$

where \(n\) is a scalar factor with a default value of 10,000. The output matrix can be written as \(X_{\text{output}}=X_{\text{input}}+P\).
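
A direct NumPy transcription of these formulas, as a sketch (TIANA's own implementation may differ in detail):

```python
import numpy as np

def positional_encoding(m: int, f: int, n: float = 10_000.0) -> np.ndarray:
    """P[i, 2j] = sin(i / n^(2j/f)); P[i, 2j+1] = cos(i / n^(2j/f))."""
    P = np.zeros((m, f), dtype=np.float32)
    i = np.arange(m)[:, None]            # positions 0..m-1
    two_j = np.arange(0, f, 2)           # even column indices 2j
    angles = i / np.power(n, two_j / f)  # shape (m, ceil(f/2))
    P[:, 0::2] = np.sin(angles)
    P[:, 1::2] = np.cos(angles[:, : (f // 2)])
    return P

# X_output = X_input + positional_encoding(m, f)
```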

Multi-head attention unit

The output of the positional encoding was then fed into the multi-head attention block, which consists of Query (\(Q\)), Key (\(K\)), and Value (\(V\)) matrices. The QKV matrices provide the framework for the attention mechanism, with the Query matrix generating queries, the Key matrix providing contextual information, and the Value matrix holding the actual content, collectively facilitating effective attention computation and learning. TIANA uses one MHA layer with four attention heads. For each attention head \(h\), the score matrix \({S}_{h}\in {R}^{m\times m}\) is computed as:

$${S}_{h}=\text{softmax}\left(\frac{Q{K}^{T}}{\sqrt{{d}_{k}}}\right)$$

where \({d}_{k}\) is the dimension of the Key matrix \(K\).

The output of the MHA unit is computed by taking the dot product of the attention score matrix \(S\) with the value matrix \(V\):

$$A=S \times V$$
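
For concreteness, a sketch of a single attention head following the two formulas above (the dense projection layers are illustrative; TIANA stacks four such heads):

```python
import tensorflow as tf

def attention_head(x, d_k):
    # x: (batch, m, f), the positionally encoded convolution output
    Q = tf.keras.layers.Dense(d_k)(x)
    K = tf.keras.layers.Dense(d_k)(x)
    V = tf.keras.layers.Dense(d_k)(x)
    logits = tf.matmul(Q, K, transpose_b=True) / tf.sqrt(tf.cast(d_k, tf.float32))
    S = tf.nn.softmax(logits, axis=-1)  # (batch, m, m) attention score matrix
    A = tf.matmul(S, V)                 # head output A = S x V
    return A, S
```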

The output of the MHA unit was then passed through a flatten layer, followed by two fully connected layers with 256 and 64 units, respectively. The output of the fully connected layers was then fed into a single-unit output layer with a sigmoid activation function for binary classification.

Integrated gradients

We used integrated gradients of the attention matrix \(S\) to measure TF cooperativities. We denote \(G\in {R}^{m\times m}\) as the integrated gradients attention attribute matrix for each attention score matrix \(S\). Using the methodologies described previously [11, 18], \(G\) can be approximated as:

$$G=\frac{S}{a}\odot \sum_{k=1}^{a}\frac{\partial F\left(\frac{k}{a}S\right)}{\partial S}$$

where \(a\) is the number of steps in the Riemann approximation (default 100), \(F\) is TIANA's classification function, \(\frac{\partial F}{\partial S}\) is the gradient of \(F\) with respect to the attention score matrix \(S\), and \(\odot\) denotes elementwise multiplication.
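
A minimal sketch of this Riemann approximation; forward_from_attention, a function that recomputes TIANA's output \(F\) from a scaled attention matrix, is a hypothetical stand-in for the trained network:

```python
import tensorflow as tf

def attention_attribution(S, forward_from_attention, a=100):
    """Approximate G = (S / a) * sum_{k=1}^{a} dF((k/a) * S)/dS."""
    grad_sum = tf.zeros_like(S)
    for k in range(1, a + 1):
        scaled = (k / a) * S
        with tf.GradientTape() as tape:
            tape.watch(scaled)
            F = forward_from_attention(scaled)  # hypothetical forward pass
        grad_sum += tape.gradient(F, scaled)
    return (S / a) * grad_sum  # elementwise multiplication
```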

Quantifying transcription factor cooperativities

Figure 1B provides a schematic of the process used to quantify TF motif cooperativities after the Training Step. First, for each sequence, active motifs are identified as those whose activation values from the convolution layer exceed the motif score thresholds computed previously with the Biopython motifs module at a p-value threshold of 1e−4. The positional information of activated motifs is preserved in the attention attribute matrices \(G\). Next, for each sequence, TIANA computes max attention attributes by taking the position-wise maximum over the attention attribute matrices across all attention heads, producing the matrix \({G}^{max}\in {R}^{m\times m}\).

The max attention attribute value \({G}_{i,j}^{max}\) was used to link two active motifs at positions \(i\) and \(j\), for all \(i,j\in \left\{1,2,\dots ,m\right\}\). In the cartoon example shown in Fig. 1B, the TF1 and TF4 motifs are active at positions 80 bp and 150 bp, so TF1 and TF4 are linked using the attention attribute value \({G}_{\text{80,150}}^{max}\) (blue edge connecting TF1 and TF4 in Fig. 1B). For each sequence, if a TF motif is active at multiple locations, so that a motif pair has multiple attention attributes, the highest attribute value was used in the subsequent significance testing step. Summary statistics, including the number of peaks used in each dataset and the number of motif pairs with attention attributes, are listed in Supplementary Table 3.
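
A sketch of this linking step under our own data-structure assumptions, with G as a (heads, m, m) array of attribution matrices and active mapping each motif to the positions at which it is active:

```python
import numpy as np

def link_motif_pairs(G: np.ndarray, active: dict) -> dict:
    """For one sequence, keep the highest attention attribute per motif pair."""
    G_max = G.max(axis=0)  # position-wise max over attention heads: (m, m)
    pair_attr = {}
    for m1, positions1 in active.items():
        for m2, positions2 in active.items():
            if m1 >= m2:
                continue  # count each unordered pair once
            best = max(G_max[i, j] for i in positions1 for j in positions2)
            pair_attr[(m1, m2)] = best  # highest attribute in this sequence
    return pair_attr
```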

Identifying significant motif pairs

For each TF motif pair, we compared the attention attributes across all peaks between positive and negative sequences. To eliminate noise from low-quality predictions, positive peaks with sigmoid activation less than 0.7 and negative peaks with activation higher than 0.3 from the Training Step were removed from the significance testing. We used the Mann–Whitney U test to assess the significance of the difference in attention attributes for a TF pair between positive and negative sequences. All p-values were corrected using the Bonferroni method. The output of TIANA reports all significant (corrected p-value < 1e−4) TF pairs from the positive sequences, sorted by p-value.
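
A sketch of this testing procedure, assuming pos_attr and neg_attr map each motif pair to its per-sequence attention attributes after the activation-based filtering above:

```python
from scipy.stats import mannwhitneyu

def test_motif_pairs(pos_attr: dict, neg_attr: dict, alpha: float = 1e-4) -> list:
    n_tests = len(pos_attr)
    significant = []
    for pair, pos_vals in pos_attr.items():
        # one-sided test: higher attention in positive sequences
        _, p = mannwhitneyu(pos_vals, neg_attr[pair], alternative="greater")
        p_adj = min(p * n_tests, 1.0)  # Bonferroni correction
        if p_adj < alpha:
            significant.append((pair, p_adj))
    return sorted(significant, key=lambda item: item[1])  # sorted by p-value
```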

Sequencing data processing

ATAC-seq and H3K27ac ChIP-seq data for KLA-activated BMDMs were downloaded from GEO (accession: GSE109965) [19]. LDTF ChIP-seq data for ATF3, SPI1, and CEBPB were downloaded from GEO (accessions: GSE111856 and GSE109965) [15, 19]. ETS1 and RUNX1 co-binding data were downloaded according to Martin et al. [20] from ENCODE (ID: ENCSR588AKU for RUNX1 and ID: ENCSR000BKQ for ETS1). Oct4/Sox2 ChIP-seq data were downloaded from Malik et al. [21] (accession: GSE103980). TF co-binding ChIP-seq data were obtained from the ENCODE data portal, as described in Shen et al. [7]. Sequencing reads were mapped using Bowtie2, with either human GRCh38/hg38 or mouse GRCm38/mm10 as the reference genome [22]. We used Homer findPeaks to identify peaks and IDR to obtain reproducible peaks [23]. Peaks were resized to 200 bp and converted to BED format before being one-hot encoded for training and testing with TIANA.

Enhancers were called from ATAC-seq and H3K27ac ChIP-seq data using a method similar to one described previously [24]. First, we identified reproducible open chromatin peaks using the ATAC-seq data. Next, we used the Homer annotatePeaks.pl function to identify peaks with > 16 H3K27ac tag counts. Finally, we selected regions with a > 2.5-fold change in H3K27ac, which we marked as “KLA-activated” and “KLA-repressed” peaks.
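
A sketch of this selection logic in pandas; the file and column names are illustrative, not the actual annotatePeaks.pl output headers:

```python
import pandas as pd

# Hypothetical column names for H3K27ac tag counts before/after KLA treatment.
peaks = pd.read_csv("atac_peaks_annotated.txt", sep="\t")
acetylated = peaks[peaks[["h3k27ac_notx", "h3k27ac_kla"]].max(axis=1) > 16]
fold_change = acetylated["h3k27ac_kla"] / acetylated["h3k27ac_notx"]
kla_activated = acetylated[fold_change > 2.5]      # "KLA-activated" peaks
kla_repressed = acetylated[fold_change < 1 / 2.5]  # "KLA-repressed" peaks
```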

Comparative methods

We compared TIANA to SATORI, another MHA-based model used to infer TF cooperativities. We followed the method described in SATORI to extract the attention score matrix, which we compared with the attention attribute matrix from TIANA. Since the attention scores (SATORI) and attention attributes (TIANA) cannot be directly compared due to their different data distributions, we computed ranks of the attention scores and attention attributes for comparison.
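
For instance, a percentile-rank transform along the following lines places the two quantities on a common scale (a sketch of the general idea, not the exact procedure):

```python
import numpy as np
from scipy.stats import rankdata

def to_percentile_ranks(values):
    """Map raw attention values to percentile ranks in [0, 1]."""
    values = np.asarray(values)
    return rankdata(values) / values.size  # ties receive averaged ranks
```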

Results

TIANA prediction performance

We compared the classification performance of TIANA with two other deep learning models, SATORI and DeepSEA. Both TIANA and SATORI utilize convolution and multi-head attention units in their architectures, while DeepSEA employs convolutional layers in its network. Figure 2A shows the receiver operating characteristic (ROC) curves for TIANA, SATORI, and DeepSEA. In KLA-activated BMDM enhancer peaks and EGR1_ETV5 co-binding ChIP-seq peaks in K562 cells, all three models performed well in classification, with area under the ROC curve (AUC) scores exceeding 0.9.

Fig. 2

Comparisons of classification performance using SATORI, DeepSEA, and TIANA. A ROC curves showing the performance of SATORI, DeepSEA, and TIANA on KLA-activated enhancers in BMDMs (denoted KLA) and EGR1_ETV5 co-binding peaks in K562 cells (denoted EGR1_ETV5). Solid, dashed, and dotted lines represent the performance of TIANA, SATORI, and DeepSEA, respectively. B Changes in AUC scores when using 25% of the original training peaks. Random downsampling was performed five times for each model/sample and compared to the performance of the full model in A. Error bars represent 95% confidence intervals

Next, because assays such as ChIP-seq and ATAC-seq typically yield a wide range of peak counts, from a few thousand to tens of thousands across cell types and conditions, it is necessary to evaluate the robustness of TIANA's learning performance, especially with a low number of input peaks. Thus, we retrained the TIANA, SATORI, and DeepSEA models with only 25% of the original training sequences through random downsampling, while using the same peaks for testing (all peaks in chr1, 8, and 18), and compared the change in performance across these models. As seen in Fig. 2B, SATORI and DeepSEA tend to exhibit a greater decrease in classification performance when trained with only 25% of the original input sequences.

We reason that SATORI and DeepSEA require sufficient input sequences to learn convolution filters that represent motifs, while TIANA only needs to learn TF cooperativities on top of preloaded motif weights. This comparison supports the notion that TIANA's approach of preloading motif weights has minimal impact on overall classification performance while remaining robust with limited input data.

Identification of TF cooperativities in embedded motifs

To benchmark TIANA, we conducted simulation experiments in which we embedded three motif pairs, SPI1-SIX1, SPI1-CEBPE, and RXRA-NRF1, into genomic sequences using approaches similar to those previously published [9, 25]. In summary, we randomly selected 40,000 DNA sequences and embedded one pair of motifs (referred to as Motif 1 and Motif 2) into 20,000 sequences, which we denote as positive sequences. For the remaining 20,000 sequences, which served as negative sequences, we embedded either Motif 1 or Motif 2, but not both, in each sequence. We repeated this simulation five times for each of the motif pairs.
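
For illustration, a minimal sketch of such an embedding scheme (our own construction; the published simulations [9, 25] define the exact sampling details):

```python
import numpy as np

rng = np.random.default_rng(0)
BASES = np.array(list("ACGT"))

def sample_motif_instance(ppm, rng):
    """Draw one motif instance from a (length, 4) base-probability matrix."""
    return "".join(BASES[rng.choice(4, p=row)] for row in ppm)

def embed(seq, motif, rng):
    """Insert a motif instance at a random position within the sequence."""
    start = rng.integers(0, len(seq) - len(motif))
    return seq[:start] + motif + seq[start + len(motif):]
```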

For all the embedded motif pairs, TIANA successfully identified the embedded pairs as significant motif pairs, with p-values equal to or close to zero and close to 100% classification performance. The results of the simulation experiments are summarized in Supplementary Table 4.

TIANA identifies TFs for collaborative binding

TAL1 and GATA1 are a well-documented TF pair that dimerize upon binding to a WGATAA (GATA1) motif and a partial E-box (TAL1) motif located 7–9 bp apart. Prior experimental investigations also suggest that GATA1-TAL1, rather than any other TFs, has a pioneering role in setting up co-binding peaks and subsequent cellular functions [26]. Considering these experimental findings, it is reasonable to hypothesize that a robust computational model will identify TAL1 and GATA1 motifs, especially when located 7–9 bp apart, as the principal determinant of GATA1-TAL1 co-binding. We thus obtained previously identified co-binding peaks from a K562 cell line and compared the ability of TIANA and SATORI to recover TAL1-GATA1 co-occurrences and the corresponding attention levels. Table 1 summarizes the significance levels and motif pairs recovered by both models.

Table 1 p-values of the GATA1-TAL1 motif recovered from GATA1-TAL1 ChIP-seq co-binding peaks and percent of peaks with active GATA1-TAL1 motif filters located 7–9 bp apart

Although both methods successfully identified GATA1-TAL1 motif pairs as highly significant, SATORI identified only 8% of GATA1-TAL1 co-binding peaks with the corresponding motif pairs located 7–9 bp apart. In contrast, TIANA identified GATA1-TAL1 motif pairs located 7–9 bp apart in 68% of co-binding peaks. TIANA's finding agrees with previous results, which suggested that three-quarters of co-binding peaks bound by GATA1-TAL1 had motifs located 7–9 bp apart [26]. Figure 3A shows attention levels for motif pairs located 7–9 bp apart and > 10 bp apart for both models. TIANA exhibited higher levels of attention to TAL1-GATA1 at both constrained spacing (7–9 bp) and > 10 bp compared with SATORI. In contrast to the preloaded motif approach employed by TIANA, the convolution layer in SATORI learned multiple filters that each only partially matched the consensus TAL1 half E-box motif, consistent with previous reports [9, 26].

Fig. 3

Attention level comparison of TIANA and SATORI on TF co-binding motifs. A Attention levels for GATA1-TAL1 co-binding peaks, shown separately for motifs located 7–9 bp apart and > 10 bp apart. B Attention levels for six pairs of TFs that are known to have relaxed spacing relationships. (Mann–Whitney U test, ***: 4.00e−09 < p ≤ 4.00e−08; ****: p ≤ 4.00e−09)

To further validate whether TIANA can identify motif orientation-specific TF pairs, we analyzed the results from reverse complement-enabled outputs on GATA1-TAL1 co-binding peaks. As seen in Figure S2A, for GATA1-TAL1 motifs, TIANA predicts TAL1 upstream of GATA1 (or downstream if both motifs are reversed) with high attention levels; this result is consistent with prior experimental data showing that the TAL1 motif (half E-box) is frequently located 7–9 bp upstream of the GATA motif (WGATAA) [26]. Similarly, we applied TIANA to the ETS1-RUNX1 co-binding peaks used in Martin et al. [20]. As shown in Figure S2B, similar levels of attention attributes are observed across different orientations as well as spacings between motifs. The attention attributes identified here also align with the experimental findings of Martin et al. [20]. We also applied TIANA to SOX2-OCT4 motif pairs, a well-characterized TF pair that frequently heterodimerizes to perform essential biological functions [21, 27,28,29]. As shown in Figure S2C, similar to TAL1-GATA1, SOX2-OCT4 exhibits orientation and position preferences, consistent with prior findings [21, 27].

Next, we compared the attention performance of TIANA and SATORI on TF pairs that are known to co-bind with relaxed spacing relationships [7]. We obtained the co-binding peaks for each TF pair dataset and compared the attention levels (attention attributes from TIANA and attention scores from SATORI) that matched the corresponding motifs. Figure 3B shows the attention ranks for TIANA and SATORI, where TIANA shows consistently higher attention levels than SATORI. Although other TF motifs may influence the co-binding of these TF pairs, it is reasonable to believe that these motif pairs play the primary role in the co-binding peaks. This finding further demonstrates that TIANA has a superior ability to identify important TF cooperativities compared to existing tools.

Discoveries of motif dependencies in lineage-determining factors

LDTFs are TFs that are required to establish the epigenetic landscape of cell lineages, some of which have chromatin-opening abilities attributed to pioneer factors. Commonly known LDTFs for macrophages include SPI1, CEBP family members, and AP-1 factors. We therefore obtained ChIP-seq data for SPI1, CEBPB, and ATF3 (AP-1 family) and used TIANA to identify the top motif pairs for each LDTF. Figure 4 shows the top five results for each ChIP-seq dataset. The result for the PU.1 ChIP-seq suggests that PU.1 largely depends on its own SPI1/Ets-factor motifs, as evidenced by all other motifs needing to pair with an SPIC or other Ets-factor motif. This observation is consistent with the well-established knowledge that SPI1 acts as a potent chromatin modifier and the master regulator of the myeloid lineage, establishing the chromatin landscape that allows other LDTFs to bind [30]. The ATF3 ChIP-seq results show that AP-1 family motifs, such as Fos- and Jun-related motifs, are strongly associated with Ets motifs. Lastly, the results for the CEBPB ChIP-seq suggest that the CEBPB motif is strongly associated with the SPI1/Ets motif and the MEF2C motif. These findings are consistent with prior reports and confirm TIANA's ability to identify biologically relevant information [24].

Fig. 4

Top TF motif pairs identified from LDTF ChIP-seq data. A ATF3, B CEBPB, and C SPI1 ChIP-seq, ranked by p-values. The last three columns show the − log10 of the p-value for each motif pair, the mean attention levels in positive sequences versus negative sequences, and summary statistics (mean, standard deviation) of the spacing relationship between the motif pairs in positive sequences, respectively

Divergent roles of NF-κB in macrophage enhancers

Finally, we evaluated TIANA's performance on differentially activated and repressed enhancers in KLA-stimulated BMDMs. As shown in the schematics in Fig. 5A, the addition of KLA activates and represses different sets of enhancers in BMDMs. Figure 5B demonstrates that TIANA assigned significantly higher attention to the REL (p65)-NFkB (p50) heterodimer motif pair in KLA-activated enhancers than in KLA-repressed enhancers. Conversely, KLA-repressed enhancers show greater attention to the p50-p50 homodimer than KLA-activated enhancers. Figure 5C and D show the top significant motif pairs identified by TIANA. In addition to the divergent roles of NF-κB factors, in which heterodimers are preferred in KLA-activated enhancers and homodimers are dominant in KLA-repressed enhancers, TIANA also identified several key motifs in each group that are known to be associated with KLA-stimulated and KLA-repressed macrophages, such as SPI1 (SPI1/Ets) and AP-1 (MAFF and JUN) [24, 31].

Fig. 5

TIANA identifies divergent NF-κB factor functions in macrophage regulatory elements. A Schematics showing KLA-activated and KLA-repressed enhancer regions in bone marrow-derived macrophages. B Boxplot showing attention attribute levels for p65-p50 heterodimer motifs and p50-p50 homodimer motifs in KLA-activated and KLA-repressed enhancers. (Mann–Whitney U test, ****: p ≤ 4.00e−09) C Similar to Fig. 4, top TF motif pairs identified in KLA-activated enhancers. D Top TF motif pairs identified in KLA-repressed enhancers

This finding is consistent with previous experimental data showing that p65-p50 heterodimers are preferentially associated with KLA-activated enhancers, whereas p50 homodimers occupy KLA-repressed enhancers [32,33,34]. Due to the similarity between the REL and NFkB motifs, traditional motif enrichment tools cannot identify such divergent functions [24]. Existing tools with sufficient resolution to assign such divergent roles, such as MAGGIE, require data from multiple mouse strains to leverage genomic variant information [24], whereas TIANA can reach similar findings using data from a single mouse strain without additional strain variant information.

Discussion

This study demonstrates that TIANA is a robust tool for inferring cooperative TF co-occurrence from epigenomic data. TIANA was built to improve interpretation, as we recognize that interpretability can be a significant obstacle to the application of deep learning-based tools in genomics. We preloaded known motifs as convolution filter weights. Although this may limit the kinds of DNA patterns that can be learned in the training stage, we show that the "known motif" approach employed by TIANA still achieves satisfactory learning performance compared with similar methods such as DeepSEA and SATORI. Moreover, we also demonstrated that the "known motif" approach can tolerate a wide range of input sequence numbers without suffering significant performance drops.

Interestingly, our data suggest that the commonly used "learned motif" approach, employed by existing tools such as SATORI and DeepSEA, may have a limited ability to tolerate differences in motif sizes. To the best of our knowledge, there has yet to be a consensus on the appropriate filter size for convolution layers. Currently, commonly used convolution filter sizes range from 5 to 20 bp [9, 35,36,37]. However, TF motifs naturally differ greatly in size. For example, an experimentally validated dataset suggested that the TAL1/half E-box motif can be as short as 3–4 bp, while the CTCF motif is 19 bp [26, 38]. Having long motif filters in the architecture unavoidably allows other unwanted patterns to be learned/activated. On the other hand, short motif filters lead to ambiguity, since many distinct TFs, at least in part, share very similar patterns. Annotating filters to motifs has been frequently reported as a challenge to fully interpreting deep learning models used in genomics [9]. Using the TAL1 motif as an example, the filter learned by SATORI contains patterns much longer than the consensus motif. It is reasonable to believe that suboptimal filters will burden the annotation of filters with known motifs and limit the interpretability of attention scores/attributes, since inaccurate motifs/filters will lead to flawed attention between features.

The improved ability of TIANA to identify TF co-occurrence can also be attributed to the integrated gradients/saliency-based method employed in the model. Given that saliency-based methods have become an essential interpretation technique and are critical to understanding feature contributions in deep learning models [18, 39], TIANA demonstrates how they can also assist in the discovery of biologically relevant insights.

Our application of TIANA to LDTF ChIP-seq data and KLA-activated BMDM enhancers showed that TIANA can recover insights that were previously available only through experimental validation or through tools that leverage genomic variant information, such as MAGGIE. It should be noted that TIANA does not require genomic variants and thus can be adapted to a wide range of datasets with simple experimental setups. Moreover, unlike many existing tools that focus on discovering individual motifs/TFs, TIANA focuses on inferring significant TF cooperativity pairs among the co-occurring motifs in the input sequences. This is an advantage of TIANA because TF interdependency and/or exclusivity were largely unaccounted for previously. Collectively, we demonstrated that TIANA is an effective tool for investigating regulatory mechanisms hidden in regulatory sequences.

Conclusion

Sequences within regulatory elements often contain rich mechanistic insights into cell type-specific functions. Deep learning-based tools can be highly effective at deciphering the hidden information within these sequences, provided the model can be interpreted. We demonstrated that TIANA is an effective interpretation-oriented deep learning tool for inferring hidden TF cooperativity information embedded within nucleotide sequences.

Availability of data and materials

ATAC-seq and H3K27ac ChIP-seq data for KLA-activated BMDMs were downloaded from GEO (accession: GSE109965) [19]. LDTF ChIP-seq data for ATF3, SPI1, and CEBPB were downloaded from GEO (accessions: GSE111856 and GSE109965) [15, 19]. TF co-binding ChIP-seq data were obtained from the ENCODE data portal, as described in Shen et al. [7]. The DNA motif library used in this study is available from the JASPAR motif database at https://jaspar.elixir.no/downloads/. The code for TIANA is available from its GitHub repository at https://github.com/rzzli/TIANA.

References

  1. Allis CD, Jenuwein T. The molecular hallmarks of epigenetic control. Nat Rev Genet. 2016;17(8):487–500.


  2. Heinz S, Romanoski CE, Benner C, Glass CK. The selection and function of cell type-specific enhancers. Nat Rev Mol Cell Biol. 2015;16(3):144–54.


  3. Lagha M, Bothma JP, Levine M. Mechanisms of transcriptional precision in animal development. Trends Genet. 2012;28(8):409–16.


  4. Heinz S, Benner C, Spann N, Bertolino E, Lin YC, Laslo P, Cheng JX, Murre C, Singh H, Glass CK. Simple combinations of lineage-determining transcription factors prime cis-regulatory elements required for macrophage and B cell identities. Mol Cell. 2010;38(4):576–89.


  5. Rao S, Ahmad K, Ramachandran S. Cooperative binding between distant transcription factors is a hallmark of active enhancers. Mol Cell. 2021;81(8):1651-1665.e4.


  6. Wei B, Jolma A, Sahu B, Orre LM, Zhong F, Zhu F, Kivioja T, Sur I, Lehtiö J, Taipale M, Taipale J. A protein activity assay to measure global transcription factor activity reveals determinants of chromatin accessibility. Nat Biotechnol. 2018;36(6):521–9.


  7. Shen Z, Li RZ, Prohaska TA, Hoeksema MA, Spann NJ, Tao J, Fonseca GJ, Le T, Stolze L, Sakai M, Romanoski CE, Glass CK. Systematic analysis of naturally occurring insertions and deletions that alter transcription factor spacing identifies tolerant and sensitive transcription factor pairs. Elife. 2022;11:e70878.


  8. Bailey TL, Johnson J, Grant CE, Noble WS. The MEME suite. Nucleic Acids Res. 2015;43(W1):W39-49.


  9. Ullah F, Ben-Hur A. A self-attention model for inferring cooperativity between regulatory features. Nucleic Acids Res. 2021;49(13):e77–e77.


  10. Quang D, Xie X. DanQ: a hybrid convolutional and recurrent deep neural network for quantifying the function of DNA sequences. Nucleic Acids Res. 2016;44(11):e107–e107.


  11. Hao Y, Dong L, Wei F, Xu K. Self-Attention attribution: interpreting information interactions inside transformer. Proc AAAI Conf Artif Intell. 2021;35(14):12963–71.


  12. Serrano S, Smith NA. Is attention interpretable? In: ACL 2019—57th annual meeting of the association for computational linguistics; proceedings of the conference. 2019; pp. 2931–51. Preprint at https://arxiv.org/abs/1906.03731v1

  13. Fornes O, Castro-Mondragon JA, Khan A, Van Der Lee R, Zhang X, Richmond PA, Modi BP, Correard S, Gheorghe M, Baranašić D, Santana-Garcia W, Tan G, Chèneby J, Ballester B, Parcy F, Sandelin A, Lenhard B, Wasserman WW, Mathelier A. JASPAR 2020: update of the open-access database of transcription factor binding profiles. Nucleic Acids Res. 2020;48(D1):D87-92.


  14. Rauluseviciute I, Riudavets-Puig R, Blanc-Mathieu R, Castro-Mondragon JA, Ferenc K, Kumar V, Lemma RB, Lucas J, Chèneby J, Baranasic D, Khan A, Fornes O, Gundersen S, Johansen M, Hovig E, Lenhard B, Sandelin A, Wasserman WW, Parcy F, et al. JASPAR 2024: 20th anniversary of the open-access database of transcription factor binding profiles. Nucleic Acids Res. 2023. https://doi.org/10.1093/nar/gkad1059.


  15. Fonseca GJ, Tao J, Westin EM, Duttke SH, Spann NJ, Strid T, Shen Z, Stender JD, Sakai M, Link VM, Benner C, Glass CK. Diverse motif ensembles specify non-redundant DNA binding activities of AP-1 family members in macrophages. Nat Commun. 2019. https://doi.org/10.1038/s41467-018-08236-0.


  16. Derry A, Krzywinski M, Altman N. Convolutional neural networks. Nat Methods. 2023;20(9):1269–70.


  17. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser Ł, Polosukhin I. Attention is all you need. Adv Neural Inf Process Syst. 2017;2017-December:5999–6009. Preprint at https://arxiv.org/abs/1706.03762v5

  18. Sundararajan M, Taly A, Yan Q. Axiomatic attribution for deep networks. In: 34th international conference on machine learning (ICML 2017); 2017. Preprint at https://arxiv.org/abs/1703.01365

  19. Link VM, Duttke SH, Chun HB, Holtman IR, Westin E, Hoeksema MA, Abe Y, Skola D, Romanoski CE, Tao J, Fonseca GJ, Troutman TD, Spann NJ, Strid T, Sakai M, Yu M, Hu R, Fang R, Metzler D, et al. Analysis of genetically diverse macrophages reveals local and domain-wide mechanisms that control transcription factor binding and function. Cell. 2018;173(7):1796-1809.e17.


  20. Martin V, Zhuang F, Zhang Y, Pinheiro K, Gordân R. High-throughput data and modeling reveal insights into the mechanisms of cooperative DNA-binding by transcription factor proteins. Nucleic Acids Res. 2023;51(21):11600–12. https://doi.org/10.1093/nar/gkad872.


  21. Malik V, Glaser LV, Zimmer D, Velychko S, Weng M, Holzner M, Arend M, Chen Y, Srivastava Y, Veerapandian V, Shah Z, Esteban MA, Wang H, Chen J, Schöler HR, Hutchins AP, Meijsing SH, Pott S, Jauch R. Pluripotency reprogramming by competent and incompetent POU factors uncovers temporal dependency for Oct4 and Sox2. Nat Commun. 2019;10(1):1–16.


  22. Langmead B, Salzberg SL. Fast gapped-read alignment with Bowtie 2. Nat Methods. 2012;9(4):357–9.


  23. Li Q, Brown JB, Huang H, Bickel PJ. Measuring reproducibility of high-throughput experiments. Ann Appl Stat. 2011;5(3):1752–79.


  24. Shen Z, Hoeksema MA, Ouyang Z, Benner C, Glass CK. MAGGIE: leveraging genetic variation to identify DNA sequence motifs mediating transcription factor binding and function. Bioinformatics. 2020;36(Supplement_1):i84-92.


  25. Greenside P, Shimko T, Fordyce P, Kundaje A. Discovering epistatic feature interactions from neural network models of regulatory DNA sequences. Bioinformatics. 2018;34(17):i629–37.


  26. Han GC, Vinayachandran V, Bataille AR, Park B, Chan-Salis KY, Keller CA, Long M, Mahony S, Hardison RC, Pugh BF. Genome-wide organization of GATA1 and TAL1 determined at high resolution. Mol Cell Biol. 2016;36(1):157–72. https://doi.org/10.1128/MCB.00806-15.


  27. Chang YK, Srivastava Y, Hu C, Joyce A, Yang X, Zuo Z, Havranek JJ, Stormo GD, Jauch R. Quantitative profiling of selective Sox/POU pairing on hundreds of sequences in parallel by Coop-seq. Nucleic Acids Res. 2017;45(2):832–45. https://doi.org/10.1093/nar/gkw1198.


  28. Aksoy I, Jauch R, Chen J, Dyla M, Divakar U, Bogu GK, Teo R, Leng Ng CK, Herath W, Lili S, Hutchins AP, Robson P, Kolatkar PR, Stanton LW. Oct4 switches partnering from Sox2 to Sox17 to reinterpret the enhancer code and specify endoderm. EMBO J. 2013;32(7):938–53. https://doi.org/10.1038/emboj.2013.31.


  29. Merino F, Ng CKL, Veerapandian V, Schöler HR, Jauch R, Cojocaru V. Structural basis for the SOX-dependent genomic redistribution of OCT4 in stem cell differentiation. Structure. 2014;22(9):1274–86.


  30. Barozzi I, Simonatto M, Bonifacio S, Yang L, Rohs R, Ghisletti S, Natoli G. Coregulation of transcription factor binding and nucleosome occupancy through DNA features of mammalian enhancers. Mol Cell. 2014;54(5):844–57.


  31. Kaikkonen MU, Spann NJ, Heinz S, Romanoski CE, Allison KA, Stender JD, Chun HB, Tough DF, Prinjha RK, Benner C, Glass CK. Remodeling of the enhancer landscape during macrophage activation is coupled to enhancer transcription. Mol Cell. 2013;51(3):310–25.


  32. Brignall R, Moody AT, Mathew S, Gaudet S. Considering abundance, affinity, and binding site availability in the NF-κB target selection puzzle. Front Immunol. 2019;10:609.


  33. Natoli G, Saccani S, Bosisio D, Marazzi I. Interactions of NF-κB with chromatin: the art of being at the right place at the right time. Nat Immunol. 2005;6(5):439–45.


  34. Cheng CS, Feldman KE, Lee J, Verma S, Bin HD, Huynh K, Chang M, Ponomarenko JV, Sun SC, Benedict CA, Ghosh G, Hoffmann A. The specificity of innate immune responses is enforced by repression of interferon response elements by NF-κB p50. Sci Signal. 2011. https://doi.org/10.1126/scisignal.2001501.


  35. Rombauts S, Déhais P, Van Montagu M, Rouzé P. PlantCARE, a plant cis-acting regulatory element database. Nucleic Acids Res. 1999;27(1):295–6.


  36. Avsec Ž, Agarwal V, Visentin D, Ledsam JR, Grabska-Barwinska A, Taylor KR, Assael Y, Jumper J, Kohli P, Kelley DR. Effective gene expression prediction from sequence by integrating long-range interactions. Nat Methods. 2021;18(10):1–8.


  37. Zhou J, Troyanskaya OG. Predicting effects of noncoding variants with deep learning–based sequence model. Nat Methods. 2015;12(10):931–4.


  38. Kulakovskiy IV, Vorontsov IE, Yevshin IS, Sharipov RN, Fedorova AD, Rumynskiy EI, Medvedeva YA, Magana-Mora A, Bajic VB, Papatsenko DA, Kolpakov FA, Makeev VJ. HOCOMOCO: towards a complete collection of transcription factor binding models for human and mouse via large-scale ChIP-Seq analysis. Nucleic Acids Res. 2018;46(D1):D252–9.


  39. Shrikumar A, Greenside P, Kundaje A. Learning Important Features Through Propagating Activation Differences. In: 34th international conference on machine learning (ICML 2017), vol. 7; 2017. pp. 4844–66. Preprint at https://arxiv.org/abs/1704.02685v2

Acknowledgements

The authors would like to thank Payam Saisan for providing critical feedback of the manuscript.

Funding

R.Z.L. and C.K.G. were supported by NIH grant NS096170 and DK091183. C.Z.H. is supported by the NIH K99 MH129983. The funding bodies had no direct roles in the design or execution of the study.

Author information

Contributions

RZL: conceptualization, software, data analysis, visualization, writing; CZH: conceptualization, analysis, visualization, supervision, writing; CKG: conceptualization, funding acquisition, project administration, supervision, writing.

Corresponding author

Correspondence to Christopher K. Glass.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Availability and requirements

Project name: TIANA. Project home page: https://github.com/rzzli/TIANA. Operating system(s): Platform independent. Programming language: Python. Other requirements: Python 3.8, Tensorflow 2.9.1, Keras 2.9.0, NumPy 1.22 or higher, Pandas 1.2.4 or higher, Matplotlib 3.3.2 or higher, Seaborn 0.12.2 or higher, Scikit-learn. License: MIT License. Any restrictions to use by non-academics: No restrictions.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

About this article

Cite this article

Li, R.Z., Han, C.Z. & Glass, C.K. TIANA: transcription factors cooperativity inference analysis with neural attention. BMC Bioinformatics 25, 274 (2024). https://doi.org/10.1186/s12859-024-05852-0

