Combining phylogenetic footprinting with motif models incorporating intra-motif dependencies

Background Transcriptional gene regulation is a fundamental process in nature, and the experimental and computational investigation of DNA binding motifs and their binding sites is a prerequisite for elucidating this process. Approaches for de-novo motif discovery can be subdivided in phylogenetic footprinting that takes into account phylogenetic dependencies in aligned sequences of more than one species and non-phylogenetic approaches based on sequences from only one species that typically take into account intra-motif dependencies. It has been shown that modeling (i) phylogenetic dependencies as well as (ii) intra-motif dependencies separately improves de-novo motif discovery, but there is no approach capable of modeling both (i) and (ii) simultaneously. Results Here, we present an approach for de-novo motif discovery that combines phylogenetic footprinting with motif models capable of taking into account intra-motif dependencies. We study the degree of intra-motif dependencies inferred by this approach from ChIP-seq data of 35 transcription factors. We find that significant intra-motif dependencies of orders 1 and 2 are present in all 35 datasets and that intra-motif dependencies of order 2 are typically stronger than those of order 1. We also find that the presented approach improves the classification performance of phylogenetic footprinting in all 35 datasets and that incorporating intra-motif dependencies of order 2 yields a higher classification performance than incorporating such dependencies of only order 1. Conclusion Combining phylogenetic footprinting with motif models incorporating intra-motif dependencies leads to an improved performance in the classification of transcription factor binding sites. This may advance our understanding of transcriptional gene regulation and its evolution. Electronic supplementary material The online version of this article (doi:10.1186/s12859-017-1495-1) contains supplementary material, which is available to authorized users.

In the following subsections we describe four additional studies to substantiate the findings presented in the manuscript and provide some statistics regarding the used datasets and the run-time of our approach.
In the first Results subsection of the main manuscript we study to which degree intra-motif dependencies can be captured using the proposed phylogenetic footprinting models (PFMs). In Supplementary Section 1.1 we study species-specific motifs to rule out the possibility that the estimated intra-motif dependencies presented in the manuscript result from a mixture of speciesspecific motifs. We find that species-specific motifs are highly similar for most transcription factors (TFs).
In the second Results subsection of the main manuscript and Figures 3 and 4 we show that taking into account base dependencies improves phylogenetic footprinting. In Supplementary Section 1.2 we study whether these results are robust to different species compositions. We find that the proposed PFM is robust with respect to different species compositions. Additionally, in Supplementary Section 1.3 we study to which degree intra-motif dependencies and phylogenetic dependencies contribute to the improved motif prediction performance. We confirm previous findings that modeling base dependencies as well as phylogenetic dependencies improves classification performance. We find that modeling base dependencies improves classification performance to a higher degree than modeling phylogenetic dependencies and we find that modeling both typically results in the best classification performance. In Section 1.4 we compare the classification performance of the proposed PFM to the classification performance of a state-of-the-art method by Eggeling et al. [1].

Species-specific motifs are highly similar for most TFs
Intra-motif dependencies may be a constant phenomenon conserved across the examined species or a rather dynamic phenomenon significantly changing during the evolution of these species. The latter case may imply that species-specific motifs are different to a certain degree. Consequently, the estimation of base dependencies across species may result in the estimation of spurious results. Hence, first we visually study differences among the species-specific motifs for each of the 35 TFs using difference logos, second we determine whether observable differences between speciesspecific motifs are significant or not, and third we examine the distribution of position-specific mutual informations (MIs) for each species. Therefore, we extracted for each of the 35 TFs one motif for each of the 10 species resulting in 35 × 10 species-specific motifs as described in Supplementary Section 2.1. Please note that the extracted species-specific motifs for other species than the reference species are not representative for these phylogenetically related species.

Primates show almost no differences in their sequence logos
We use the freely available R package DiffLogo for the visual inspection of motif differences [3]. DiffLogo enables the illustration and investigation of differences between highly similar motifs such as binding motifs of TFs from different experiments, different motif prediction algorithms, or different species. Hence, we use tables of difference logos generated with DiffLogo for a pair-wise comparison of all species-specific motifs. Each difference logo displays position-specific differences of base distributions by a stack of bases which height is proportional to the base distribution difference quantified by the Jensen-Shannon divergence. The Jensen-Shannon divergence is zero in case of two identical base distributions and 1 in case of two maximally different base distributions. The tables of difference logos for all 35 TFs can be found in Additional File 1. All sequence logos of PFM(0), PFM(1), and PFM(2) can be found in Additional File 3.
Exemplary, Supplementary Figure S1 shows the table of difference logos for the TF Bach1. We find that the species-specific motifs segregate into two main groups, where one group comprises seven higher primates and the second group comprises three species from the Laurasiatheria superorder, i.e., dog, horse, and cow. We find differences between the motifs of both groups at various motif positions, where the motif differences of relatively high degree are located at rather conserved motif positions as well as at more variable motif positions. For instance, we find relatively high differences at motif position 8, where guanine is more abundant in the primate motifs and the remaining bases are more abundant in the Laurasiatheria motifs. However, the maximum Jensen-Shannon divergence in all difference logos for Bach1 is below 0.01 bits.
We examine the motif differences between species-specific motifs for all 35 TFs. We see for 14 TFs that the set of ten species segregates into the two groups of seven higher primates and three from the Laurasiatheria clade as before in case of Bach1 (CEBPB, CTCF, EGR1, MafK, Max, NRSF, POU5F1, Rad21, SRF, TCF12, TEAD4, USF1, USF2, YY1). We find for the remaining 21 TFs that the motifs of the seven higher primates are more similar to each other compared to the motifs of dog, horse, and cow. With other words, in these cases the pairwise difference logos of dog, horse, and cow do not form a second cluster. The differences observed among species-specific motifs could partly result from missing binding sites in some species. Supplementary Table S6 shows for each species and for each TF the proportion of sequences which are available for the computation of species-specific motifs.

Species-specific motifs are typically highly similar
We study the statistical significance of differences between species-specific motifs as follows. We examine for each of the 35 TFs the similarity of species-specific motifs using a statistical test for each two species ((10 * 9)/2 = 45 species pairs). We calculate for each TF and for each two species the p-value for the null hypothesis that two species-specific motifs arise from the same distribution as described in Supplementary Section 2.2 resulting in 35 * 45 = 1575 pairwise comparisons. We count for each two species how often we reject the null hypothesis for a confidence level of α = 0.05. These counts range from 0 to 35, where 0 means that the species-specific motifs of two species show no significant differences for each transcription factor and 35 means that the speciesspecific motifs of two species show significant differences for each transcription factor. The binary nature of the results of statistical tests can lead to the issue that comparisons between three species are not transitive, i.e., if there are no significant differences between the species-specific motifs of species A and B and there are no significant differences between the species-specific motifs of species B and C it can happen that there are significant differences between the species-specific motifs of species A and C.
Supplementary Table S3 shows the results for each pair of species. We find that the seven primate-specific motifs are highly similar to each other and that the three species-specific motifs of cow, dog, and horse show greater differences compared to those of the seven primates. Using a significance level of 95%, we expect 5% of all 1575 pairwise differences to be significant by chance. Figure S1: Comparison of species-specific motifs for the TF Bach1. We depict a table of difference logos with one row and one column for each species-specific motif to emphasize the differences between species-specific motifs. Each difference logos depicts the motif differences position-wise with a stack of bases, which height is calculated by the Jensen-Shannon divergence of the position-specific base distributions. The overall similarity between species-specific motifs is calculated by the sum of Jensen-Shannon divergences of all motif positions and depicted by the background color of the difference logos from green (similar) to red (dissimilar). The table of difference logos indicates, that the ten species-specific Bach1 motifs primarily segregate into two clusters, where one cluster comprises seven primates and the other cluster comprises three non-primates (cow, dog, and horse).
We find for only 47 of 1575 pairwise comparisons that two species-specific motifs show significant differences. However, we find only 47 (3%) of the pairwise differences to be significant, stating that the observed differences are not greater than expected by chance.
Specifically, we find that these 47 cases apply only to 6 of the 35 TFs, namely Bach1, CEBPB, MafK, Max, SP1, and USF1 and typically only apply to comparisons between a primate species and one of the species dog, cow, and horse. We find no significant differences between the seven primates reflecting the close phylogenetic relationship and accordingly the high sequence similarity. Amongst the three species dog, cow, and horse we find for 1 of the 105 pairwise comparisons significant differences.
These results imply that the motifs estimated across species as presented in the previous section are typically not a mixture of species-specific motifs.

Intra-motif dependencies are highly similar for all species
We examine for each of the 35 TFs the distribution of species-specific MIs using MI profiles I S 1 and I S 2 as described in Methods 4 for each species S ∈ {hg19, panT ro, papHam, ponAbe, rheM ac, calJac, equCab, canF am, gorGor, bosT au}. Figure S2 shows two examples of species-specific MI profiles I S 1 and I S 2 for the two TFs CJUN and Nrf. All species-specific MI profiles are available in Additional File 2. We depict for both TFs (i) the sequence logo inferred by the PFM(2) from all species in the first row, (ii) the species-specific MI profiles inferred from the PFM(1) in the second row, and (iii) the species-specific MI profiles inferred from the PFM(2) in the third row. The species-specific MI profiles inferred from both models are highly similar to each other.
First, we study the species-specific MI profiles I S 1 . We find for each of the 35 TFs that the species-specific MI profiles I S 1 are highly similar for all species. We also find that the MIs in the MI profiles I 1 are sometimes stronger, sometimes weaker, and often averaged compared to the species-specific MIs I S 1 implying that the MI profiles I 1 inferred from all species are partly a result of interference of species-specific MIs. For example, in case of TF CJUN at motif positions w ∈ {2, 3, 4}, the MIs I 1 (w) are smaller than the MIs I S 1 (w) for all species except horse and marmoset and in case of TF Nrf at motif positions w ∈ {8, 9, 10, 11} the MIs I 1 (w) are higher than the MIs I S 1 (w) for all species. Specifically, we find the largest difference between two I S 1 for FOSL1 with 0.35 bits. However, the MI profiles I 1 inferred from all species are typically highly similar to the species-specific MI profiles I S 1 . Second, we examine species-specific MI profiles I S 2 . We find for each of the 35 TFs that the species-specific MI profiles I S 2 are highly similar for all species. We also find that the MIs in the MI profiles I 2 are sometimes stronger, sometimes weaker, and sometimes averaged compared to the species-specific MIs I S 2 implying that the MI profiles I 2 inferred from all species are partly a result of interference of species-specific MIs. For example, in case of CJUN the MI profile I 2 (w) is typically smaller than the MI profile I S 2 (w) for all species at all motif positions w and in case of Nrf at motif positions w ∈ {8 − 11} the MIs I 2 (w) are higher than the MIs I S 2 (w). Specifically, we find the largest difference between two I S 2 for FOSL1 with 0.48 bits. However, the MI profiles I 2 inferred from all species are typically highly similar to the species-specific MI profiles I S 2 as in case of I 1 and I S 1 .

PFMs are robust according to different species compositions
We compare the performance of the PFM(0), the PFM (1), and the PFM(2) described in Methods 2. We measure the classification performance as described in Methods 3 on datasets of 35 TFs described in Methods 1. For the evaluation of the robustness of our model, we vary the number of orthologous species the PFMs rely on. Specifically, we use four species (human, horse, dog, and cow), five species (human, chimp, horse, dog, and cow), seven species (human, chimp, baboon, rhesus, horse, dog, and cow), and ten species (human, chimp, baboon, orangutan, rhesus, marmoset, gorilla, horse, dog, and cow). See Supplementary Figures S3, S4, S5, and Figure 5 of the main manuscript for the results. We find for each species composition that the PFM(1) typically outperforms the PFM(0) and that the PFM(2) outperforms both the PFM(0) and the PFM(1) in all cases. This finding indicates that modeling higher order markov models within phylogenetic footprinting seem to be robust despite the species composition and that the PFM with higher markov orders yields a better fit to the data.      Figure S5: Classification performance of PFM(1) and PFM(2) on four species. We plot the mean and standard error of the relative improvement of the ROC AUC for the model PFM(1) and the model PFM(2) relative to the performance of the model PFM(0) for each of the 35 TFs on alignments of human, chimp, baboon, rhesus, horse, dog, and cow. Modeling base dependencies of order 1 typically increases the classification performance and modeling base dependencies of order 2 increases the classification performance in all cases. The performance of the PFM(2) is larger than the performance of the PFM(1) in all cases.

Taking into account phylogeny improves classification performance in almost all cases.
It has been shown that taking into account base dependencies improves one-species approaches neglecting phylogenetic dependencies and it has been shown that taking into account phylogenetic dependencies can improve one-species approaches neglecting base dependencies. In the manuscript we have shown that taking into account base dependencies improves phylogenetic footprinting. Unfortunately, it can not be concluded from these observations that a model taking into account base dependencies and phylogenetic dependencies outperforms a model taking into account base dependencies but neglecting phylogenetic dependencies, because phylogenetic dependencies may potentially impair the model taking into account base dependencies.
Here, we systematically study the impact of both higher order base dependencies and phylogenetic dependencies to classification performance. Therefore, we study the performances of four different models, namely i) a model taking into account neither base dependencies nor phylogenetic dependencies (human(0)), ii) a model taking into account base dependencies of order 2 and neglecting phylogenetic dependencies (human(2)), iii) a model neglecting base dependencies and taking into account phylogenetic dependencies PFM(0), and iv) a model taking into account both base dependencies and phylogenetic dependencies (PFM(2)) as described in Methods 2. The models PFM(0) and PFM(2) take into account phylogenetic dependencies and are inferred from the alignments described in Methods 1. The models human(0) and human(2) do not take not into account phylogenetic dependencies and are inferred from the human sequences of the alignments described in Methods 1. The models human(0) and human(2) are special cases of PFM(0) and PFM(2) incorporating only one species.
For case a), it has been shown that human(2) typically outperforms human(0), i.e., that modeling base dependencies improves classification performance. For case b), it has also been shown that PFM(0) typically outperforms human(0), i.e., modeling phylogenetic dependencies improves classification performance. For case c), we have shown that PFM(2) outperforms PFM(0), i.e., that taking into account higher order base dependencies improves phylogenetic footprinting. For case e) we assume that PFM(2) outperforms human(0) considering the cases a) and b). The cases d) and f) are unknown so far.
We measure the classification performance of all four models as described in Methods 3 on datasets of 35 TFs. Figure S6 shows the corresponding values for the four models human(0), PFM(0),human (2), and PFM (2) Figure S6.
It is not surprising that the model taking into account both base dependencies and phylogenetic dependencies outperforms the model ignoring base dependencies and phylogenetic dependencies (case e). We find that modeling base dependencies typically improves classification performance (cases a and c). Interestingly, we find that modeling base dependencies clearly outperforms modeling phylogenetic dependencies (case f). In fact, solely taking into account phylogenetic dependencies shows only a partial improvement (case b), but taking into account both phylogenetic dependencies and base dependencies shows a clear performance improvement compared to solely taking into account base dependencies (case d). These results suggest that phylogenetic footprint-ing approaches benefit from taking into account base dependencies and that approaches on single species already taking into account base dependencies benefit from taking into account phylogenetic dependencies.  Figure S6: Classification performance of the models human(0) and human(2) on human sequences and PFM(0) and PFM(2) on alignments of ten species for each of the 35 TFs. We show the mean and standard error of the ROC AUC. See Table S4 and  Figure S7: Classification performance of the two models PFM(0) and PFM(2) incorporating phylogenetic dependencies for each of the 35 TFs. We show the mean and standard error of the relative increase of ROC AUC of (top) PFM(0) relative to the classification performance of human(0) and (bottom) PFM(2) relative to the classification performance of human (2). Typically both models show a higher classification performance.  Figure S8: Classification performance of the two models human(2) and PFM(2) incorporating base dependencies of order two for each of the 35 TFs. We show the mean and standard error of the relative increase of ROC AUC of (top) human(2) relative to the classification performance of human(0) and (bottom) PFM(2) relative to the classification performance of PFM(0). Typically both models show a higher classification performance.

Comparison of a PMM(2) with a PFM(2) on 35 TFs
We compare the performance of the PFM of order 2 described in Methods 2 and the parsimonious motif model (PMM) of order 2 proposed in Eggeling et al and denoted as PMM(2) [1]. We measure the classification performance as described in Methods 3 on datasets of 35 TFs described in Methods 1 (the PMM uses only human sequences). See Supplementary Figure S9 for the results. We find that the mean performance of the PFM(2) is greater than the mean performance of the PMM(2) in 17 cases and we find that the mean performance of the PMM(2) is greater than the mean performance of the PFM(2) in 18 cases indicating that both parsimony and phylogeny can be beneficial depending on the TF of interest.

Data statistics
In the following we provide two statistics regarding the used datasets. In section 1.5.1 we study the distance between adjacent positive ChIP-Seq regions In section 1.5.2 we show the number of sequences used for the calculation of species-specific motifs.

Distances between adjacent ChIP-Seq positive regions
In the manuscript we describe in Methods 1 how the positive and negative datasets we use in our studies were obtained. Here, we study the distances between two adjacent ChIP-Seq positive regions. The distance D between two adjacent ChIP-Seq positive regions A and B, where A is located before B, is defined by the difference between the last base of region A and the first base of region B. Figure S10 shows the result. We find an median distance of 145414 bases and average distance of 971094 bases. We find that 99.7% of all distances are larger than 500, 94.2% are larger than 5000, and 90% are larger than 10000. Table S6 shows for each TF and for each species the number of binding site sequences that were incorporated into the calculation of the corresponding species-specific motif relative to the number of human binding site sequences.

Distance between adjacent peaks
log10 distance in bases 2 Supplementary Methods

Investigating species-specific motifs
We infer for each species and each TF an iMM(1) and iMM(2) as follows. First, we train a PFM(2) on the complete positive training dataset by expectation maximization ( [2]). We restart the expectation maximization algorithm 100 times with different initializations and choose the PFM(2) with the maximum likelihood.
Second, we use this model to calculate for each position n in each alignment X n the probability p(X n , M n = 1, n ) that a binding site occurs in alignment X n at position n . For each species o we store this probability and the corresponding sub-sequence of length W starting at position n . We omit all sub-sequences containing gaps.
Third, for each species we calculate an iMM(1) and an iMM (2) from the corresponding set of species-specific sub-sequences and their probabilities p(X n , M n = 1, n ).

Statistical test for the similarity of species-specific motifs
Given two sets of weighted sub-sequences BS 1 and BS 2 we perform the following steps to calculate a p-value for the null hypothesis that the two sets BS 1 and BS 2 are generated one iMM (2). In this manuscript, each set corresponds to a set of probability-weighted sub-sequences obtained from the second step in the previous subsection.
First, we infer an iMM(2) on each set of weighted sub-sequences BS 1 and BS 2 . We calculate the Jensen-Shannon divergence between both iMMs(2) denoted by JS original .
Second, we build two new sets of weighted sub-sequencesBS 1 andBS 1 by randomly assigning the sub-sequences from BS 1 and BS 2 , such an median distance ofBS 1 contains the same number of sub-sequences as BS 1 andBS 2 contains the same number of sub-sequences as BS 2 . The assignment of the weights to the sub-sequences in preserved. We infer an iMM(2) on each set of weighted sub-sequencesBS 1 andBS 2 . We compute the Jensen-Shannon divergence between both iMMs(2) denoted by JS random .
Third, we repeat the second step 1000 times and calculate the p-value as the number of cases where JS random < JS original divided by 1000. Tables   Table S1: Dataset statistics for human ChIP-Seq data and alignments. For each of the 35 TFs we specify the (i) name of the TF, the (ii) number of ChIP-Seq positive regions, the (iii) number of alignments for ten species after quality control, and the (iv) average alignment length. Nearly all ChIP-Seq positive regions are covered by alignments of ten species.