 Research Article
 Open access
 Published:
Combining phylogenetic footprinting with motif models incorporating intramotif dependencies
BMC Bioinformatics volumeÂ 18, ArticleÂ number:Â 141 (2017)
Abstract
Background
Transcriptional gene regulation is a fundamental process in nature, and the experimental and computational investigation of DNA binding motifs and their binding sites is a prerequisite for elucidating this process. Approaches for denovo motif discovery can be subdivided in phylogenetic footprinting that takes into account phylogenetic dependencies in aligned sequences of more than one species and nonphylogenetic approaches based on sequences from only one species that typically take into account intramotif dependencies. It has been shown that modeling (i) phylogenetic dependencies as well as (ii) intramotif dependencies separately improves denovo motif discovery, but there is no approach capable of modeling both (i) and (ii) simultaneously.
Results
Here, we present an approach for denovo motif discovery that combines phylogenetic footprinting with motif models capable of taking into account intramotif dependencies. We study the degree of intramotif dependencies inferred by this approach from ChIPseq data of 35 transcription factors. We find that significant intramotif dependencies of orders 1 and 2 are present in all 35 datasets and that intramotif dependencies of order 2 are typically stronger than those of order 1. We also find that the presented approach improves the classification performance of phylogenetic footprinting in all 35 datasets and that incorporating intramotif dependencies of order 2 yields a higher classification performance than incorporating such dependencies of only order 1.
Conclusion
Combining phylogenetic footprinting with motif models incorporating intramotif dependencies leads to an improved performance in the classification of transcription factor binding sites. This may advance our understanding of transcriptional gene regulation and its evolution.
Background
Gene regulation is an essential process in every living organism that controls the activity of gene expression and enables the concerted up and downregulation of gene products. Gene regulation involves a wide range of subprocesses such as transcriptional regulation including DNA methylation [1], histon modifications [2], and promotor escaping [3] as well as posttranscriptional regulation including modulated mRNA decay [4], siRNA interference [5, 6], and alternative splicing [7, 8]. One important process in gene regulation is the interaction of transcription factors (TFs) with their corresponding transcription factor binding sites (TFBSs) [9, 10]. The algorithmic discovery of TFBSs and the simultaneous inference of their motifs is known as denovo motif discovery and a challenging task in bioinformatics. Many different approaches exist for denovo motif discovery, which can be divided in two groups.
The first group comprises approaches based on sequences of only one species, which we refer to as onespecies approaches in this work, using statistical models for the binding of TFs to their TFBSs. One of the most popular motif models is the simple position weight matrix (PWM) model, which does not take into account any dependency between different positions of the same TFBS, but there are also more complex motif models that take into account intramotif dependencies. Irrespective of the wide variety of different motif models used, all of these approaches have in common that they do not take into account phylogenetic information available from orthologous sequences of phylogenetically related species.
Complex motif models that take into account intramotif dependencies have been shown to outperform simpler motif models like the PWM model [11â€“13]. Examples for highly popular tools that model intramotif dependencies are Dimont [14], MEMEChIP [15], DeepBind [16], and diChIPMunk [17].
In contrast, the second group of denovo motif discovery approaches known as phylogenetic footprinting incorporates orthologous sequences of at least two phylogenetically related species. The basic idea of these approaches is that TFBSs tend to be subject to negative selection during evolution, which can increase the recognition of TFBSs in the reference species. Phylogenetic motif models, which model the binding of TFs to their TFBSs and their evolution simultaneously, are based on evolutionary models such as the popular Felsenstein model [18]. Irrespective of the wide variety of different phylogenetic motif models used, all of these approaches have in common that they do not take into account intramotif dependencies.
Not all sequences from the reference species may have orthologous sequences in phylogenetically related species, and not all aligned sequences may comprise functional TFBSs at the same alignment positions [19]. Moreover, alignment errors, binding site turnovers, and spurious alignments from convergent evolution may affect the utility of phylogenetic footprinting. Nevertheless, phylogenetic footprinting has been shown to outperform onespecies approaches for many TFs and have become increasingly attractive due to next generation sequencing and the resulting avalanche of data [20â€“22].
Examples for highly popular phylogenetic footprinting tools that have been applied to eukaryotes and prokaryotes are FootPrinter [23], PhyME [24], MONKEY [25], MicroFootprinter [26], Phylogenetic Gibbs Sampler [27], PhyloGibbs [28], PhyloGibbsMP [29], or MotEvo [30].
In summary, onespecies approaches neglect phylogenetic information, whereas phylogenetic footprinting, which incorporates this information, neglects intramotif dependencies. The main objective of this work is to develop an approach that combines these two ideas and to investigate if taking into account intramotif dependencies can improve phylogenetic footprinting. Specifically, we propose a simple phylogenetic footprinting model (PFM) capable of taking into account both intramotif dependencies and phylogenetic information in Methods, and we study if modeling intramotif dependencies improves phylogenetic footprinting based on human ChIPSeq data of 35 TFs and more than 10^{5} multiple alignments of human ChIPseq positive regions and their orthologous sequences of 9 mammalian species ranging from chimp to cow in Results.
Methods
In this section we describe (i) the studied datasets, (ii) the used notation and the likelihood calculation of the PFM, (iii) the performance measure, (iv) the calculation of the mutual information, and (v) details regarding the estimation algorithm and implementation of the proposed model.
Data
We use freely available ChIPSeq data for 50 transcription factors from the ENCODE project [31, 32]. The ChIPseq experiments were performed by several production groups in the ENCODE Consortium and analysed by the ENCODE Analysis Working Group based on a uniform processing pipeline developed for the ENCODE Integrative Analysis effort [33]. We focus on datasets for the human H1hESC cell line. The uniform processing pipeline utilizes the SPP peak caller [34] and biological replicates (at least two per transcription factor) are analysed jointly with a Irreproducible Discovery Rate (IDR) score of at least 2%. The resulting ChIPseq regions of the Uniform TFBS track reference the hg19 assembly [35] and each comprise the chromosome, start position, end position, and an enrichment score. We exclude 15 datasets which yield repetitive motifs analog to [13] and hence retain datasets of 35 TFs.
For each TFs we select the top 20% of the available ChIPseq regions ranked by enrichment score. We denote these regions as ChIPseq positive regions and use them as basis for the positive dataset (Additional file 1: Table S1 and Additional file 1: Section 1.3). We denote the regions between ChIPseq positive regions on one chromosome as ChIPseq negative regions. For each TF we extract two regions of length 500 bp from each ChIPseq negative region centered at one third and two thirds, and use these as basis for the negative dataset. Hence, there are roughly twice as many negative regions than positive regions. We remove regions from the positive and the negative region sets that are shorter than 20 bp. For each region in the positive and negative region sets we extract the corresponding alignment consisting of 46 mammals using the freely available multiple genome alignment from UCSC [36].
We apply the following steps to each alignment. We remove alignment columns with gapsymbols or ambiguous symbols in the human sequence and concatenate the remaining alignment columns. We retain the 10 species with the best alignment coverage, namely Human (hg19), Chimp (panTro), Baboon (papHam), Orangutan (ponAbe), Rhesus (rheMac), Marmoset (calJac), Horse, (equCab), Dog (canFam), Gorilla (gorGor), and Cow (bosTau). We replace ambiguous symbols with gapsymbols. We remove all alignments which comprise no base symbols for 20% or more species. See Additional file 1: Table S1 for statistics on the number of ChIPSeq positive regions and the number of extracted alignments and see Additional file 1: Table S2 for details about the origin of the used ChIPSeq data and Additional file 2 contains all extracted alignments.
Phylogenetic footprinting model
Notation
Each dataset of each TF contains N alignments, with each alignment containing O sequences (one per observed species). Of course the number of alignments per TF, N, varies from TF to TF (See Additional file 1: Table S1). The nth alignment is denoted by X _{ n } and its length is denoted by L _{ n }. Each sequence of alignment X _{ n } is composed of L _{ n } symbols. We denote by \(X_{n}^{u,o}\) the uth symbol of the oth sequence of the nth alignment. All symbols belong to the set \({\mathcal {A}} = \{A,C,G,T,\}\) where A,C,G, and T denote the bases and âˆ’ denotes a gap in the alignment. Missing sequences in alignment n are represented by L _{ n } gap symbols.
An alignment X _{ n } may or may not contain a binding site. This is encoded in the variable M _{ n }, with M _{ n }=0 indicating that alignment X _{ n } does not contain a motif and M _{ n }=1 indicating that alignment X _{ n } does contain a motif. This model is known as ZOOPS (zero or one occurrence of a binding site per sequence) or NOOPS (noisy OOPS) model. Due to its simplicity and its modularity this model is widely used for denovo motif discovery [37â€“40].
Likelihood
The probability that the alignment X _{ n } is generated by our PFM can be written as
with variable M _{ n } taking a Bernoulli distribution and Î¸ denoting model parameters, namely (i) the topology of the phylogenetic tree, (ii) the substitution probabilities, and (iii) the evolutionary model with its stationary probabilities for the flanking regions as well as for the binding site regions.
We need to specify the probability for nonmotifbearing p(X _{ n }M _{ n }=0,Î¸) and for motifbearing alignments p(X _{ n }M _{ n }=1,Î¸). For reasons of clarity we omit Î¸ in the following.
Likelihood of a nonmotifbearing alignment
Since sequences are assumed to be conditionally independent, the probability of an alignment decomposes as the product of the probability of each of its sequences:
Now, the probability of each sequence follows a homogeneous Markov Chain of order C:
where p(u,k) stands for the (at most C) predecessors of the uth base for a sequence starting at position k, namely the set p(u,k)={v max(k,uâˆ’C)â‰¤v<u}, and
where \(\pi _{0}^{a,\zeta }\) is the parameter encoding the probability of a base a in the background sequence provided that its predecessors are in joint state Î¶.
Likelihood of a motifbearing alignment
We note W for the length of the motif. Since the motif can be present in different positions, the probability of a motifbearing assignment is a weighted sum over each possible motif position â„“ _{ n }:
We assume motifs to be uniformly distributed a priori, thus having that \(p(\ell _{n}M_{n}=1) = \frac {1}{L_{n}W+1}\). Again, conditional independence of sequences allows to express probability of an alignment as a product of the probability of its single sequences
And the probability of each single sequence breaks into three parts: (i) an initial nonmotif bearing part containing bases i(â„“ _{ n })={1,â€¦,â„“ _{ n }âˆ’1}, (ii) the motif, containing bases m(â„“ _{ n })={â„“ _{ n },â€¦,â„“ _{ n }+Wâˆ’1} and (iii) a final nonmotif bearing part formed by bases e(â„“ _{ n })={â„“ _{ n }+W,â€¦,L _{ n }} :
with the nonmotif bearing parts following a homogeneous Markov Chain of order C as described above and the motifbearing part following a nonhomogeneous Markov Chain defined as
with
where \(\pi _{w}^{a,\zeta }\) is a parameter that encodes the probability of a base a, at position w of the motif provided that its predecessors are in joint state Î¶.
Management of gaps
A sequence may have gaps introduced by the alignment algorithm. We compute the probability of a gap by summing over all possible nucleotides at that position in that sequence. For example to assess \(p\left (X_{n}^{u,o} =   X_{n}^{p(u,1),o}=\zeta,M_{n}=0\!\!\!\!\!\!\!\!{\phantom {\sum ^{0}_{0}}}\right)\), we use \(\sum _{a\in \{A,C,G,T\}}p\left (X_{n}^{u,o}= a  X_{n}^{p(u,1),o}=\zeta,M_{n}=0\!\!\!\!\!\!\!\!{\phantom {\sum ^{0}_{0}}}\right).\)
The used model estimation procedure and the freely available implementation are specified in Methods 5, and run times are exemplified in Additional file 1: Section 1.6.
Measuring classification performance
We evaluate all PFMs by a stratified repeated random subsampling validation by estimating all PFMs from a training set and measuring classification performance on a test set as follows.
In step 1, we generate two training sets and two disjoint test sets for each of the 35 transcription factors as follows. We randomly select 70% but maximal 1000 alignments from the set of alignments of a particular transcription factor as positive training set, and we choose the set of the remaining alignments but maximal 1000 as positive test set. We randomly select 70% but maximal 1000 alignments from the corresponding set of negative alignments of this transcription factor, and we choose the set of the remaining alignments but maximal 1000 as negative test set.
In step 2, we train a foreground model on the positive training set and a background model on the negative training set by expectation maximization [41] using a numerical optimization procedure in the maximization step. In all cases, we attempt to find a motif of length W=20 bp. It is known that the motifs of many TFs have a length smaller than W bp, but adding some possibly uninformative positions in case of short motifs is less harmful than not being able to take into account all motif positions in case of long motifs. We restart the expectation maximization algorithm, which is deterministic for a given dataset and a given initialization, 100 times with different initializations and choose the foreground model and the background model with the maximum likelihood on the positive training data and the negative training data, respectively, for classification. We use a likelihoodratio classifier of the two chosen foreground and background models, apply this classifier to the disjoint positive and negative test sets, and calculate the area under the receiver operating characteristics curve and the area under the precision recall curve as measures of classification performance.
We repeat both steps 25 times and determine (i) the mean area under the receiver operating characteristic curve and its standard error and (ii) the mean area under the precision recall curve and its standard error.
Relative increase of classification performance
We compute the relative increase or decrease of the classification performance of the PFM(1) and the PFM(2) relative to the PFM(0), where PFM(C) denotes a PFMs taking into account base dependencies of order C. We compute R _{ PFM(C)} as the ratio of the improvement of the PFM(C) relative to the PFM(0) divided by the maximum possible improvement to the PFM(0) as given by
Negative values of R _{ PFM(C)} denote a decrease of classification performance and positive values of R _{ PFM(C)} denote an increase of classification performance up to a maximum of R _{ PFM(C)}=1 which denotes perfect classification (provided that the AUC of PFM(0) is smaller than 1).
Mutual information
The mutual information (MI) is a standard measure for quantifying statistical dependencies. We compute the MI between a base at position w in a motif and its C preceding bases for w>C as follows
where X _{ w } denotes the base at position w and \(X_{w}^{C} =(X_{wC},\ldots,X_{w1})\) denotes the context of X _{ w }. I _{ C }(w) denotes the amount of information in the Cmer ending at position wâˆ’1 about its adjacent base at position w. I _{ C }(w) is undefined for wâ‰¤C.
We denote the vector of MIs values I _{ C }(w) for wâˆˆ{C+1,â€¦,W} by I _{ C }=(I _{ C }(C+1),â€¦,I _{ C }(W)), where W is the length of the motif, and we call this vector MI profile.
Implementation
We implement the proposed PFM based on the freely available Java Framework Jstacs [42]. Among others, Jstacs provides readytouse sequence models for reuse, numerical and nonnumerical optimization procedures for model estimation, serialization of models, and methods for the statistical evaluation of results. In contrast to existing tools which are typically focused on application, using Jstacs we are able to compare different PFMs in a detailed way by extracting mandatory information about the inferred models and the predicted binding sites.
Algorithm 1 shows the pseudocode for inferring a PFM from a set of alignments. The implementation of the proposed phylogenetic footprinting model is available at https://github.com/mgledi/PhyFoo/.
Results and discussion
We propose a model for phylogenetic footprinting that is capable of taking into account intramotif dependencies as specified in Methods 2. Specifically, we model intramotif dependencies in TFBSs as well as dependencies among adjacent bases in flanking sequences by Markov models of orders 0,1, and 2, and we denote the proposed PFM by PFM(0), PFM(1), and PFM(2).
In the first subsection we study if the proposed PFMs can capture intramotif dependencies of orders 1 and 2 in ChIPSeq data of 35 TFs. In the second subsection we study if modeling base dependencies can improve phylogenetic footprinting. Both studies are based on human sequences extracted from ENCODE ChIPseq data [33] and corresponding orthologous sequences of 9 mammalian species, yielding 35 data sets comprising 135196 multiple sequence alignments with an average length of 124 bases (Methods 1).
Intramotif dependencies can be captured by phylogenetic footprinting
In this subsection we study to which degree intramotif dependencies can be captured using the PFMs of orders 1 and 2.
We measure the degree of intramotif dependencies of order 1 between two neighboring bases or of order 2 between a dimer and its neighboring base by the MI as described in Methods 4. The MI quantifies the amount of information in a base or a dimer about the neighboring base in units of bits and ranges from 0 bits in case of statistical independence to 2 bits in case of deterministic dependency of the considered base on the preceding base or the preceding dimer. We compute the MI for every position of a binding site and call the resulting vector of MI values MI profile.
For each of the 35 TFs, we compute the two MI profiles of orders 1 and 2 from the motifs obtained by phylogenetic footprinting using the PFM(2). We present the resulting 35Ã—2 MI profiles as Additional file 3 and the 2Ã—2 MI profiles of the two TFs CJUN and Nrf as examples in Fig. 1 a.
First, we study the MI profiles of order 1 for these two TFs. For both TFs we find statistically significant intramotif dependencies between neighboring bases at all positions. For CJUN, intramotif dependencies of order 1 are particularly strong at motif positions 2 to 4, yielding a maximum MI of 0.52 bits at motif position 4. For Nrf, intramotif dependencies of order 1 are particularly strong at motif positions 8 to 11 and 14 to 15, yielding a maximum MI of 0.23 bits at motif position 11.
Next, we study the MI profiles of order 2. Again, we find statistically significant intramotif dependencies between dimers and their neighboring bases at all positions for both CJUN and Nrf. For CJUN, intramotif dependencies of order 2 are particularly strong at motif positions 2 to 4, yielding a maximum MI of 0.70 bits at motif position 3. For Nrf, intramotif dependencies of order 2 are particularly strong at motif positions 8 to 11 and 13 to 15, yielding a maximum MI of 0.28 bit at motif position 11.
Moreover, we find that intramotif dependencies of order 2 are significantly stronger than the corresponding intramotif dependencies of order 1 at several positions for both CJUN and Nrf. Comparing the MI profiles of orders 1 and 2, we find that the MI profile of order 2 is up to twofold higher than the MI profile of order 1 for CJUN and up to sevenfold higher for Nrf, stating that in both TFs there are significant intramotif dependencies of order 2 beyond those expected from the corresponding intramotif dependencies of order 1.
Next, we study the MI profiles of orders 1 and 2 for all 35 TFs. In order to condense the results and to allow a visual comparison of the results for both profiles and all 35 TFs, we show for each MI profile and each TF the maximum and mean MI values in Fig. 2 a.
We find that the average of the 35 maximum MI values of order 1 is 0.39 bits, whereas the average of the 35 maximum MI values of order 2 is significantly greater at 0.56 bits. Likewise, we find that the average of the 35 mean MI values of order 1 is 0.14 bits, whereas the average of the 35 mean MI values of order 2 is significantly greater at 0.23 bits. These observations suggest that intramotif dependencies are present in all of the studied TFs and that intramotif dependencies of order 2 are typically stronger than those of order 1.
By scrutinizing Figs. 2 a and b, however, we also find that the maximum and mean MIs values vary significantly from TF to TF. For example, we find a maximum and mean MI value of order 1 of 0.11 bits and 0.05 bits for CEBPB and a maximum and mean MI value of order 1 of 0.89 bits and 0.20 bits for Mxi. Analogously, we find a maximum and mean MI value of order 2 of 0.16 bits and 0.07 bits for CEBPB and a maximum and mean MI value of order 2 of 1.15 bits and 0.37 bits for Mxi.
To study the possibility that these captured intramotif dependencies are an artifact resulting from a mixture of different speciesspecific motifs, we finally study the similarity of the 10 speciesspecific motifs as well as the 20 speciesspecific MI profiles of orders 1 and 2. We find that the observed pairwise differences between the speciesspecific motifs are not significant (Additional file 1: Section 1.1.1). Moreover, we find that the speciesspecific MI profiles are similar to each other and to the corresponding MI profiles captured by phylogenetic footprinting (Additional file 4, Additional file 1: Section 1.1.2). Both findings indicate that the intramotif dependencies shown in Fig. 1 b and in Additional file 3 cannot be explained as an artifact resulting from a mixture of different speciesspecific motifs.
Modeling intramotif dependencies improves phylogenetic footprinting
In this subsection we study if modeling base dependencies can improve phylogenetic footprinting.
First, we compute the classification performance of the PFMs of orders 0,1, and 2 as described in Methods 3. Second, we determine the increase of the classification performance of the PFMs taking into account base dependencies of orders 1 and 2 relative to the classification performance of the PFM neglecting base dependencies as described in Methods 3. Here, positive values indicate an increase of classification performance, while negative values indicate a decrease of classification performance.
Figure 3 a shows the classification performances of the PFMs of orders 0,1, and 2 for each of the 35 TFs, and Fig. 3 b shows the corresponding relative increases. We find that modeling base dependencies of order 1 increases the classification performance in 31 of 35 cases, and we find that modeling base dependencies of order 2 increases the classification performance in all of the 35 cases. Moreover, we find that modeling base dependencies of order 2 always yields a higher classification performance than modeling base dependencies of order 1.
By scrutinizing Fig. 3 a, we find that the differences of the classification performances of the PFMs of orders 1 and 2 and the PFMs of order 0 vary significantly from TF to TF. For example, in case of base dependencies of order 1 we find the highest difference of 11% for CHD2 and the lowest difference of âˆ’1% for Rad21. In case of base dependencies of order 2 we find the highest difference of 13% for Rad21 and the lowest difference of 1% for RXRA.
By scrutinizing Fig. 3 b, we find that also the relative increases of classification performances vary significantly from TF to TF. For example, in case of base dependencies of order 1 we find the highest increase of 70% for JARIDA1A and the lowest increase of âˆ’7% for Rad21. In case of base dependencies of order 2 we find the highest increase of 78% for JARIDA1A and the lowest increase of 7% for RXRA.
Figure 4 summarizes the results by showing (a) the classification performance of the PFMs of orders 0,1, and 2 averaged over all 35 TFs and (b) the relative increases of classification performances averaged over all 35 TFs. We observe that the average classification performance increases significantly from order 0 to order 1 and from order 1 to order 2. Specifically, we find that the average classification performance of the PFM(1) is 4.6% higher than that of the PFM(0) and that the average classification performance of the PFM(2) is 3.5% higher than that of the PFM(1). We find that the average relative increase of the classification performance of the PFM(1) over that of the PFM(0) is 25% and that the average relative increase of the classification performance of the PFM(2) over that of the PFM(0) is 42%.
Next, we study the robustness of the proposed approach with respect to the number of species in the multiple sequence alignments. We perform the same study on the same 35 datasets with alignments comprising only subsets of the 10 species, and we find that for all subsets the classification performance increases significantly from order 0 to order 1 for many of the 35 TFs and from order 1 to order 2 for all of the 35 TFs (Additional file 1: Section 1.2).
These findings indicate that taking into account base dependencies improves phylogenetic footprinting, but they also indicate that this improvement is small. Given the fact that taking into account base dependencies improves onespecies approaches, too, it could well be that the improvement obtained by taking into account base dependencies in onespecies approaches is greater than in phylogenetic footprinting. Such a difference could result in the situation where the advantage of phylogenetic footprinting over onespecies approaches when neglecting base dependencies decreases or even turns into a disadvantage when taking into account base dependencies.
To study to which degree the small improvement of phylogenetic footprinting by taking into account base dependencies might be overshadowed by a possibly greater improvement of onespecies approaches, we compare the classification performances of the four cases of onespecies approaches and phylogenetic footprinting when neglecting and taking into account base dependencies (Additional file 1: Section 1.3). Consistent to previous studies, we find that phylogenetic footprinting yields a higher (lower) classification performance compared to onespecies approaches for 23 (12) of the 35 TFs when neglecting base dependencies. When taking into account base dependencies, however, phylogenetic footprinting yields a higher (lower) classification performance compared to onespecies approaches in 31 (4) of the 35 TFs.
This finding indicates that the small improvement of phylogenetic footprinting by taking into account base dependencies is greater than the corresponding improvement of onespecies approaches. It also indicates that the previously observed advantage of phylogenetic footprinting over onespecies approaches when neglecting base dependencies (23 to 12) does not decrease or turn into a disadvantage, but becomes even more pronounced (31 to 4), when taking into account base dependencies. This increased advantage of phylogenetic footprinting over onespecies approaches achieved by taking into account base dependencies is surprising as it indicates the presence of some synergy of modeling both phylogenetic and base dependencies.
We finally study for each of the 35 TFs which of the four models yields the highest classification performance, and we find that onespecies approaches neglecting base dependencies yields the highest classification performance for one TF (CEBPB), onespecies approaches taking into account base dependencies yields the highest classification performance for three TFs (BCL11A, MafK, and RXRA), phylogenetic footprinting neglecting base dependencies never yields the highest classification performance, and phylogenetic footprinting taking into account base dependencies yields the highest classification performance for 31 TFs. This finding indicates that phylogenetic footprinting can be improved by taking into account base dependencies, that onespecies approaches using base dependencies can be improved by taking into account phylogenetic dependencies, and that there is a surprising synergy of simultaneously modeling both phylogenetic and base dependencies.
Conclusions
In this work, we introduced a phylogenetic footprinting model capable of taking into account base dependencies and evaluated this phylogenetic footprinting model on ChIPseq data of 35 TFs. We found significant intramotif dependencies of orders 1 and 2 in all 35 datasets and that the inferred intramotif dependencies of order 2 are stronger than those of order 1 for all 35 TFs. We also found that these intramotif dependencies cannot be explained as an artifact resulting from a mixture of different speciesspecific motifs. We further found that the classification performance of the introduced phylogenetic footprinting model is higher than that of phylogenetic footprinting models neglecting base dependencies for all of the 35 TFs and higher than that of onespecies approaches for 31 of the 35 TFs. These findings suggest that combining phylogenetic footprinting with motif models incorporating intramotif dependencies may lead to an improved prediction of TFBSs and thus advance our understanding of transcriptional gene regulation and its evolution.
Abbreviations
 MI:

mutual information
 PFM:

phylogenetic footprinting model
 PWM:

position weight matrix
 TF:

transcription factor
 TFBS:

transcription factor binding site
References
Smith ZD, Meissner A. DNA methylation: roles in mammalian development. Nat Rev Genet. 2013; 14(3):204â€“20. doi:10.1038/nrg3354.
Tessarz P, Kouzarides T. Histone core modifications regulating nucleosome structure and dynamics. Nat Rev Mol Cell Biol. 2014; 15(11):703â€“8. doi:10.1038/nrm3890.
Sainsbury S, Bernecky C, Cramer P. Structural basis of transcription initiation by RNA polymerase II. Nat Rev Mol Cell Biol. 2015; 16(3):129â€“43. doi:10.1038/nrm3952.
Schoenberg DR, Maquat LE. Regulation of cytoplasmic mRNA decay. Nat Rev Genet. 2012; 13(4):246â€“59. doi:10.1038/nrg3160.
de Fougerolles A, Vornlocher HP, Maraganore J, Lieberman J. Interfering with disease: a progress report on sirnabased therapeutics. Nat Rev Drug Discov. 2007; 6(6):443â€“53.
Tam OH, Aravin AA, Stein P, Girard A, Murchison EP, Cheloufi S, Hodges E, Anger M, Sachidanandam R, Schultz RM, et al.Pseudogenederived small interfering rnas regulate gene expression in mouse oocytes. Nature. 2008; 453(7194):534â€“8.
Sultan M, Schulz MH, Richard H, Magen A, Klingenhoff A, Scherf M, Seifert M, Borodina T, Soldatov A, Parkhomchuk D, et al.A global view of gene activity and alternative splicing by deep sequencing of the human transcriptome. Science. 2008; 321(5891):956â€“60.
Luco RF, Pan Q, Tominaga K, Blencowe BJ, PereiraSmith OM, Misteli T. Regulation of alternative splicing by histone modifications. Science. 2010; 327(5968):996â€“1000.
Hobert O. Gene regulation by transcription factors and micrornas. Science. 2008; 319(5871):1785â€“6.
Voss TC, Hager GL. Dynamic regulation of transcriptional states by chromatin and transcription factors. Nat Rev Genet. 2014; 15(2):69â€“81.
Bulyk ML, Johnson PL, Church GM. Nucleotides of transcription factor binding sites exert interdependent effects on the binding affinities of transcription factors. Nucleic Acids Res. 2002; 30(5):1255â€“61.
Salama RA, Stekel DJ. Inclusion of neighboring base interdependencies substantially improves genomewide prokaryotic transcription factor binding site prediction. Nucleic Acids Res. 2010; 38(12):135â€“5.
Eggeling R, Roos T, MyllymÃ¤ki P, Grosse I. Inferring intramotif dependencies of dna binding sites from chipseq data. BMC Bioinforma. 2015; 16(1):375.
Grau J, Posch S, Grosse I, Keilwagen J. A general approach for discriminative de novo motif discovery from highthroughput data. Nucleic Acids Res. 2013; 41(21):e197. doi:10.1093/nar/gkt831.
Ma W, Noble WS, Bailey TL. Motifbased analysis of large nucleotide data sets using memechip. Nat Protoc. 2014; 9(6):1428â€“50.
Alipanahi B, Delong A, Weirauch MT, Frey BJ. Predicting the sequence specificities of DNA and RNAbinding proteins by deep learning. Nat Biotechnol. 2015; 33(8):831â€“8. doi:10.1038/nbt.3300.
Kulakovskiy I, Levitsky V, Oshchepkov D, Bryzgalov L, Vorontsov I, Makeev V. From binding motifs in chipseq data to improved models of transcription factor binding sites. J Bioinforma Comput Biol. 2013; 11(01):1340004.
Felsenstein J. Evolutionary trees from DNA sequences: a maximum likelihood approach. J Mol Evol. 1981; 17(6):368â€“76.
Schmidt D, Wilson MD, Ballester B, Schwalie PC, Brown GD, Marshall A, Kutter C, Watt S, MartinezJimenez CP, Mackay S, Talianidis I, Flicek P, Odom DT. Fivevertebrate ChIPseq reveals the evolutionary dynamics of transcription factor binding. Sci (New York, NY). 2010; 328(5981):1036â€“40. doi:10.1126/science.1186176.
Hardison RC, Taylor J. Genomic approaches towards finding cisregulatory modules in animals. Nat Rev Genet. 2012; 13(7):469â€“83.
Katara P, Grover A, Sharma V. Phylogenetic footprinting: a boost for microbial regulatory genomics. Protoplasma. 2012; 249(4):901â€“7.
MartinezMorales JR. Toward understanding the evolution of vertebrate gene regulatory networks: comparative genomics and epigenomic approaches. Brief Funct Genom. 2015. doi:10.1093/bfgp/elv032.
Blanchette M, Tompa M. Footprinter: a program designed for phylogenetic footprinting. Nucleic Acids Res. 2003; 31(13):3840â€“2.
Sinha S, Blanchette M, Tompa M. Phyme: a probabilistic algorithm for finding motifs in sets of orthologous sequences. BMC Bioinforma. 2004; 5(1):170.
Moses A, Chiang D, Pollard D, Iyer V, Eisen M. Monkey: identifying conserved transcriptionfactor binding sites in multiple alignments using a binding sitespecific evolutionary model. Genome Biol. 2004; 5(12):98. doi:10.1186/gb2004512r98.
Neph S, Tompa M. Microfootprinter: a tool for phylogenetic footprinting in prokaryotic genomes. Nucleic Acids Res. 2006; 34(suppl 2):366â€“8.
Newberg LA, Thompson WA, Conlan S, Smith TM, McCue LA, Lawrence CE. A phylogenetic gibbs sampler that yields centroid solutions for cisregulatory site prediction. Bioinformatics. 2007; 23(14):1718â€“27.
Siddharthan R, Siggia ED, Van Nimwegen E. Phylogibbs: a gibbs sampling motif finder that incorporates phylogeny. PLoS Comput Biol. 2005; 1(7):67.
Siddharthan R. Phylogibbsmp: module prediction and discriminative motiffinding by gibbs sampling. PLoS Comput Biol. 2008; 4(8):1000156.
Arnold P, Erb I, Pachkov M, Molina N, van Nimwegen E. Motevo: integrated bayesian probabilistic methods for inferring regulatory sites and motifs on multiple alignments of dna sequences. Bioinformatics. 2012; 28(4):487â€“94. doi:10.1093/bioinformatics/btr695.
ENCODE Project Consortium. An integrated encyclopedia of DNA elements in the human genome. Nature. 2012; 489(7414):57â€“74. doi:10.1038/nature11247.
UCSC. Genome Bioinformatics. 2016. http://hgdownload.cse.ucsc.edu/downloads.html. Accessed 29 Apr 2016.
Landt SG, Marinov GK, Kundaje A, Kheradpour P, Pauli F, Batzoglou S, Bernstein BE, Bickel P, Brown JB, Cayting P, et al.Chipseq guidelines and practices of the encode and modencode consortia. Genome Res. 2012; 22(9):1813â€“31.
Kharchenko PV, Tolstorukov MY, Park PJ. Design and analysis of ChIPseq experiments for DNAbinding proteins. Nat Biotech. 2008; 26(12):1351â€“9. doi:10.1038/nbt.1508.
ENCODE. Uniform TFBS composite track. http://hgdownload.cse.ucsc.edu/goldenpath/hg19/encodeDCC/wgEncodeAwgTfbsUniform/. Accessed 29 Apr 2016.
Multiple alignments of the hg19/GRCh37 human genome assembly. http://hgdownload.cse.ucsc.edu/goldenPath/hg19/multiz46way/. Accessed 29 Apr 2016.
Lawrence CE, Altschul SF, Boguski MS, Liu JS, Neuwald AF, Wootton JC. Detecting subtle sequence signals: a gibbs sampling strategy for multiple alignment. Science. 1993; 262(5131):208â€“14.
Redhead E, Bailey TL. Discriminative motif discovery in dna and protein sequences using the deme algorithm. BMC Bioinforma. 2007; 8(1):1.
Keilwagen J, Grau J, Paponov IA, Posch S, Strickert M, Grosse I. Denovo discovery of differentially abundant transcription factor binding sites including their positional preference. PLoS Comput Biol. 2011; 7(2):1001070.
Agostini F, Cirillo D, Ponti RD, Tartaglia GG. Seamote: a method for highthroughput motif discovery in nucleic acid sequences. BMC Genomics. 2014; 15(1):925.
Lawrence CE, Reilly AA. An expectation maximization (em) algorithm for the identification and characterization of common sites in unaligned biopolymer sequences. Proteins Struct Funct Bioinforma. 1990; 7(1):41â€“51.
Grau J, Keilwagen J, Gohr A, Haldemann B, Posch S, Grosse I. Jstacs: a java framework for statistical analysis and classification of biological sequences. J Mach Learn Res. 2012; 13(1):1967â€“71.
Acknowledgements
We thank Ralf Eggeling, Jan Grau, Patrice Peterson, and Marcel Quint for valuable discussions. We thank the HudsonAlpha Institute for Biotechnology, the Stanford University, the Broad Institute of MIT and Harvard, and the University of Southern California for performing the ChIPseq experiments and the ENCODE Analysis Working Group for providing the datasets.
Authorsâ€™ contributions
MN and IG developed the key idea. MN and JC developed the computational methods. MN and HT performed the studies. All authors wrote, read, and approved the final manuscript.
Funding
This work was financially supported by DFG (grant no. GR3526/1), Gencat (2014 SGR 118), and Collectiveware (TIN201566863C21R).
Availability of data and materials
The datasets used in this work are included within the article and its additional files. The implementation of the proposed phylogenetic footprinting model is available at https://github.com/mgledi/PhyFoo/.
Competing interests
The authors declare that they have no competing interests.
Consent for publication
Not applicable.
Ethics approval and consent to participate
Not applicable.
Author information
Authors and Affiliations
Corresponding author
Additional files
Additional file 1
Supplementary Material. This file is structured in three sections, presenting four additional studies, details about the implementation and some statistics regarding the datasets of all 35 TFs.
In Section 1, Supplementary Results, we first study differences among speciesâ€“specific motifs of 35 TFs. We then study the robustness of the proposed PFM to different species compositions on data of 35 TFs. Third, we examine the impact of base dependencies and phylogenetic dependencies on classification performance. In the fourth subsection, we compare the proposed PFM(2) with a state of the art tool by Eggeling et al. 2015 [13] on data of 35 TFs. In the fifth subsection, we show statistics of the distances between ChIPseq positive regions and the alignment coverage of ten species. Finally, we specify the runâ€“time of our freely available implementation of the proposed PFM.
In Section 2, Supplementary Methods, we specify details about the estimation of speciesâ€“specific motifs and we define a statistical test for the significance of differences among speciesâ€“specific motifs.
In Section 3, Supplementary Tables, we show statistics of the datasets of 35 TFs, summarize results regarding the significance of speciesâ€“specific motifs and the impact of base dependencies and phylogenetic dependencies, and show the alignment coverage of ten species for 35 TFs. (PDF 1034.24 kb)
Additional file 2
Sequence data. This archive contains data files of alignments of the ChIPseq positive regions and negative control regions for each of the 35 TFs in FASTA format. (ZIP 83763.2 kb)
Additional file 3
Sequence logos, MI profiles of order 1, MI profiles of order 2, and speciesspecific MI profiles of orders 1 and 2. The file contains for each of the 35 TFs the sequence logo inferred using the PFM(2) aligned with MI profiles of order 1, the MI profiles of order 2, and speciesspecific MI profiles of orders 1 and 2 for each of the 10 species. (PDF 2129.92 kb)
Additional file 4
Tables of difference logos. The file contains for each of the 35 TFs a 10Ã—10 table of difference logos for a pairwise visual comparison of speciesspecific motifs. (ZIP 26112 kb)
Additional file 5
Sequence logos of predicted binding sites. The file contains sequence logos and their reverse complements of predicted binding sites inferred using the PFM(0), the PFM(1), and the PFM(2) for each of the 35 TFs. (PDF 11776 kb)
Additional file 6
ROC curves. The pdf file comprises for each TF one plot that shows the 25 ROC curves and one plot that shows the 25 PR curves from the 25â€“fold stratified repeated random subsampling validation procedure described in Methods 3. (PDF 2611.2 kb)
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License(http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Nettling, M., Treutler, H., Cerquides, J. et al. Combining phylogenetic footprinting with motif models incorporating intramotif dependencies. BMC Bioinformatics 18, 141 (2017). https://doi.org/10.1186/s1285901714951
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s1285901714951