 Research article
 Open access
 Published:
An efficient algorithm for improving structurebased prediction of transcription factor binding sites
BMC Bioinformatics volume 18, Article number: 342 (2017)
Abstract
Background
Gene expression is regulated by transcription factors binding to specific target DNA sites. Understanding how and where transcription factors bind at genome scale represents an essential step toward our understanding of gene regulation networks. Previously we developed a structurebased method for prediction of transcription factor binding sites using an integrative energy function that combines a knowledgebased multibody potential and two atomic energy terms. While the method performs well, it is not computationally efficient due to the exponential increase in the number of binding sequences to be evaluated for longer binding sites. In this paper, we present an efficient pentamer algorithm by splitting DNA binding sequences into overlapping fragments along with a simplified integrative energy function for transcription factor binding site prediction.
Results
A DNA binding sequence is split into overlapping pentamers (5 base pairs) for calculating transcription factorpentamer interaction energy. To combine the results from overlapping pentamer scores, we developed two methods, KmerSum and PWM (Position Weight Matrix) stacking, for fulllength binding motif prediction. Our results show that both KmerSum and PWM stacking in the new pentamer approach along with a simplified integrative energy function improved transcription factor binding site prediction accuracy and dramatically reduced computation time, especially for longer binding sites.
Conclusion
Our new fragmentbased pentamer algorithm and simplified energy function improve both efficiency and accuracy. To our knowledge, this is the first fragmentbased method for structurebased transcription factor binding sites prediction.
Background
Transcription factors (TFs) interact with specific DNA sequences, called transcription factor binding sites (TFBSs), to regulate gene expression [1, 2]. Genomewide TFBS identification, a crucial step in deciphering transcription regulatory networks and annotating genomic sequences, remain a key challenge in postgenomics research. Both highthroughput experimental methods and computational approaches have been developed to tackle this problem. Each method has its unique advantages and limitations [3].
Computational methods include sequencebased and structurebased TFBS predictions. Structurebased prediction methods take advantage of the increasing numbers of TFDNA complex structures in Protein Data Bank (PDB) [4, 5]. Unlike sequencebased methods that rely on sequence conservation and usually are family based, structurebased TFBS prediction methods consider the physical interactions between a TF and candidate binding sequences (Fig. 1). The advantage of structurebased TFBS prediction methods lies in that they can explain the possible mechanisms involved in specific TFDNA binding and recognition, and help understand the effects of mutations on gene expression since these methods mimic in vivo binding and recognition events. A typical structurebased TFBS prediction method evaluates each candidate DNA sequence by “threading” it onto the DNA structure of a known TFDNA complex and the binding affinity or binding energy is then calculated using energy functions [3]. Virtually all structurebased methods for TFBS prediction require a TFDNA interaction model that can be experimentally solved proteinDNA complex structures [6,7,8] or high quality homologous TFDNA models as it has been demonstrated that transcription factors from the same family, in general, interact with DNA in a similar manner [9,10,11]. One issue in structurebased TFBS prediction concerns the potential divergence of DNA structures of different sequences for a transcription factor since only one TFDNA complex model is used for evaluating different sequences. For this method to work, TFDNA binding modes and DNA structures should be very similar. Our recent survey showed that the DNA structures of different cognate binding sequences of a transcription factor generally conserve well with smaller root mean squared deviations (RMSDs), suggesting that the interaction modes are conserved between the transcription factor and its specific binding sequences even though there are variations at certain positions of the binding motif [3].
The binding affinities between TFs and their binding sequences are evaluated using an energy function, which can be knowledgebased, physicsbased or a combination of both [12]. Knowledgebased energy functions are derived from statistical analysis of a set of known, nonredundant proteinDNA complexes. Their resolution varies from residuelevel [13,14,15,16] to atomlevel potentials [17,18,19,20]. Physicsbased energy functions, on the other hand, perform atomic level physicochemical calculations to quantify electrostatic interactions, van der Waals (VDW) forces, solvation energy, and others [4]. Using physicsbased energy can be computationally expensive and the method is sensitive to conformational changes. This is important since xray structures represent the majority of TFDNA complexes in PDB, which are snapshots of dynamic ensembles of many possible conformations. Knowledgebased potentials, on the other hand, are less sensitive to conformational changes because of their relatively coarselevel and meanforce nature, and are more computationally efficient. However these potentials are “averaged” values among different types of interactions and are less accurate for some amino acid types due to low count problem [12].
To take advantage of the unique features of both knowledgebased and physicsbased potentials for structurebased TFBS prediction, we recently developed an integrative energy (IE) function that consists of three terms, a residuelevel knowledgebased multibody (MB) potential, an explicit hydrogen bond (HB) energy, and an electrostatic potential for πinteraction energy. Studies have shown that both hydrogen bonds and ππ interactions play critical roles in specific proteinDNA binding [12]. Even though the multibody potential implicitly captures biophysical interactions including hydrogen bonds and πinteractions, the meanforce nature and the typical low count problem limit its ability to capture the key hydrogen bonds and πinteractions that contribute to TFDNA binding specificity [15]. The IE function improved TFBS prediction accuracy over MB as well as DDNA3, a knowledgebased atomiclevel proteinDNA interaction potential [12, 15, 18]. However, the algorithm cannot scale well for prediction of longer TF binding sites, especially for binding sites from TF dimers or tetramers. As shown in Fig. 1, our previous prediction algorithm first generates TFDNA complexes consisting of a TF and every possible permutation of its target sequence using 3DNA [21, 22]. The IE function is then applied to each TFDNA complex to calculate their binding energy and subsequently predict their binding sites (Fig. 1) [12]. The total number of TFDNA complex energy calculations is 4^{L}, where L is the length of the binding motif. For example, in our previous approach, we used a binding sequence of length 8, which requires evaluating a total of 65,536 TFDNA complexes. As the size of the binding sites increases, the time complexity increases exponentially.
Here we propose a new approach, called pentamer algorithm, which splits the DNA binding sequence into a series of overlapping subsequences/fragments of length 5 base pairs (bps), for more efficient and accurate TFBS prediction. Fragmentbased methods have shown their power in the field of protein structure prediction [23, 24]. Also in DNA shape studies, Rohs et al. have developed a DNA pentamer model for predicting DNA structural features [25,26,27,28]. In their model, DNA shape features of a nucleotide are predicted from a pentamer sequence that takes sequence context, two on each side, into consideration [28]. Even though we used the same term “pentamer” for DNA fragments in this work, the research problems are different. One uses a pentamer model or a 5bp sliding window to predict the shape features of the center nucleotide [25,26,27,28]. Our approach, on the other hand, calculates TFDNA pentamer binding energy and the fulllength binding sites are predicted by combining the fragment scores in postpreocessing. To the best of our knowledge, our method is the first attempt to use DNA fragments for structure or interactionbased TF binding site prediction. In addition to the pentamer algorithm, we modified our IE function to simplify the calculation of the hydrogen bond energy and πinteraction energy proposed in our previous method [12]. Results show that our new approach not only dramatically improve the prediction speed, it also helps improve prediction accuracy, especially for TF dimers with longer binding sites.
Methods
Modified integrative energy function
In our previous study, the IE function consisted of a multibody potential, a hydrogen bond term and a πinteraction term [12]. The knowledgebased multibody potential utilizes structural environment for accurate assessment of proteinDNA interactions as it uses DNA trinucleotides, called triplets, as an interaction unit to study interactions between TF and DNA [15]. The potential is distance dependent and the distance is calculated between an amino acid’s βcarbon and the geometric center of a nucleotide triplet. The position of a nucleotide is represented by the N1 atom in pyrimidines or the N9 atom in purines [15]. The hydrogen bond energy was calculated using FIRST, a third party program [12, 29], which makes the calculation less efficient. Since electrostatic interactions are involved in both hydrogen bond and πinteractions, in this study, we combine the hydrogen bond energy and πinteraction energy into one electrostatic energy term to reduce the complexity of energy calculation. The modified integrative energy function is shown in Eq. 1.
where E _{ IE } is the new, simplified integrative energy score, W _{ MB } and E _{ MB } are the weight and normalized energy score for the multibody potential respectively, and W _{ E } and E _{ E } are the weight and normalized score for the electrostatic energy respectively. Each energy term is normalized using the MinMax normalization method as we described in our previous study [12]. Since there are only a limited number of nonredundant TFDNA complexes with known TFBSs, not enough to have a separate training set for weight optimization, we used weights of 1 and 0.5 for W _{ MB } and W _{ E } respectively. The electrostatic term has smaller weight than W _{ MB } since electrostatic interactions are already implicitly captured in the multibody potential. The electrostatic potential is calculated using a variation of coulombs law (Eq. 2) where the partial charges of the atoms within interaction distance were determined using Marvinsketch, from Chemaxon (Additional file 1: Table S1) [30].
where E _{ ab } is the electrostatic energy between an atom a of an amino acid and an atom b of a DNA base, k _{ e } is Coulomb’s constant. N _{ A } is Avogadro’s number, q _{ a } and q _{ b } are the charges of the two atoms. ε is the dielectric constant and d is the distance between the point charges. The charges, q _{ a } and q _{ b }, are determined by multiplying the partial charge values with the charge of an electron (1.6 × 10^{−19} coulombs). The electrostatic potential of each atomic interaction is added together for the total electrostatic energy between the TF and a specific DNA sequence as shown in Eq. 3.
where E _{ E } is the total electrostatic energy between the TF and a DNA binding site, N _{ ab } is the number of amino acidbase interactions, E _{ ab } is the electrostatic energy between atom a of an amino acid and atom b of a base. The interaction distance d for atoms involved in hydrogen bond interaction was set at between 1.5 Å and 2.9 Å, a typical distance between the hydrogen atom and the hydrogen bond acceptor atom [21, 31,32,33]. We used REDUCE to add hydrogen atoms to the TFDNA complex structures [34]. The cutoff distance for atoms involved in a possible πinteraction between an aromatic amino acid and a base was 4.5 Å based on previous studies [35]. The sum of the charges found in the electron cloud of aromatic residues, were used as the charge for the electrostatic energy calculation to account for the delocalization of electrons in πsystems and their involvement in ππ interaction [12, 36,37,38].
We have demonstrated that the original IE function outperforms both residue and atomiclevel knowledgebased potentials in structurebased prediction of TF binding sites [12], therefore, in this study we only compared prediction accuracy of our new pentamer algorithm (with a simplified IE function) to the original IE function.
Pentamer algorithm
Generation of TFpentamer DNA complexes
The first step of the algorithm is to determine the binding sequence for a transcription factor. It can be based on prior knowledge or automatically detected using the TFDNA complex structure. For automatic detection, the TFDNA complex is checked for the first and the last base that are in contact with the TF using a distance cutoff of 5 Å between heavy atoms. Though the noninteracting flanking base pairs are less conserved, recent studies have shown that these flanking bases contribute to DNA binding specificity by affecting DNA shape and stability [27, 39,40,41,42,43]. Therefore we added two bases on each side of the binding sequence of length n, which resulted in an n + 4 DNA sequence for the initial system (Fig. 2). For example, a DNA binding sequence of 5 base pairs becomes a 9 bp sequence after adding two flanking base pairs on each side (Fig. 2). Energy minimization was first performed on the TFDNA complex using UCSF Chimera 1.8 with the following parameters: 100 steepest descent steps with a step size of 0.02, 100 conjugate gradient steps with a step size of 0.02, and an update interval of 10 as described in our previous study [12, 33]. The DNA sequence was then split into a series of overlapping 5 bp sequences by shifting one base pair at a time. The DNA sequence in each TFpentamer was mutated to every possible permutation using 3DNA [21, 22], which resulted 4^{5} or 1024 TFpentamer complex structures for each original TFpentamer. In total, there are n*1024 TFpentamer complex structures to be evaluated, where n is the number of pentamer fragments from the original DNA structure. The binding energy for each TFpentamer DNA complex was then calculated (Eq. 1).
Binding motif prediction
To predict the TF binding motif from these TFpentamer interaction energies, we developed two different methods, KmerSum algorithm and position weight matrix (PWM) stacking algorithm (Fig. 3). In the KmerSum algorithm, the IE score of a fulllength binding sequence is the sum of the interaction energy of overlapping pentamer sequences with the TF and the score of each permutation of the fulllength binding sequence is calculated accordingly (Fig. 3a). The statistically significant scores from the binding sequence IE score distribution of all the fulllength permutations were determined and their corresponding DNA sequences were used to generate a binding motif as described previously (Fig. 3a) [12]. In this study, the critical value for statistical significance in the KmerSum algorithm was 0.01 normalized by the length of the predicted motif. For the PWM stacking algorithm (Fig. 3b), the binding sequence was broken up into pentamer subsequences. The IE score of each permutation of each pentamer sequence was calculated. For a given pentamer representing 5 contiguous bases of the binding motif, a PWM representing the statistically significant pentamer sequences was calculated from the distribution of IE scores of all possible pentamer permutations (Fig. 3b). Each position (column) in a pentamer PWM represents a specific position (column) in the binding motif PWM. All of the corresponding cells representing the frequency of a particular nucleotide in a specific position were added together to generate a position frequency matrix (PFM) of the binding motif (Fig. 3b). The PFM was then converted to a PWM and a motif logo using the method described by Schneider and Stephens [44, 45].
The TFpentamer energy calculation and binding motif prediction were run on a cluster (a total of 708 computing cores) with dual Intel Xeon 2.93 GHz 6core processors–X5670 and 3GBs RAM per core. The CPU time was recorded and the speed was compared with the fulllength prediction method using IE, MB and DDNA3 energy fucntions.
Dataset
The new method was tested on two nonredundant sets: TF monomerDNA complexes and TF dimerDNA complexes [12]. The dataset consisted of high quality Xray crystal structures of TFDNA complexes (resolution <3 Å, and Rfactor ≤ 0.3) in PDB with corresponding JASPAR PWMs [46]. All TF monomer chains share no more than 35% sequence identity. The TF monomer dataset contains 27 nonredundant TF chainDNA complexes representing 12 transcription factor families: helix loop helix, zinc fingers, homeodomains, leucine zippers, STAT1, fork head, ETS family, high mobility group (HMG), NFAT, SMAD, P53 DNA binding domain, and runt domains. The dataset include: 1 AM9:A, 1 BC8:C, 1BF5:A, 1DSZ:A, 1GU4:A, 1H9D:A, 1JNM:A, 1LLM:C, 1NKP:A, 1NKP:B, 1NLW:A, 1OZJ:A, 1P7H:L, 1PUF:A, 1PUF:B, 2A07:F, 2 AC0:A, 2DRP:A, 2QL2:A, 2QL2:B, 2UZK:A, 2YPA:B, 3F27:A, 3HDD:A, 4F6M:A, 4HN5:A, 4IQR:A.
For dimer binding site prediction, a nonredundant set of eight TF dimerDNA complex structures was used: 1 AM9, 1GU4, 1JNM, 1NKP, 1NLW, 1OZJ, 2QL2, and 2YPA.
Performance evaluation
Due to the differences in length of predicted binding motifs and the heterogeneity of the TF domains that are involved in experimental determination of TFBSs, we calculated the Information Content weighted Pearson Correlation Coefficient (ICweighted PCC) values and used the ICweighted PCC values to determine the number of correctly predicted positions in the aligned PWMs between the predicted and reference motifs. ICweighted PCC is a PWM comparison method developed by Persikov and Singh to measure the similarity of the corresponding columns between the predicted and the reference PWMs of the same base positions in the binding motif [47]. The information content was calculated using Eq. 4:
where the IC(m) is the information content function for column m in a PWM, and B represents DNA base frequencies in that column. The ICweighted PCC was then calculated using Eq. 5:
where \( {PCC}_{m, n}^{IC} \) is the ICweighted PCC between the reference column m, and the predicted column n. m _{ b } and n _{ b } are the frequencies of the DNA bases b, found in the rows of the corresponding reference and predicted PWM columns respectively. \( \overline{m} \) and \( \overline{n} \) are the mean frequencies in the reference and predicted columns respectively. A predicted column is considered a correct prediction when the ICweighted PCC between the corresponding predicted and reference columns is at least 0.25 [47]. The advantage of using ICweighted PCC measure is that it takes into consideration both the conservation of a baseposition in the reference binding motif (information content) and how well it matches the predicted binding motif (Pearson’s correlation coefficient). The statistical analysis was carried out using Wilcoxon Signedrank test to compare the number of correctly predicted columns in a dataset.
Averaged KullbackLiebler (AKL) divergence was also used to quantitatively measure the similarity between the predicted and reference PWMs as shown in Eq. 6 [48, 49].
where D _{ AKL } is the AKL divergence, i represents the corresponding columns of the base positions being compared in the predicted and reference matrices. B represents the four bases A, C, G and T. P _{ iB } and Q _{ iB } represent the frequency of a particular base B in corresponding columns i in the predicted and reference matrices respectively.
Results
In this work, in addition to the new pentamer algorithm, we proposed a simplified IE function for more efficient prediction over the previously developed IE function that requires an external program to calculate hydrogen bond energy (See Methods) [12]. To check if the new simplified IE energy function has comparable performance to the original IE function, we compared their prediction accuracy using the same prediction algorithm (termed fulllength algorithm in this study). Wilcoxon Signedrank test showed that there is no significant difference between the new IE and the original IE function when tested on the nonredundant dataset of 27 TFDNA complex structures using the fulllength prediction algorithm [12]. The pvalues in terms of AKL divergence and the number of correctly predicted columns are 0.65 and 0.39 respectively for the null hypothesis: there is no difference between the two IE functions. For individual cases, one IE function may work better than the other but overall there is no apparent difference between the original IE function with explicit hydrogen bond and πinteraction energy terms and the new IE function even though the new IE function is much easier and less computationally intensive to calculate (Additional file 1: Figure S1).
The pentamer algorithms with the simplified IE function were tested on the multifamily nonredundant dataset of 27 TF monomerDNA and 8 TF dimerDNA complex structures. The performance of the new pentamer algorithm was compared with our previous fulllength algorithm in terms of both speed and prediction accuracy. Computing time results on the monomer and the dimer cases clearly show that the pentamer algorithm is much faster than the fulllength methods (Fig. 4 and Additional file 1: Table S2 for all the cases). In the TF monomerDNA cases, the new algorithm requires an average of 4.66 (KmerSum) and 4.61 (PWM stacking) CPU hours for IE energy calculation (columns 4 and 5 respectively in Additional file 1: Table S2) while the fulllength IE method needs an average of 162 CPU hours (column 6 in Additional file 1: Table S2), about 35 times faster. Even though the pentamer algorithm requires a postprocessing step to predict the binding sites by combining the pentamer energy in KmerSum or PWM stacking (Fig. 3), the time for postprocessing is minimal (data not shown). The DDNA3 and MB energy function in the fulllength method required less time than the IE energy but they still used several fold (>3.7 for DDNA3 and >11 for MB) more time than the pentamer IE method (Additional file 1: Table S2) and they are less accurate than the IE function in TF binding site prediction as we demonstrated previously [12].
When tested on the dataset of TF dimers with longer DNA binding sites, the improvement in computing time is even more significant. The pentamer algorithm performs calculations at a linear time complexity for subsequences of the binding site, making prediction of longer binding motifs much faster (Additional file 1: Table S3). For example, the binding site for three TF dimers (1 AM9:human SREBP1, 1GU4:human C/EBPβ, and 1OZJ:human Smad3MH1) has 12 base pairs. The fulllength algorithm needs to evaluate 4^{12} = 16,777,216 TFDNA interaction energies while the pentamer algorithm only needs to calculate binding energy for 8*4^{5} = 8192 TFpentamer complexes. Not surprisingly, the runtime (CPU hours) analysis showed that the pentamer algorithm, including TFpentamer energy calculation and the KmerSum postprocessing time, has about 644 (for 1GU4) to 1209 (for 1OZJ) fold improvement over the fulllength IE method (Fig. 4 and Additional file 1: Table S2). Another factor that affects the running time is the protein size (Additional file 1: Figure S2). It took longer for 1OZJ than that for 1GU4 since 1OZJ (288 amino acids) is about twice the size of 1GU4 (156 amino acids). Even though MB and DDNA3 energy calculations require less time than the IE energy, the pentamer algorithm still showed on average 143 times faster than fulllength MB and 47.5 times faster than fulllength DDNA3.
Not only does the pentamer algorithm run much faster due to the reduced total number of energy calculations, it also produced overall better results than the original fulllength method in terms of the number of correctly predicted columns (Figs. 5 and 6). For example, more columns are predicted correctly using either the KmerSum or the PWM stacking pentamer algorithm for 1GU4:A and 1P7H:L (human NFAT1), which are also reflected in the binding motifs (Fig. 5a and b). (See Additional file 1: Figure S3 for all 27 predicted TF monomer binding motifs). We performed statistical analysis using Wilcoxon Signedrank test to test the alternative hypothesis that the pentamer algorithm generated a greater number of correctly predicted base positions than the previous algorithm. The pvalues are 0.0028 and 0.0029 for the KmerSum and PWM stacking algorithms respectively, suggesting that increases in prediction accuracy are statistically significant for both the KmerSum and the PWM stacking methods over the fulllength method using the IE function.
In several cases, even though both the pentamer and the fulllength methods show similar results in terms of the number of correctly predicted columns, the results from the pentamer algorithm are actually better than the previous method because of the broad definition of correctly predicted columns. The cutoff for correctly predicted columns is set at 0.25 (ICweighted PCC value) as proposed by Persikov and Singh [47], meaning there is a large range, from 0.25 to 1, of ICweighted PCC values for the correctly predicted columns. A closer look at the distributions of the ICweighted PCC values revealed that even though the correctly predicted columns are comparable between the pentamer and the fulllength algorithms, including 2A07:F (human Foxp2), 1NLW:A (human MAD protein), and 2DRP:A (Drosophila melanogaster Cys2His2 zinc finger), the actual ICweighted PCC values from the pentamer predictions are better (closer to 1) than the ICweighted PCC values from the previous method (Fig. 5a and c). The distributions of the ICweighted PCC of all 27 cases shows a similar trend; more data points are close to the perfect ICweighted PCC score in the pentamer algorithms than the fulllength method (Fig. 5d and Additional file 1: Figure S4).
While the pentamer method significantly reduces the time for energy calculation for longer binding sites (Fig. 4 and Additional file 1: Table S2), more importantly, we found that it also improved the prediction accuracy significantly. Six of the eight dimer cases showed much improved TFBS predictions when compared to predictions using the fulllength method (Fig. 6ac and Additional file 1: Figure S5). Wilcoxon Signedrank tests were performed to test the alternative hypothesis that the pentamer algorithms have more correctly predicted columns, which showed the differences are statistically significant with pvalues of 0.0168 and 0.0165 for the KmerSum and PWM stacking pentamer algorithms respectively.
In addition to TF dimers with known TFDNA complex structures and corresponding JASPAR motifs, we also made predictions for Hox proteins Extradenticle (Exd) and Ultrabithorax (Ubx). Ubx and Exd form a dimer to regulate gene expression [50, 51]. Though both Ubx and Exd have annotated JASPAR PWMs separately, there are no JASPAR binding motifs for the UbxExd dimer. However, binding sites of UbxExd dimer have been reported in several studies [50, 52]. The predicted UbxExd dimer (PDB ID: 1B8I) binding sites are consistent with the published data (Fig. 6d). Furthermore, the Drosophila limbpromoting gene Distalless regulatory element, which is in part regulated by UbxExd interactions, also validates the pentamer prediction results [53, 54].
Discussion
We have previously developed a structurebased method using an IE function for improving TF binding site prediction and demonstrated that it increased prediction accuracy for different TF families as well as different TFs within the homeodomain family [12]. The method needed to evaluate n ^{L} TFDNA complex structures (where L is the length of the binding sequence). Since a fixed sequence length (8 bps) was used for TF binding site prediction for each single TF domainDNA complex, each prediction required energy calculations for all 65,536 (4^{8}) possible permutations of the sequence, which could be calculated in a reasonable time frame [12]. However, as the length of the binding motif increases, the number of energy calculations increases exponentially, making it impractical for longer TF binding site predictions even with the availability of large computer clusters. There are many instances where we need to evaluate longer binding sequences. For example, we need to consider flanking sequences for binding site prediction as it has been demonstrated that flanking bases contribute to binding specificity even though these flanking sequences are not conserved [42]. Secondly, some binding sites are regulated by dimers or tetramers of either the same transcription factor (homo) or different transcription factors (hetero), which are typically much longer than 8 base pairs. Also in homology model based TF binding site prediction, it would be ideal to consider multiple homology models to increase the conformational coverage, which demands more energy calculations.
We addressed the problem in this study by developing a simplified IE function and a fragmentbased pentamer algorithm, which improve both the speed and prediction accuracy, especially for longer binding sites (Figs. 4, 5 and 6 and Additional file 1: Table S2). The increase of prediction speed is not surprising since we only need to calculate energies of 1024 (4^{5}) TFpentamer complexes times the number of fragments (Fig. 2 and Additional file 1: Table S3). The overall improvement of accuracy may lie in the fact that the longrange interactions from the coarse multibody function can introduce noise to our previous fulllength algorithm. In the pentamer algorithm the noise level is reduced since it only considers a short sequence environment. Another factor that may affect the prediction accuracy is the weights for the two energy terms in the simplified IE function. As mentioned in the Methods section, training for an optimal set of weights is not possible due to the limited availability of cases. The weights, 1 for MB term and 0.5 for the electrostatic energy term, were assigned based on our previous study and the characteristics of the energy terms [12]. For example, knowledgebased multibody potential already implicitly captured the electrostatic interactions that are important for specific proteinDNA interactions, including hydrogen bonds and πinteractions. To investigate the effects of weights on the prediction performance, we compared the prediction accuracy of different weight ratios. Statistical analyses showed that the weight combination in our study was among a small number of combinations that produced better predictions and the weights for the electrostatic energy term W _{ E } are between 0.25 (4^{−1}) to 0.5 (2^{−1}) (Additional file 1: Figure S6).
As a note, while the pentamer algorithm reduces the time complexity by lowering the number of TFDNA complexes for energy calculation, calculating the final binding motif using the KmerSum algorithm requires additional computing time especially for longer binding sites as it requires calculation for all possible sequence permutations for the full length binding site (Fig. 3a). Nevertheless, the total time used by the pentamer algorithm is significantly less than the fulllength method with better prediction accuracy. For dimer structures and multidomain TFs that have longer spacing (> = 4 bps) between two monomer binding sites, it may be more efficient to calculate each of the binding sites individually and then combine them to form one binding motif. Of the two pentamer combination algorithms, PWM stacking is marginally faster as it does not need to calculate the individual sequence binding energy while the KmerSum algorithm has slightly better prediction accuracy. In addition, the KmerSum method actually predicts energy scores for each binding sequences while PWM stacking can only produce a binding sequence profile.
Conclusion
We developed a fragmentbased pentamer algorithm with a simplified energy function that greatly speeds up the TFBS prediction by reducing the number of energy calculations and improves TFBS prediction accuracy. Two algorithms, KmerSum and PWM stacking, were used to combine the TFpentamer scores for binding motif prediction with overall better performance in terms of TFBS prediction accuracy. Our results also show that the longer the binding sites, the more speedup and accuracy can be achieved. For future studies, we will test this new approach for binding site prediction using TFDNA homology models.
Abbreviations
 AKL:

Average KullbackLiebler
 Exd:

Extradenticle
 HB:

Hydrogen bond
 IC:

Information content
 IE:

Integrative energy
 MB:

Multibody
 PCC:

Pearson’s correlation coefficient
 PDB:

Protein Data Bank
 PFM:

Position frequency matrix
 PWM:

Position weight matrix
 RMSD:

Root mean squared deviations
 TF:

Transcription factor
 TFBS:

Transcription factor binding site
 Ubx:

Ultrabithorax
 VDW:

Van der Waals
References
Lemon B, Tjian R. Orchestrated response: a symphony of transcription factors for gene control. Genes Dev. 2000;14(20):2551–69.
Levine M, Tjian R. Transcription regulation and animal diversity. Nature. 2003;424(6945):147–51.
Guo JT, Lofgren S, Farrel A. Structurebased prediction of transcription factor binding sites. Tsinghua Sci Technol. 2014;19(6):568–77.
Liu LA, Bradley P. Atomistic modeling of proteinDNA interaction specificity: progress and applications. Curr Opin Struct Biol. 2012;22(4):397–405.
Berman HM, Bhat TN, Bourne PE, Feng ZK, Gilliland G, Weissig H, Westbrook J. The protein data Bank and the challenge of structural genomics. Nat Struct Biol. 2000;7:957–9.
Endres RG, Schulthess TC, Wingreen NS. Toward an atomistic model for predicting transcriptionfactor binding sites. Proteins. 2004;57(2):262–8.
Kono H, Sarai A. Structurebased prediction of DNA target sites by regulatory proteins. Proteins. 1999;35(1):114–31.
Morozov AV, Havranek JJ, Baker D, Siggia ED. ProteinDNA binding specificity predictions with structural models. Nucleic Acids Res. 2005;33(18):5781–98.
Garvie CW, Wolberger C. Recognition of specific DNA sequences. Mol Cell. 2001;8(5):937–46.
Kaplan T, Friedman N, Margalit H. Ab initio prediction of transcription factor targets using structural knowledge. PLoS Comput Biol. 2005;1(1):e1.
Siggers TW, Honig B. Structurebased prediction of C2H2 zincfinger binding specificity: sensitivity to docking geometry. Nucleic Acids Res. 2007;35(4):1085–97.
Farrel A, Murphy J, Guo JT. Structurebased prediction of transcription factor binding specificity using an integrative energy function. Bioinformatics. 2016;32(12):i306–13.
MandelGutfreund Y, Margalit H. Quantitative parameters for amino acidbase interaction: implications for prediction of proteinDNA binding sites. Nucleic Acids Res. 1998;26(10):2306–12.
Aloy P, Moont G, Gabb HA, Querol E, Aviles FX, Sternberg MJ. Modelling repressor proteins docking to DNA. Proteins. 1998;33(4):535–49.
Liu Z, Mao F, Guo JT, Yan B, Wang P, Qu Y, Xu Y. Quantitative evaluation of proteinDNA interactions using an optimized knowledgebased potential. Nucleic Acids Res. 2005;33(2):546–58.
Takeda T, Corona RI, Guo JT. A knowledgebased orientation potential for transcription factorDNA docking. Bioinformatics. 2013;29(3):322–30.
Donald JE, Chen WW, Shakhnovich EI. Energetics of proteinDNA interactions. Nucleic Acids Res. 2007;35(4):1039–47.
Zhang C, Liu S, Zhu Q, Zhou Y. A knowledgebased energy function for proteinligand, proteinprotein, and proteinDNA complexes. J Med Chem. 2005;48(7):2325–35.
Robertson TA, Varani G. An allatom, distancedependent scoring function for the prediction of proteinDNA interactions from structure. Proteins. 2007;66(2):359–74.
Xu B, Yang Y, Liang H, Zhou Y. An allatom knowledgebased energy function for proteinDNA threading, docking decoy discrimination, and prediction of transcriptionfactor binding profiles. Proteins. 2009;76(3):718–30.
Lu XJ, Olson WK. 3DNA: a software package for the analysis, rebuilding and visualization of threedimensional nucleic acid structures. Nucleic Acids Res. 2003;31(17):5108–21.
Lu XJ, Olson WK. 3DNA: a versatile, integrated software system for the analysis, rebuilding and visualization of threedimensional nucleicacid structures. Nat Protoc. 2008;3(7):1213–27.
Simons KT, Kooperberg C, Huang E, Baker D. Assembly of protein tertiary structures from fragments with similar local sequences using simulated annealing and Bayesian scoring functions. J Mol Biol. 1997;268(1):209–25.
Zhang Y. Templatebased modeling and free modeling by ITASSER in CASP7. Proteins. 2007;69(S8):108–17.
Chiu TP, Yang L, Zhou T, Main BJ, Parker SC, Nuzhdin SV, Tullius TD, Rohs R. GBshape: a genome browser database for DNA shape annotations. Nucleic Acids Res. 2015;43(Database issue):D103–9.
Yang L, Orenstein Y, Jolma A, Yin Y, Taipale J, Shamir R, Rohs R. Transcription factor familyspecific DNA shape readout revealed by quantitative specificity models. Mol Syst Biol. 2017;13(2):910.
Zhou T, Shen N, Yang L, Abe N, Horton J, Mann RS, Bussemaker HJ, Gordan R, Rohs R. Quantitative modeling of transcription factor binding specificities using DNA shape. Proc Natl Acad Sci U S A. 2015;112(15):4654–9.
Zhou T, Yang L, Lu Y, Dror I, Dantas Machado AC, Ghane T, Di Felice R, Rohs R. DNAshape: a method for the highthroughput prediction of DNA structural features on a genomic scale. Nucleic Acids Res. 2013;41(Web Server issue):W56–62.
Jacobs DJ, Rader AJ, Kuhn LA, Thorpe MF. Protein flexibility predictions using graph theory. Proteins. 2001;44(2):150–65.
ChemAxon [http://www.chemaxon.com]. Accessed July 2017.
Thorpe MF, Lei M, Rader AJ, Jacobs DJ, Kuhn LA. Protein flexibility and dynamics using constraint theory. J Mol Graph Model. 2001;19(1):60–9.
Dahiyat BI, Mayo SL. De novo protein design: fully automated sequence selection. Science. 1997;278(5335):82–7.
Pettersen EF, Goddard TD, Huang CC, Couch GS, Greenblatt DM, Meng EC, Ferrin TE. UCSF chimeraa visualization system for exploratory research and analysis. J Comput Chem. 2004;25(13):1605–12.
Word JM, Lovell SC, Richardson JS, Richardson DC. Asparagine and glutamine: using hydrogen atom contacts in the choice of sidechain amide orientation. J Mol Biol. 1999;285(4):1735–47.
Gallivan JP, Dougherty DA. Cationpi interactions in structural biology. Proc Natl Acad Sci U S A. 1999;96(17):9459–64.
Michael Gromiha M, Siebers JG, Selvaraj S, Kono H, Sarai A. Intermolecular and intramolecular readout mechanisms in proteinDNA recognition. J Mol Biol. 2004;337(2):285–94.
McGaughey GB, Gagne M, Rappe AK. Pistacking interactions. Alive and well in proteins. J Biol Chem. 1998;273(25):15458–63.
Wintjens R, Lievin J, Rooman M, Buisine E. Contribution of cationpi interactions to the stability of proteinDNA complexes. J Mol Biol. 2000;302(2):395–410.
SantaLucia J Jr, Allawi HT, Seneviratne PA. Improved nearestneighbor parameters for predicting DNA duplex stability. Biochemistry. 1996;35(11):3555–62.
Afek A, Schipper JL, Horton J, Gordan R, Lukatsky DB. ProteinDNA binding in the absence of specific basepair recognition. Proc Natl Acad Sci U S A. 2014;111(48):17140–5.
Barrera LA, Vedenko A, Kurland JV, Rogers JM, Gisselbrecht SS, Rossin EJ, Woodard J, Mariani L, Kock KH, Inukai S, et al. Survey of variation in human transcription factors reveals prevalent DNA binding changes. Science. 2016;351(6280):1450–4.
Gordan R, Shen N, Dror I, Zhou T, Horton J, Rohs R, Bulyk ML. Genomic regions flanking Ebox binding sites influence DNA binding specificity of bHLH transcription factors through DNA shape. Cell Rep. 2013;3(4):1093–104.
Slattery M, Zhou T, Yang L, Dantas Machado AC, Gordan R, Rohs R. Absence of a simple code: how transcription factors read the genome. Trends Biochem Sci. 2014;39(9):381–99.
Schneider TD, Stephens RM. Sequence logos: a new way to display consensus sequences. Nucleic Acids Res. 1990;18(20):6097–100.
Crooks GE, Hon G, Chandonia JM, Brenner SE. WebLogo: a sequence logo generator. Genome Res. 2004;14(6):1188–90.
Mathelier A, Fornes O, Arenillas DJ, Chen CY, Denay G, Lee J, Shi W, Shyr C, Tan G, WorsleyHunt R, et al. JASPAR 2016: a major expansion and update of the openaccess database of transcription factor binding profiles. Nucleic Acids Res. 2016;44(D1):D110–5.
Persikov AV, Singh M. De novo prediction of DNAbinding specificities for Cys2His2 zinc finger proteins. Nucleic Acids Res. 2014;42(1):97–108.
Wu TJ, Hsieh YC, Li LA. Statistical measures of DNA sequence dissimilarity under Markov chain models of base composition. Biometrics. 2001;57(2):441–8.
Xu M, Su Z. A novel alignmentfree method for comparing transcription factor binding site motifs. PLoS One. 2010;5(1):e8797.
Passner JM, Ryoo HD, Shen L, Mann RS, Aggarwal AK. Structure of a DNAbound UltrabithoraxExtradenticle homeodomain complex. Nature. 1999;397(6721):714–9.
Crocker J, Abe N, Rinaldi L, McGregor AP, Frankel N, Wang S, Alsawadi A, Valenti P, Plaza S, Payre F, et al. Low affinity binding site clusters confer hox specificity and regulatory robustness. Cell. 2015;160(1–2):191–203.
Foos N, MaurelZaffran C, Mate MJ, Vincentelli R, Hainaut M, Berenger H, Pradel J, Saurin AJ, OrtizLombardia M, Graba Y. A flexible extension of the drosophila ultrabithorax homeodomain defines a novel Hox/PBC interaction mode. Structure. 2015;23(2):270–9.
Gebelein B, Culi J, Ryoo HD, Zhang W, Mann RS. Specificity of Distalless repression and limb primordia development by abdominal Hox proteins. Dev Cell. 2002;3(4):487–98.
Merabet S, Saadaoui M, Sambrani N, Hudry B, Pradel J, Affolter M, Graba Y. A unique Extradenticle recruitment mode in the drosophila Hox protein Ultrabithorax. Proc Natl Acad Sci U S A. 2007;104(43):16946–51.
Acknowledgements
Not applicable.
Funding
This work was supported by the National Institutes of Health [R15GM110618 to J.G]; and National Science Foundation [DBI1356459 to J.G].
Availability of data and materials
The datasets used in this study are publically available from Protein Data Bank and JASPAR as cited in the paper.
Author information
Authors and Affiliations
Contributions
AF and JTG conceived the study and designed the experiment. AF carried out the experiment and performed data analysis. AF and JTG wrote the manuscript. All authors read and approved the final manuscript.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Additional file
Additional file 1:
Supplemental figures and tables. (PDF 3646 kb)
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Farrel, A., Guo, Jt. An efficient algorithm for improving structurebased prediction of transcription factor binding sites. BMC Bioinformatics 18, 342 (2017). https://doi.org/10.1186/s1285901717550
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s1285901717550