Quality versus accuracy: result of a reanalysis of protein-binding microarrays from the DREAM5 challenge by using BayesPI2 including dinucleotide interdependence
- Junbai Wang^{1}Email author
DOI: 10.1186/1471-2105-15-289
© Wang; licensee BioMed Central Ltd. 2014
Received: 15 April 2014
Accepted: 18 August 2014
Published: 27 August 2014
Abstract
Background
Computational modeling transcription factor (TF) sequence specificity is an important research topic in regulatory genomics. A systematic comparison of 26 algorithms to learn TF-DNA binding specificity in in vitro protein-binding microarray (PBM) data was published recently, but the quality of those examined PBMs was not evaluated completely.
Results
Here, new quality-control parameters such as principal component analysis (PCA) ellipse is proposed to assess the data quality for either single or paired PBMs. Additionally, a biophysical model of TF-DNA interactions including adjacent dinucleotide interdependence was implemented in a new program - BayesPI2, where sparse Bayesian learning and relevance vector machine are used to predict unknown model parameters. Then, 66 mouse TFs from the DREAM5 challenge were classified into two groups (i.e. good vs. bad) based on the paired PBM quality-control parameters. Subsequently, computational methods to model TF sequence specificity were evaluated between the two groups.
Conclusion
Results indicate that both the algorithm performance and the predicted TF-binding energy-level of a motif are significantly influenced by PBM data quality, where poor PBM data quality is linked to specific protein domains (e.g. C_{2}H_{2} DNA-binding domain). Especially, the new dinucleotide energy-dependent model (BayesPI2) offers great improvement in testing prediction accuracy over the simple energy-independent model, for at least 21% of analyzed the TFs.
Background
Recently, a comprehensive evaluation of 26 algorithms, for modeling transcription factor (TF) sequence specificity in in vitro protein-binding microarray (PBM) data [1], was published by DREAM5 (the Dialogue for Reverse Engineering Assessments and Methods) consortium. Many interesting results were revealed through this work. For example, mononucleotide position weight matrices (PWM) methods perform similarly to more advanced dinucleotide PWM algorithms for modeling TF sequence specificity, and inferred binding energy-level of a motif has little effect on overall prediction accuracy. This study also briefly mentioned that PBM data quality may have a strong influence on algorithm performance across 66 mouse TFs. However, the actual data quality of the examined PBMs in the DREAM5 challenge (i.e. 66 training PBMs and 66 testing PBMs for the mouse TFs) is not investigated systematically. Generally, the microarray experiment is known for containing many kinds of biases [2, 3] such as nonlinearity, saturation, and dynamic range problems for the signal intensity. In DREAM5 challenge, for a pair of training and testing PBM experiments, two different array designs were used for a mouse TF. However, 8-mers that were used to compute the 8-mer median intensities for every PBM are identical. This unique feature provides an opportunity to assess the PBM data quality [4]. For instance, if both training and testing PBM experiments in good data quality, then the observed 8-mer median intensities between the training and testing PBMs will have good agreement. On the contrary, if one of the PBMs yields poor data quality, then the 8-mer median intensities between two PBMs will not match well. Consequently, the testing prediction accuracy is not a true reflection of the algorithm performance if paired PBMs have poor measurement agreements. In other words, computational algorithms will not predict a binding signal that only exists in the testing PBM experiment but it does not appear in the training PBM data, and vice versa. Thus, it is important to develop PBM quality-control parameters that can evaluate the data quality for either single or paired PBMs.
Free-energy-based biophysical modeling TF sequence specificity, from detailed theoretical studies [5–7] to rapid computational development in real applications [8–11], have been investigated for many years and several computer programs are publically available now [11–14]. Recently, dependent energy correction such as dinucleotide interdependence was also incorporated into TF-binding energy by BEEML-PBM and FeatureREDUCE [1]. In the DREAM5 challenge, performance of the dinucleotide-dependent model of the two new programs is not improved greatly over the simple energy-independent model (i.e. <10% of examined TFs were benefited by the energy-dependent model; increase in correlation coefficient > 0.05 [1]). However, in many earlier studies, sequence dependencies in TF-binding sites were widely observed [15–18]. Particularly, energy-dependent model needs to fit a large number of unknown model parameters, which often encounters the over-fitting data problem that impairs the algorithm performance [19]. Additionally, if the input data is large, then there is a memory issue to R and MATLAB programs which suffer from extremely slow computation (i.e. BEEML-PBM and many other programs in the DREAM5 challenge [1]). Therefore, it is worthy to design a novel algorithm which implements the dependent energy correction in an efficient programming language. Then, PBMs of 66 mouse TFs from the DREAM5 challenge can be reanalyzed by the new program. It may help revealing whether the limitation of previous algorithms hampers the discovery of motifs that contain nucleotide dependency in the binding sites.
Motivated by the above-mentioned challenges, new quality-control parameters for both single and paired PBMs, and a novel C implementation of biophysical modeling protein-DNA interactions including dinucleotide interdependence (BayesPI2) are presented here. The new methods and programs were applied on 66 mouse TFs in in vitro PBM experiments from the DREAM5 challenge. Overall, four major questions are investigated in this work: 1) the true data quality of paired PBM experiments for 66 mouse TFs; 2) the association between the PBM data quality and the algorithm performance; 3) whether the binding energy-dependent model offers a great improvement over the simple energy-independent model in testing prediction accuracy; 4) whether the low binding energy-level of a motif is a real biological phenomenon or a bias due to the data quality and the algorithm limitation.
Results
PBM data quality of two mouse TFs
The new PBM quality-control parameters were first tested at TF_7 (Mix) and TF_63 (Zkscan5), since they are extreme cases in the previous algorithm performance comparison [1]. For example, in Figure two of the original publication, 66 mouse TFs were sorted in decreasing order by the mean final algorithm performance scores, TF_7 and TF_63 were ranked as the first and the last TF, respectively. It suggests that the majority of evaluated algorithms performed significantly better at TF_7 than at TF_63 in the original study. In other words, the PBM data quality of TF_7 may be much better than that of TF_63. Thus, a quality analysis of the above-mentioned two TFs may tell the usefulness of new PBM quality-control parameters.
Figure 1C and 1D illustrate the quality of paired PBMs for TF_7 and TF_63, by applying PCA ellipse on the scatter plot of Z-score transformed and log normalized 8-mer median intensities between the two PBMs. In the scatter plots, the lengths of both the major and minor axes of the PCA ellipse are quite different between TF_7 (i.e. 9.4 and 2.4; Figure 1C) and TF_63 (i.e. 8.3 and 5.0; Figure 1D). This is consistent with the previous hypothesis in single PBM experiment that the lengths of the major and minor axes reflect the dynamical range of PBM signal intensities, and the difference of 8-mer median intensities between paired PBMs, respectively. Put differently, if a paired PBMs has good data quality (i.e. TF_7), then a PCA ellipse with long major axis but short minor axis will be expected. For paired PBMs, correlation coefficients of normalized 8-mer median intensities are also quite different between the good-quality PBMs (i.e. 0.87; TF_7) and the bad-quality ones (i.e. 0.46; TF_63). Nevertheless, it is not an indicator of measurement agreement between the two PBMs, because correlation coefficients measure the strength of a relationship between the two variables, and data with obvious poor agreement can produce high correlations [20]. Therefore, the new PBM quality-control parameters not only provided a visual inspection of data quality for either single or paired PBMs, but also suggested that TF rank order of algorithm performance comparison [1] is associated with the data quality of both training and testing PBMs.
PBM data quality of 66 mouse TFs
Classifying 66 mouse TFs into two groups based on PBM data quality
So far, the results of analyzing 66 mouse TFs imply that methods for modeling TF sequence specificity are strongly affected by the PBM data quality. It is better to group 66 TFs into two clusters (i.e. good versus bad quality) by PBM quality-control parameters, then to reevaluate the algorithm performance (i.e. BayesPI2 energy-independent model versus the energy-dependent model including dinucleotide dependence). As a consequence, 66 mouse TFs from the DREAM5 challenge were assigned to two clusters by applying unsupervised fuzzy neural gas methods on the single PBM quality parameters (i.e. training PBM experiment). Comparing to the known TF rank order from original work [1], the best classifications were achieved by two parameters (i.e. correlation coefficient between normalized signal intensities and background intensities, and the length of the major PCA axis). That is consistent with the earlier observations in Figure 2A and 2C. The clustering result is an average of ten times classifications based on the above-mentioned two-quality parameters, where cluster one contains more good-quality PBM experiments than that in cluster two. For example, ~79% of 24 TF that grouped in the first cluster belong to the top half of ranked TFs (i.e. TF rank order from 1 to 33), and ~67% of 42 TF that assigned to the second cluster are in the bottom half of ranked TFs (i.e. TF rank order from 34 to 66). In summary, the quality of training PBM experiments influences the algorithm performance comparison. In other words, the algorithm performance on testing data may be predicted by the corresponding training data quality.
Prediction results of TFs with good PBM quality by using BayesPI2 energy-independent model and energy-dependent model including dinucleotide interactions
TF family | Rank | CorrCoef (Ind) | Length (Ind) | Number (Ind) | CorrCoef (Dep) | Length (Dep) | Number (Dep) | |
---|---|---|---|---|---|---|---|---|
TF_7 | bHLH | 1 | 0.78 | 13 | 1 | 0.786 | 10 | 1 |
TF_26 | bHLH | 2 | 0.66 | 10 | 1 | 0.67 | 13 | 1 |
TF_56 | C2H2 Z F(4) | 3 | 0.76 | 12 | 1 | 0.788 | 13 | 1 |
TF_55 | AT hook | 4 | 0.8 | 8 | 1 | 0.8 | 9 | 1 |
TF_17 | NR | 5 | 0.76 | 9 | 1 | 0.757 | 8 | 1 |
TF_11 | NR | 6 | 0.826 | 10 | 1 | 0.833 | 10 | 1 |
TF_16 | Myb/SANT | 7 | 0.808 | 10 | 1 | 0.817 | 9 | 1 |
TF_31 | C2H2 ZF (13) | 8 | 0.585 | 12 | 1 | 0.59 | 11 | 1 |
TF_15 | Pou + Homeo | 9 | 0.62 | 11 | 1 | 0.655 | 11 | 1 |
TF_45 | Myb/SANT | 11 | 0.8 | 12 | 1 | 0.78 | 11 | 1 |
TF_42* | Forkhead | 12 | 0.75 | 12 | 1 | 0.805 | 13 | 1 |
TF_64 | C2H2 ZF (3) | 13 | 0.75 | 8 | 1 | 0.75 | 10 | 1 |
TF_52 | NR | 14 | 0.807 | 12 | 1 | 0.79 | 10 | 1 |
TF_3* | Forkhead | 16 | 0.67 | 10 | 1 | 0.724 | 11 | 1 |
TF_27* | bZIP | 17 | 0.526 | 9 | 1 | 0.635 | 11 | 1 |
TF_18 | Sox | 18 | 0.638 | 8 | 1 | 0.677 | 8 | 1 |
TF_22* | T-box | 19 | 0.675 | 9 | 1 | 0.746 | 12 | 1 |
TF_47 | Homeo | 21 | 0.726 | 12 | 1 | 0.767 | 11 | 1 |
TF_44 | GATA | 22 | 0.633 | 12 | 1 | 0.68 | 10 | 1 |
TF_28 | C2H2 ZF (8) | 23 | 0.58 | 11 | 1 | 0.6 | 11 | 1 |
TF_13* | Pou + Homeo | 24 | 0.584 | 9 | 1 | 0.675 | 13 | 1 |
TF_5 | C2H2 ZF (3) | 25 | 0.59 | 10 | 1 | 0.618 | 11 | 1 |
TF_43 | Forkhead | 27 | 0.577 | 10 | 1 | 0.614 | 12 | 1 |
TF_19 | Sox | 29 | 0.559 | 8 | 1 | 0.59 | 11 | 1 |
TF_39 | C2H2 ZF (3) | 30 | 0.63 | 11 | 1 | 0.668 | 13 | 1 |
TF_51* | Pou + Homeo | 31 | 0.615 | 12 | 1 | 0.67 | 13 | 1 |
TF_23 | T-box | 33 | 0.64 | 12 | 1 | 0.66 | 13 | 1 |
TF_12* | NR | 35 | 0.55 | 13 | 1 | 0.606 | 13 | 1 |
TF_49 | NR | 34 | 0.675 | 10 | 1 | 0.676 | 11 | 1 |
TF_53* | RFX | 39 | 0.696 | 12 | 1 | 0.77 | 13 | 1 |
TF_14 | Myb/SANT | 40 | 0.6 | 8 | 1 | 0.62 | 9 | 1 |
TF_48 | NR | 43 | 0.755 | 12 | 1 | 0.78 | 12 | 1 |
TF_38 | DM | 45 | 0.67 | 9 | 1 | 0.689 | 8 | 1 |
TF_32 | C2H2 ZF (6) | 55 | 0.587 | 9 | 1 | 0.62 | 8 | 1 |
Verification of BayesPI2 energy-dependent model including dinucleotide interaction energies
Applying BayesPI2 on 66 mouse TFs (good quality versus bad quality)
Both BayesPI2 energy-independent model and dinucleotide energy-dependent model were applied on Z-score transformed and log-normalized probe intensities of 66 training PBMs [1]. The inferred TF-binding energy matrices were then used to evaluate prediction accuracies at 66 testing PBMs from the same DREAM5 challenge [1]. Results are shown in Table 1 and Additional file 1: Table S1 for TFs with good-quality PBMs and those with bad-quality PBMs, respectively. The two groups were classified earlier based on the paired PBMs quality parameters. For BayesPI2 energy-independent model, the median correlation coefficients between testing probe intensities and BayesPI2 predicted intensities are 0.67 and 0.45 in Table 1 and Additional file 1: Table S1, respectively. A two-tailed T-test of correlation coefficients between Table 1 and Additional file 1: Table S1 gives P-value 6.0958 × 10^{-12}, which suggests that algorithm testing prediction accuracy from Table 1 (i.e. 34 TFs with good agreement between training and testing PBMs) is significantly better than that from Additional file 1: Table S1 (i.e. 32 TFs with poor agreement between the two PBMs). Similar T-tests were carried out for algorithm performance scores (i.e. Pearson correlation coefficient of probe intensities) of 26 algorithms published by the DREAM5 challenge [1], more than 85% of algorithms show significant better performance at good PBM quality group than that at poor PBM quality group (i.e. 22 and 23 algorithms with T-test P-value <0.01 in Additional file 1: Tables S2 and S3, respectively). Nevertheless, four algorithms seem to have little effect by the quality of PBMs, where k-mer sequence-specific model plus feature selections were used (i.e. Team_k, Team_B, Team_I, and Seed-and-Wobble).
Notable, in Table 1, 34 TFs are spread to almost 14 different DNA-binding domains such as bHLH (2 TFs) and C_{2}H_{2} (7 Tfs). However, in Additional file 1: Table S1, 32 TFs just belong to 6 different DNA-binding domains where almost half of them (15 TFs) are C_{2}H_{2} DNA-binding domain (Zinc finger protein). It indicates that the poor quality of some PBM experiments (i.e. Additional file 1: Table S1) may be protein domain specific (i.e. Zinc finger protein in Additional file 1: Table S1). Additionally, if we only consider protein domains with more than two TFs from both Table 1 and Additional file 1: Table S1, then there are three protein domains (i.e. around 67%, 67%, and 60% of TFs from Forkhead, Pou + Homeo, and bZIP, respectively) that received a great boost in testing prediction accuracy (i.e. increase in correlation coefficients > 0.05) after using BayesPI2 binding energy-dependent model. In other words, these three protein domains may more frequently encounter base pair interdependency in the DNA-binding sites than that in the other domains.
Results from both Table 1 and Additional file 1: Table S1 reveal that the testing prediction accuracy based on energy-dependent model is very sensitive to the data quality, and the Bayesian method is robust against the data noise. For example, by using the energy-dependent model of BayesPI2 or BEEML-PBM [1], about 14 and 5 TFs show great improvement (increase in Pearson correlation coefficient >0.05) in the testing prediction accuracy, respectively, over that by the simple energy-independent model; for the same test, no improvement was found by FeatureREDUCE. Among the 5 TFs provided by BEEML-PBM, 3 belong to the good-quality PBM group (i.e. TF_27, TF_32, and TF_53; Table 1) where 2 TFs (TF_27 and TF_53) were identified by BayesPI2, and the remaining 2 TFs are in the bad-quality PBM group (i.e. TF_21 and TF_60; Additional file 1: Table S1) where only one TF (TF_21) was recovered by BayesPI2. Thus, by applying various algorithms on the PBMs, the overlap of predictions is poor for TFs with bad-quality data but robust to TFs with good-quality ones. In summary, the better the PBM data quality, the better the testing prediction accuracy, and both the PBM data quality and the TF-binding site interdependency may be protein domain specific. Especially, the good-quality PBM experiments generally benefit more from biophysical modeling of protein-DNA interactions including dinucleotide interactions, than the poor ones from the same computation.
Predicted TF-binding energy-level versus paired PBM data quality
To verify the present finding, the same analysis was performed again on a set of PWMs provided by the DREAM5 challenge [1], where the PWMs were predicted by various methods based on the same training PBM experiments for 66 mouse TFs. Methods that were used to obtain those PWMs include both biophysical free-energy models (i.e. BEEML-PBM [22], FeatureREDUCE, and MatrixREDUCE [11]) and other model types (i.e. PWM + HWMs from Team_E [1], and RankMotif [23]). First, each PWM was converted to TF-binding energy matrix. Then, scatter plots of log-transformed median binding energy-level of a PWM versus sorted paired PBM quality parameters were made (i.e. Figure 5E, F, G, and H). P-values of regression coefficients to the length of the major and minor axes of the PCA ellipse, and the correlation coefficients are P < 0.03, P < 0.045, and P < 0.03, respectively. It is clear that there is a strong positive correlation between the predicted binding energy-level of a motif and the PBM data quality. Overall, the results of published PWMs are consistent with the previous findings by using the BayesPI2 predicted binding energy matrices. It reinforces the hypothesis that methods for modeling TF sequence specificity are extremely sensitive to the PBM data quality. Specifically, the low energy-level (information content) of a predicted binding energy matrix may be caused by the poor PBM data quality (i.e. the poor agreement between training and testing PBMs).
Algorithm performance comparison and verification of predicted PBEM in ChIP-seq data
A comparison of algorithm performance between the BayesPI2 and the other methods was carried out, where the median Pearson correlation coefficients (i.e. correlation between the predicted probe intensities and the actual intensities) of 66 mouse TFs were computed for each algorithm (Additional file 1: Table S4). The correlation coefficients based on bayesPI2 and 26 other algorithms were obtained from this study (i.e. Table 1 and Additional file 1: Table S1) and the earlier publication [1] (i.e. Supplementary Table 3 of original paper), respectively. A scatter plot of the sorted median Pearson correlation coefficients for all algorithms is shown in Additional file 1: Figure S6 where the performance of BayesPI2 is close to the top 10 ranked algorithms from the original paper. However, it should be noted that the present comparison may not tell the true merit of each algorithm because of the poor data quality in PBM experiments (i.e. Additional file 2 versus supplementary Figure 4 of original paper). For computational cost, BayesPI2 takes ~7 min and ~30 min to predict one PBEM (i.e. using ~40000 PBM probe sequences and ~600 MB memory) at a Linux cluster machine by applying energy-independent model and energy-dependent models, respectively. However, for BEEML-PBM, the same computer could not complete the prediction of one PBEM (i.e. using the same input data) including dinucleotide interaction energies after running for almost 14 days with ~12 GB memory. Taken together, the new program BayesPI2 is an efficient and robust tool to analyze large data sets such as PBM.
Correlations between the estimated TF-binding affinities based on inferred PBEM from in vitro data and the measured TF tag counts from in vivo ChIP-seq experiment
ChIP-Seq | Called peaks | Rank | Paired PBMs agreement | R (top 500) | R (top 1000) | R (top 2000) | R (all peaks) |
---|---|---|---|---|---|---|---|
TF_31 (Zfx) | 10338 | 8 | Good | 0.38978 | 0.35943 | 0.33566 | 0.25502 |
TF_44 (Gata4) | 16979 | 22 | Good | 0.29675 | 0.28458 | 0.2645 | 0.15614 |
TF_23 (Tbx20) | 4012 | 33 | Good | 0.077781 | 0.062428 | 0.043291 | 0.043338 |
TF_25 (Tbx5) | 56352 | 41 | Bad | 0.062865 | 0.090766 | 0.073079 | 0.053016 |
TF_40 (Esrrb) | 21647 | 59 | Bad | 0.16334 | 0.12283 | 0.10351 | 0.064026 |
Discussion
In this work, new quality-control parameters (i.e. PCA ellipse) were developed to assess the quality of PBM. Both single and paired PBM data quality can be illustrated in a scatter plot, where predefined control limit (i.e. ${T}_{2,n,p}^{2}$) by PCA quality-control ellipse gives a direct assessment of measurement attribute. For example, the lengths of the major and minor axes of the PCA ellipse represent the dynamical range of PBM signal intensities, and the overall difference between paired PBMs, respectively. For single PBM, algorithm performance at the testing data may be predicted for ~70% of TFs based on the corresponding training data quality. For paired PBMs, a visual inspection of PCA quality-control ellipse on the scatter plot can not only identify data outliers but also tell the robustness of agreement between the two observations. On the contrary, correlation coefficient is easily affected by the data outliers (Additional file 1: Figure S7), which is not suited to measure agreement between paired observations [20]. Based on the estimated PBM quality information for 66 mouse TFs from the DREAM5 challenge, several interesting findings were revealed: for instance, both training data quality and paired PBMs agreement (Figures 2 and 3) are significantly correlated to the TF rank order according to Figure 2 of original publication [1], where 66 TFs were sorted in decreasing order by the mean of final algorithm performance scores. The results indicate that the decrease in algorithm performances across 66 mouse TFs in the DREAM5 challenge is mostly due to the gradual reduction of PBM data quality, especially the poor agreement between training and testing PBMs.
In addition to the new PBM quality-control parameters, a biophysical model of protein-DNA interactions including adjacent dinucleotide interdependence was newly implemented in C – BayesPI2, where sparse Bayesian learning approach was used to infer free-energy model parameters. The new energy-dependent model is able to recover known nucleotide interdependent effects on the binding affinities for Egr1 and Hnf4a, respectively. The results also reveal that dinucleotide interdependence often occurs at low binding energy (or information content) sites (Figure 4), which are significantly influenced by PBM data quality. Particularly, the new BayesPI2 dinucleotide energy-dependent model offers great improvement in testing prediction accuracy over the simple energy-independent model, for at least 21% of the analyzed mouse TFs (i.e. Table 1 and Additional file 1: Table S1). The new improvement might have resulted from more motif lengths searched by this study. Alternatively, the over-fitting data problem, which hampers regression-based free-energy model [1, 19] to estimate a large number of unknown model parameters (i.e. BEEML-PBM in R), is minimized by the Bayesian implementation of nonlinear parameter fitting.
Equipped with both the new quality-control parameters for paired PBMs and the new free-energy model including dinucleotide interdependence, 66 mouse TFs from the DREAM5 challenge were first classified into two groups (i.e. good-quality and bad-quality PBMs in Table 1 and Additional file 1: Table S1, respectively) based on the agreement between training and testing PBMs, then the algorithm accuracy on the test sets and the improvement by the energy-dependent model over the simple energy-independent model were compared between the two groups. Four new observations were revealed by this work: 1) the algorithm testing accuracy at good-quality PBMs is significantly better (P > 7 × 10^{-12}) than that at the bad-quality ones (Additional file 1: Tables S2 and S3); 2) the poor PBM data quality is protein domain specific because almost half of the bad-quality PBMs (Additional file 1: Table S1) belong to C_{2}H_{2} DNA-binding domain (Zinc finger protein); 3) the improvement in algorithm testing prediction accuracy by using the energy-dependent model over that by the simple energy-independent model is not only associated with the PBM data quality, but also linked to the specific protein domains (i.e. Forkhead, Pou + Homeo, and bZIP); 4) the predicted binding energy-level (or information content) of a motif is significantly correlated to the quality of paired PBMs (i.e. the better paired the PBM data quality, the higher the predicted binding energy-level; Figure 5).
From these four new discoveries, two (i.e. the better the PBM quality, the better the prediction accuracy; and the better the PBM quality, the higher the binding energy-level of a motif) were observed in both the BayesPI2 predictions and the original results from the DREAM5 challenge [1]. Particularly, the predicted PBEMs from the good-quality PBMs (i.e. TF_31 and TF_44) performed significantly better than those inferred by the poor PBMs (i.e. TF_25 and TF_40), on subsequent genomic analysis in in vivo data such as ChIP-seq. Nevertheless, there is a poor correlation between the inferred TF-binding affinities and measured ChIP-seq signals for TF_23 (TBX20), a putative good-quality PBM in Table 2 and Figure 6. This may be caused by the limitation of in vivo experiment. For example, ChIP-seq may not necessarily identify the direct TF-DNA interactions [24]), and TBX20 is known to directly interact with a number of proteins in regulation of gene expression [25]. The remaining two findings are supported by the literature evidences: 1) for protein domain specific PBM data quality, it is known [21] that binding of C_{2}H_{2} zinc finger proteins are often not well measured in PBM experiments because many C_{2}H_{2} proteins do not bind specific DNA sequences in PBM experiments; 2) for protein domain specific dinucleotide interdependence, two of the protein domains (i.e. Pou + Homeo and bZIP) were known to contain nucleotide dependence at the binding sites [21] (i.e. the homeodomain recognition helix is associated to base pair interdependency to DNA-binding, and many bZIP factors frequently bind to two distinct half-sites that may result in dinucleotide interaction). Therefore, both PBM data quality and computational modeling of protein-DNA interactions are influenced by the specific protein domains, and certain protein domains may require a free-energy model including dinucleotide interdependence to obtain precise binding energy matrix. It is important to note that the associations of both the PBM data quality and the dinucleotide energy-dependence with the protein structure classes are only revealed by this study, after applying the new PBM quality-control parameters and BayesPI2.
To minimize the effect of PBM data quality on downstream data analysis, DNA microarray experiment design from the previous works [26] might be introduced. Especially, triplet PBM experiments may be better than paired PBMs design to distinguish the experimental failure (i.e. a poor PBM data quality) from the biological failure (i.e. the TF does not bind to a DNA sequence). An alternative computational solution, to the PBM data quality issue, is to integrate k-mer sequence specific model plus feature selections into the PWM energy model (i.e. the top-ranked algorithm – FeatureREDUCE in the original study [1]). That is because k-mer sequence specific model (i.e. Teams K, B, I, and Seed-and-Wobble; uses the highest-affinity k-mer) does not consider the intensity values of PBMs when learning the motif. It is robust against the data noise that may provide a good initial seed motif for regression-based free-energy model to estimate a precise PBEM.
Conclusion
In conclusion, both the new PBM quality-control parameters and the new biophysical modeling of TF-DNA interactions including dinucleotide interdependence are developed during this work. By applying both methods on paired PBMs for 66 mouse TFs from the DREAM5 challenge, we found that: Bayesian method is robust against the data noise, and mononucleotide PWM methods do not perform similarly to more advanced dinucleotide PWM algorithms for modeling TF sequence specificity. For instance, the BayesPI2 energy-dependent model offers great improvement, for ~21% of the examined TFs, in the testing prediction accuracy over that by the simple energy-independent model. Especially, the PBM data quality not only impacts the algorithm performance, but also influences the inferred binding energy-level of a motif (e.g. the better the PBM data quality, the higher the inferred binding energy-level (information content)). This work may help tremendously for future research in developing computational methods and designing PBM experiments.
Methods
Principal component analysis – quality control ellipse
To check the agreement of two measurements by principal component analysis (PCA) [27], sample mean, sample variance, and the covariance between two observations are needed. Let X_{1} and X_{2} be two observations (i.e. the normalized 8-mer median intensities of a mouse TF) under experiments one (i.e. training PBM) and two (i.e. testing PBM), respectively. X = [X_{1}, X_{2}] are two n × 1 vectors where n is the number of observations in the experiments. The vectors of sample mean are $\overline{X}=\left[{\overline{X}}_{1},{\overline{X}}_{2}\right]$ and the sample covariance matrix is $S=\left[\begin{array}{cc}\hfill {S}_{11}\hfill & \hfill {S}_{12}\hfill \\ \hfill {S}_{21}\hfill & \hfill {S}_{22}\hfill \end{array}\right]$. By performing PCA on the covariance matrix S, principal component coefficients are obtained: for example, U^{'}SU = L where U and L are the eigenvectors (characteristic vectors; $U=\left[\begin{array}{cc}\hfill {U}_{11}\hfill & \hfill {U}_{12}\hfill \\ \hfill {U}_{21}\hfill & \hfill {U}_{22}\hfill \end{array}\right]$) and eigenvalues (characteristic roots; $L=\left[\begin{array}{cc}\hfill {L}_{1}\hfill & \hfill 0\hfill \\ \hfill 0\hfill & \hfill {L}_{2}\hfill \end{array}\right]$) of S, respectively. Then, the n variables X (i.e. the vector of X_{ 1 } and X_{ 2 }) are transformed to n uncorrelated principal components Z based on equation $Z={U}^{\text{'}}\left[X-\overline{X}\right]$. Subsequently, Z is scaled to y-score, $Y=\frac{Z}{\sqrt{L}}$ with unit variance. Finally, T-score to a pair of observations can be computed by T^{2} = Diag(Y^{'}Y) where Diag is the main diagonal of Y^{'}Y. It is an overall measurement of the conformance of an observation to its mean. Thus, the quantity T^{ 2 } gives a direct assessment of similarity between two measurements. Any observation vectors that produce values of T^{ 2 } greater than a predefined threshold ${T}_{2,n,p}^{2}$ will be out of control in a quality control (or PCA) ellipse. To compute the quality-control limit ${T}_{2,n,p}^{2}$, a probability value p needs to be defined. For example, in this work, the limit of quality-control ellipse is three times the standard deviation (~99.73%) from sample mean, then the probability value p is ~0.0027 and the ${T}_{2,n,p}^{2}$ is computed from F distribution [28] . To construct a unique quality-control ellipse in the two-dimensional case, the length of the major and minor axes, their orientation, and their interaction are needed. These information can be easily obtained from the length of the semi-major ($\sqrt{{L}_{1}{T}_{2,n,p}^{2}}$) and semi-minor ($\sqrt{{L}_{2}{T}_{2,n,p}^{2}}$) axes, the slope of the major (V_{ 21 }/V_{ 11 }) and minor (-V_{ 11 }/V_{ 21 }) axes where V = UL^{1/2}, and the major and minor axes of the ellipse intersect at the sample mean $\overline{X}$.
Quality control parameters for PBM
Two types of quality-control parameters are defined for PBM: one is single PBM data quality and the other is paired PBM data quality. For the former, correlation coefficient between normalized signal intensities and normalized background intensities, the length of the major and minor axes of PCA ellipse, and regression coefficient are considered. Both PCA ellipse and regression coefficient were based on an M versus A scatter plot, where $M=\frac{1}{2}\text{log}2\left(\mathit{\text{Signal}}*\mathit{\text{Background}}\right),\phantom{\rule{0.12em}{0ex}}A=\text{log}2\left(\raisebox{1ex}{$\mathit{\text{signal}}$}\!\left/ \!\raisebox{-1ex}{$\mathit{\text{Background}}$}\right.\right)$. For the second one, it includes the length of the major and minor axes of PCA ellipse, correlation coefficient of normalized 8-mer median intensities between a pair of training and testing PBM experiments, and regression coefficient. Here, both PCA ellipse and regression coefficient are obtained from a scatter plot of normalized 8-mer median intensities between a pair of PBMs. In the scatter plot, the orientation and the length of the major PCA ellipse axis is related to the first principal component, which represents the dynamical range of PBM signal intensities (i.e. with 99.73% limit of PCA quality control ellipse, ~99.73% of measured signals are included in the quality ellipse); the orientation and the length of the minor PCA ellipse axis is the second principal component, which shows disagreements between the two observations. Generally, if there is good agreement between two PBMs, then a narrow PCA ellipse is obtained [4], which means that both PBMs are of good quality. If the PCA ellipse is wide, then the observations of the two PBMs are in poor agreement, which indicates that data quality of one of the PBMs is in question. Subsequently, the algorithm testing prediction accuracy may not be reliable due to poor agreement between training and testing PBM experiments. Taken together, the length of the major axis of the PCA ellipse (i.e. the dynamical range of PBM signal intensities) is an essential quality-control parameter for both signal and paired PBMs.
Biophysical model of protein-DNA interaction including dinucleotide interdependence
where E_{i,a} is the interaction energy with nucleotide a ∈ (A, C, G, T) at position i = 1, 2, … L of the DNA sequence; S_{i,a} characterizes the sequence, S_{i,a} = 1 if i-th bases is a, otherwise S_{i,a} = 0; ${J}_{i,j}^{a,b}$ is a pair of dependent energy correction (i.e. a at position i, b at position j, a, b ∈ (A, C, G, T) ). In the calculation, only adjacent nucleotide interactions are considered in ${J}_{i,j}^{a,b}$, which reduces the number of pair-dependent energy corrections from 16L^{2} to 16(L - 1), L is the length of a motif. This simplified version of TF-DNA interaction including dinucleotide interdependence was used by an earlier paper [22] too.
Sparse Bayesian learning of model parameters
where w_{1} and b_{1} are the output layer parameters of neural networks, and μ, E_{i,a}, and ${J}_{i,j}^{a,b}$ represent the chemical potential, the TF-binding energy matrix and the dinucleotide-dependent energy correction, respectively. If α_{ i } is large, then the model parameters are close to zero, which are not important for the minimization function. Conversely, if the α_{ i } is small, then the corresponding model parameters are important for the data fitting. In this way, sparse Bayesian learning of model parameters can be realized. Following the new Bayesian parameter minimization framework, a set of new update functions (i.e. back-propagation neural networks, Bayesian evidence approximation with R-propagation algorithm) [12] were derived and implemented in BayesPI2, a new C program for inferring protein-DNA binding energy matrix.
Unsupervised classification of 66 mouse TFs by PBM quality-control parameters
An in-house made fuzzy neural gas algorithm, a combination of fuzzy logic and neural gas algorithm, was used to classify 66 TFs into two classes based on PBM quality-control parameters. The neural gas algorithm uses a similar “soft-max” adaptation rule as maximum-entropy clustering and self-organizing maps (SOMs) to summarize high-dimensional input space (i.e. TFs) to low-dimensional reference vector space (i.e. class prototype). Then, the TF is assigned to the nearest class prototype and the fuzzy membership estimates the confidence level of the classification. Though many other machine-learning methods can be utilized to perform the same task, the Fuzzy neural gas algorithm is capable of performing unsupervised learning and capturing nonlinear relationships between the features (i.e. quality-control parameters) and sample classes [31].
PBM experiments
PBM of mouse TFs Egr1 and Hnf4a were downloaded from UniProbe database [32]. For both TFs, one of the replicate PBM experiments was used as training set and the other was used as test set to verify the newly developed BayesPI2 including dinucleotide interdependence. Both training and testing PBM experiments of 66 mouse TFs were obtained from the DREAM5 challenge [1], where normalized raw probe intensities were used to evaluate algorithm testing performance and single PBM data quality, and normalized 8-mer median intensities were used to evaluate paired PBM data quality. For all calculations, the normalization of PBM data is based on Z-score transformation of log-normalized signal intensities.
ChIP-seq data
To evaluate the predicated PBEM on in vivo data, ChIP-seq data sets of five mouse TFs were obtained: Esrrb (TF_40; GEO accession GSM288355), Zfx (TF_31; GEO accession GSM288352), Tbx20 (TF_23; GEO accession GSM734426), Tbx5 (TF_25; GEO accession GSM558908), and Gata4 (TF_44; GEO accession GSM558904). For each in vivo data set, the predicted PBEM by BayesPI2 from in vitro PBM experiment was used to scan DNA sequences of all called peaks, where the TF-binding affinities were computed based on the middle 200 bases of each peak. The affinity-based analysis of DNA sequences is similar to MatrixREDUCE [33] but Fermi-Dirac form of TF-binding probability is used by BayesPI2. Then, a linear regression model is applied to test the dependence between the estimated TF-binding affinity by using in vitro PBEM and the observed TF tag counts from in vivo data. For example, the ChIP-seq tag counts represent response variables and the estimated TF-binding affinities are explanatory variables in a linear regression model. Finally, the T values and the correlation coefficients from the linear regression analysis are used to evaluate the significance of correlations between the estimated TF-binding affinities and the actual TF occupancy data.
Computer programs
Inferred PBEMs from PBM experiments for 66 mouse TFs by using BayesPI2, C version of BayesPI2 program, and MATLAB program of PCA quality control ellipse are publically available http://folk.uio.no/junbaiw/CBayesPI2
Declarations
Acknowledgments
JBW is supported by Oslo University Hospital, Norwegian Cancer Society (DNK 2192630-2012-33376 and DNK 2192630–2013) and Norwegian Research Council NOTUR project (nn4605k). Thank Kirill Batmanov for critical reading of the manuscript.
Authors’ Affiliations
References
- Weirauch MT, Cote A, Norel R, Annala M, Zhao Y, Riley TR, Saez-Rodriguez J, Cokelaer T, Vedenko A, Talukder S, Bussemaker HJ, Morris QD, Bulyk ML, Stolovitzky G, Hughes TR: Evaluation of methods for modeling transcription factor sequence specificity. Nat Biotechnol. 2013, 31 (2): 126-134. 10.1038/nbt.2486.View ArticlePubMed CentralPubMedGoogle Scholar
- Yang YH, Dudoit S, Luu P, Lin DM, Peng V, Ngai J, Speed TP: Normalization for cDNA microarray data: a robust composite method addressing single and multiple slide systematic variation. Nucleic Acids Res. 2002, 30 (4): e15-10.1093/nar/30.4.e15.View ArticlePubMed CentralPubMedGoogle Scholar
- Wang J: Computational biology of genome expression and regulation–a review of microarray bioinformatics. J Environ Pathol Toxicol Oncol. 2008, 27 (3): 157-179. 10.1615/JEnvironPatholToxicolOncol.v27.i3.10.View ArticlePubMedGoogle Scholar
- Wang J, Nygaard V, Smith-Sorensen B, Hovig E, Myklebost O: MArray: analysing single, replicated or reversed microarray experiments. Bioinformatics. 2002, 18 (8): 1139-1140. 10.1093/bioinformatics/18.8.1139.View ArticlePubMedGoogle Scholar
- Djordjevic M, Sengupta AM, Shraiman BI: A biophysical approach to transcription factor binding site discovery. Genome Res. 2003, 13 (11): 2381-2390. 10.1101/gr.1271603.View ArticlePubMed CentralPubMedGoogle Scholar
- Gerland U, Moroz JD, Hwa T: Physical constraints and functional characteristics of transcription factor-DNA interaction. Proc Natl Acad Sci U S A. 2002, 99 (19): 12015-12020. 10.1073/pnas.192693599.View ArticlePubMed CentralPubMedGoogle Scholar
- Berg OG, von Hippel PH: Selection of DNA binding sites by regulatory proteins. Statistical-mechanical theory and application to operators and promoters. J Mol Biol. 1987, 193 (4): 723-750. 10.1016/0022-2836(87)90354-8.View ArticlePubMedGoogle Scholar
- Granek JA, Clarke ND: Explicit equilibrium modeling of transcription-factor binding and gene regulation. Genome Biol. 2005, 6 (10): R87-10.1186/gb-2005-6-10-r87.View ArticlePubMed CentralPubMedGoogle Scholar
- Kinney JB, Murugan A, Callan CG, Cox EC: Using deep sequencing to characterize the biophysical mechanism of a transcriptional regulatory sequence. Proc Natl Acad Sci U S A. 2010, 107 (20): 9158-9163. 10.1073/pnas.1004290107.View ArticlePubMed CentralPubMedGoogle Scholar
- Roider HG, Kanhere A, Manke T, Vingron M: Predicting transcription factor affinities to DNA from a biophysical model. Bioinformatics. 2007, 23 (2): 134-141. 10.1093/bioinformatics/btl565.View ArticlePubMedGoogle Scholar
- Foat BC, Morozov AV, Bussemaker HJ: Statistical mechanical modeling of genome-wide transcription factor occupancy data by MatrixREDUCE. Bioinformatics. 2006, 22 (14): e141-e149. 10.1093/bioinformatics/btl223.View ArticlePubMedGoogle Scholar
- Wang J, Morigen : BayesPI - a new model to study protein-DNA interactions: a case study of condition-specific protein binding parameters for Yeast transcription factors. BMC Bioinformatics. 2009, 10: 345-10.1186/1471-2105-10-345.View ArticlePubMed CentralPubMedGoogle Scholar
- Zhao Y, Granas D, Stormo GD: Inferring binding energies from selected binding sites. PLoS Comput Biol. 2009, 5 (12): e1000590-10.1371/journal.pcbi.1000590.View ArticlePubMed CentralPubMedGoogle Scholar
- Sun W, Hu X, Lim MH, Ng CK, Choo SH, Castro DS, Drechsel D, Guillemot F, Kolatkar PR, Jauch R, Prabhakar S: TherMos: Estimating protein-DNA binding energies from in vivo binding profiles. Nucleic Acids Res. 2013, 41 (11): 5555-5568. 10.1093/nar/gkt250.View ArticlePubMed CentralPubMedGoogle Scholar
- Tomovic A, Oakeley EJ: Position dependencies in transcription factor binding sites. Bioinformatics. 2007, 23 (8): 933-941. 10.1093/bioinformatics/btm055.View ArticlePubMedGoogle Scholar
- Benos PV, Bulyk ML, Stormo GD: Additivity in protein-DNA interactions: how good an approximation is it?. Nucleic Acids Res. 2002, 30 (20): 4442-4451. 10.1093/nar/gkf578.View ArticlePubMed CentralPubMedGoogle Scholar
- Bulyk ML, Johnson PL, Church GM: Nucleotides of transcription factor binding sites exert interdependent effects on the binding affinities of transcription factors. Nucleic Acids Res. 2002, 30 (5): 1255-1261. 10.1093/nar/30.5.1255.View ArticlePubMed CentralPubMedGoogle Scholar
- Man TK, Stormo GD: Non-independence of Mnt repressor-operator interaction determined by a new quantitative multiple fluorescence relative affinity (QuMFRA) assay. Nucleic Acids Res. 2001, 29 (12): 2471-2478. 10.1093/nar/29.12.2471.View ArticlePubMed CentralPubMedGoogle Scholar
- Tipping M: Sparse Bayesian learning and the relevance vector machine. J Mach Learn Res. 2001, 1: 211-244.Google Scholar
- Bland JM, Altman DG: Statistical methods for assessing agreement between two methods of clinical measurement. Lancet. 1986, 1 (8476): 307-310.View ArticlePubMedGoogle Scholar
- Jolma A, Yan J, Whitington T, Toivonen J, Nitta KR, Rastas P, Morgunova E, Enge M, Taipale M, Wei G, Palin K, Vaquerizas JM, Vincentelli R, Luscombe NM, Hughes TR, Lemaire P, Ukkonen E, Kivioja T, Taipale J: DNA-binding specificities of human transcription factors. Cell. 2013, 152 (1–2): 327-339.View ArticlePubMedGoogle Scholar
- Zhao Y, Ruan S, Pandey M, Stormo GD: Improved models for transcription factor binding site identification using nonindependent interactions. Genetics. 2012, 191 (3): 781-790. 10.1534/genetics.112.138685.View ArticlePubMed CentralPubMedGoogle Scholar
- Chen X, Hughes TR, Morris Q: RankMotif++: a motif-search algorithm that accounts for relative ranks of K-mers in binding transcription factors. Bioinformatics. 2007, 23 (13): i72-i79. 10.1093/bioinformatics/btm224.View ArticlePubMedGoogle Scholar
- Park PJ: ChIP-seq: advantages and challenges of a maturing technology. Nat Rev Genet. 2009, 10 (10): 669-680. 10.1038/nrg2641.View ArticlePubMed CentralPubMedGoogle Scholar
- Stennard FA, Costa MW, Elliott DA, Rankin S, Haast SJ, Lai D, McDonald LP, Niederreither K, Dolle P, Bruneau BG, Zorn AM, Harvey RP: Cardiac T-box factor Tbx20 directly interacts with Nkx2-5, GATA4, and GATA5 in regulation of gene expression in the developing heart. Dev Biol. 2003, 262 (2): 206-224. 10.1016/S0012-1606(03)00385-3.View ArticlePubMedGoogle Scholar
- Yang YH, Speed T: Design issues for cDNA microarray experiments. Nat Rev Genet. 2002, 3 (8): 579-588.PubMedGoogle Scholar
- Jackson JE: A User's Guids to Principal Components. 1991, New York: John Wiley & Sons, Inc,View ArticleGoogle Scholar
- Jackson JE: Principal components and factor analysis: part I- principal components. J Qual Technol. 1980, 12 (4): 201-213.Google Scholar
- Wang J: The effect of prior assumptions over the weights in BayesPI with application to study protein-DNA interactions from ChIP-based high-throughput data. BMC Bioinformatics. 2010, 11: 412-10.1186/1471-2105-11-412.View ArticlePubMed CentralPubMedGoogle Scholar
- Stormo GD, Fields DS: Specificity, free energy and information content in protein-DNA interactions. Trends Biochem Sci. 1998, 23 (3): 109-113. 10.1016/S0968-0004(98)01187-6.View ArticlePubMedGoogle Scholar
- Wang J: A new framework for identifying combinatorial regulation of transcription factors: a case study of the yeast cell cycle. J Biomed Inform. 2007, 40 (6): 707-725. 10.1016/j.jbi.2007.02.003.View ArticlePubMedGoogle Scholar
- Newburger DE, Bulyk ML: UniPROBE: an online database of protein binding microarray data on protein-DNA interactions. Nucleic Acids Res. 2009, 37 (Database issue): D77-D82.View ArticlePubMed CentralPubMedGoogle Scholar
- Ward LD, Bussemaker HJ: Predicting functional transcription factor binding through alignment-free and affinity-based analysis of orthologous promoter sequences. Bioinformatics. 2008, 24 (13): i165-i171. 10.1093/bioinformatics/btn154.View ArticlePubMed CentralPubMedGoogle Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.