The effect of prior assumptions over the weights in BayesPI with application to study protein-DNA interactions from ChIP-based high-throughput data
- Junbai Wang^{1}
DOI: 10.1186/1471-2105-11-412
© Wang; licensee BioMed Central Ltd. 2010
Received: 5 December 2009
Accepted: 4 August 2010
Published: 4 August 2010
Abstract
Background
To further understand the implementation of the hyperparameter re-estimation technique in a Bayesian hierarchical model, we added two more prior assumptions over the weights in BayesPI, namely a Laplace prior and a Cauchy prior, by using the evidence approximation method. In addition, we divided the hyperparameters (regularization constants α of the model) into multiple distinct classes based on either the structure of the neural network or the properties of the weights.
Results
The newly implemented BayesPI was tested on both synthetic and real ChIP-based high-throughput datasets to identify the corresponding protein binding energy matrices. The results obtained were encouraging: 1) there was only a minor effect on the quality of predictions when the prior assumptions over the weights were altered in BayesPI (e.g. the prior probability distribution over the weights and the number of hyperparameter classes); 2) however, there was a significant impact on the computational speed when tuning the weight prior in the model: for example, BayesPI with a Laplace weight prior achieved the best performance with regard to both computational speed and prediction accuracy.
Conclusions
From this study, we learned that it is absolutely necessary to try different prior assumptions over the weights in a Bayesian hierarchical model in order to design an efficient learning algorithm, though the quality of the final results may not be associated with such changes. In future, the evidence approximation method can be an alternative to Monte Carlo methods for the computational implementation of Bayesian hierarchical models.
Background
In our previous study, we developed a Bayesian neural network model, BayesPI, to study protein-DNA interactions using ChIP-based high-throughput data [1]. In BayesPI, the model error function (data error) is interpreted as defining a likelihood function, and the model regularizer (a penalty term added to the error function) corresponds to a prior probability distribution over the weights; such a framework is considered a Bayesian hierarchical model. In addition to the common model parameters, BayesPI includes unknown hyperparameters (e.g. weight decay rate α and model noise level β) that need to be learned from the data. There are three possible ways to handle the model hyperparameters when using Bayesian neural networks to infer the model parameters: 1) using Markov chain Monte Carlo (MCMC) methods to simulate the probability distribution [2]; 2) integrating out the model hyperparameters analytically before applying a Gaussian approximation of the posterior distribution, and subsequently maximizing the true posterior over the model parameters, i.e. the maximum a posteriori (MAP) approach [3]; and 3) integrating out the model parameters first, and then maximizing the resulting evidence over the hyperparameters, i.e. the evidence approximation [4]. Descriptions of the first two implementations can be found in the earlier papers [2, 3]; in this study, we focus only on the last approach (the evidence approximation), which is the one implemented in BayesPI.
Three motivations inspired us to investigate the effect of prior assumptions over the weights (under the evidence approximation) in Bayesian neural networks used to study protein-DNA interactions from ChIP-based high-throughput data. 1) With regard to concerns raised by others: before the BayesPI paper was published, we received some criticism about the treatment of hyperparameters in Bayesian neural networks. For example, do alternative definitions of the hyperparameters according to the model parameters (e.g. dividing the hyperparameters α into several classes based on either the structure of the neural network or the properties of the model parameters) strongly influence the model inference? 2) With regard to our own interest: how strongly does the choice of prior distribution over the weights (e.g. Gaussian prior, Laplace prior or Cauchy prior) affect the outcome of Bayesian neural networks (e.g. prediction accuracy and computational time cost)? 3) With respect to a general survey of the application of Bayesian inference in ChIP-based experiments: we searched PubMed using the keywords "Bayesian, chip" or "Bayesian, ChIP-chip," and then downloaded the search results that had been recorded before May 28, 2010. From this search, we obtained 33 papers that contained the above-mentioned keywords, and we subsequently carried out a literature study of these 33 papers. To our surprise, only 14 of the 33 papers had applied Bayesian methods to issues related to motif discovery (e.g. DNA binding site identification) by using ChIP-based high-throughput data; the remaining 19 papers had applied Bayesian methods to data integration, clustering, network reconstruction, etc. A detailed examination of the 14 papers relevant to the study of protein-DNA interactions revealed that BayesPI used the evidence approximation to solve the posterior distribution in Bayesian inference, while 12 of the other papers utilized sampling methods (e.g. MCMC and Gibbs sampling) to simulate the posterior distribution of the Bayesian models (the approach of one paper could not be determined because of a lack of method description; detailed information on the 33 papers is available in [Additional file 1: Supplemental Data]). Though the present implementation (the evidence approximation) in BayesPI for handling hyperparameters has rarely been applied earlier, there are clear advantages to using it to solve data mining problems [5]. Thus, motivated by this last finding along with the earlier two inspirations, we decided to carry out a follow-up study on the effect of prior assumptions over the weights in BayesPI. Our study may pave the way for the future development of the evidence approximation in Bayesian inference as well as for the further application of Bayesian methods in bioinformatics research.
Results
Performance comparisons from simulated ChIP-chip datasets
Performance comparisons from real ChIP-chip datasets
Comparing motif similarity scores of nine yeast TFs from four different calculations.
TF Name (consensus sequence length) | Activated in stress conditions | BayesPI - Gaussian prior | BayesPI - Laplace prior | BayesPI - Cauchy prior | MatrixREDUCE |
---|---|---|---|---|---|
ACE2 (6) | No | 0.89 | 0.95 | 0.96 | 0.90 |
MSN2 (6) | Yes[17] | 0.76 | 0.93 | 0.79 | NA |
SWI4 (7) | No | 0.96 | 0.94 | 0.94 | 0.95 |
YAP1 (7) | Yes[18] | 0.93 | 0.92 | 0.92 | 0.93 |
INO4 (8) | No | 0.90 | 0.92 | 0.94 | 0.97 |
SKN7 (9) | Yes[19] | 0.86 | 0.87 | 0.86 | 0.82 |
FHL1 (10) | No | 0.95 | 0.95 | 0.93 | 0.88 |
ROX1 (12) | Yes[20] | 0.72 | 0.72 | 0.78 | 0.75 |
XBP1 (12) | Yes[21] | 0.76 | 0.77 | 0.76 | NA |
Performance comparisons from human ChIP-Seq datasets
Discussion
Nowadays, chromatin immunoprecipitation followed by massively parallel sequencing (ChIP-Seq) is widely used in molecular biology research, for example to investigate genome-wide protein-DNA interactions [7] and histone modifications [10]. It is possible that ChIP-Seq may replace ChIP-chip technology completely [11] in future, because ChIP-Seq produces data of higher quality and higher resolution than ChIP-chip and also avoids several pitfalls that accompany ChIP-chip technology, for example array probe-specific behavior and dye bias [12]. In this work, we studied the effect of prior assumptions over the weights in BayesPI when predicting protein binding energy matrices from ChIP-based high-throughput datasets. The results on both synthetic and real experimental datasets were consistent. In general, the prior assumptions over the weights and the division of the regularization constants (e.g. hyperparameters α) into several classes did not strongly affect the final outcome of BayesPI (e.g. Figures 1, 2, and 3) if sufficient training data were provided. In particular, a change in the number of classes of the regularization constants had a much weaker impact on the computational resource requirement than a change in the weight prior in BayesPI. Nevertheless, the selection of the prior over the weights had the most significant influence on the CPU hours used for calculation (e.g. by using a Laplace prior, the computational time was reduced by more than 50 percent compared with that of the old BayesPI [1], which uses a Gaussian prior). Thus, the present study reveals the importance of choosing a suitable weight prior for a Bayesian hierarchical model, which may dramatically speed up the calculation when the program is applied to a large dataset.
In addition to the above-mentioned finding that the computational efficiency of BayesPI is highly associated with the prior assumptions over the weights, we also provided a detailed illustration of the hyperparameter re-estimation technique by using the evidence approximation method. We presume that the evidence method (a deterministic algorithm) may become a popular approximate method for the computational implementation of Bayesian hierarchical models, as well as an alternative to the Monte Carlo methods that are currently widely used in bioinformatics research [13]. In particular, the evidence method can overcome some of the inherent limitations of sampling approaches, such as nonreproducible results, long burn-in periods, and unknown stopping times.
Conclusions
The present study has clarified several doubts about the early implementation of BayesPI: 1) the prediction accuracy of BayesPI is robust against dividing the hyperparameters (e.g. regularization constants α) into multiple distinct groups; 2) selecting alternative prior assumptions over the weights in BayesPI has only a minor effect on the quality of predictions; 3) however, the choice of weight prior has a strong impact on the computational requirement of the calculation. Overall, we have derived new re-estimation formulas for both the Laplace prior and the Cauchy prior over the weights in Bayesian neural networks, and the new implementations have been tested successfully on both synthetic and real ChIP-based high-throughput datasets.
Methods
Computational modeling of protein-DNA interactions in BayesPI
which can be used by the Bayesian neural networks [4] to determine the parameters (e.g. w, α, β). In the above-mentioned equation, E_{D}, E_{w}, D, and ⟨Λ, η, Γ⟩ are the model error function (data error), the model regularizer (a penalty term to the error function), the input data, and the hypothesis model space (e.g. Λ is the protein binding probability and Γ is the regularization function), respectively; α and β are the two unknown hyperparameters (e.g. weight decay rate and model noise level) that must be determined from the input data; and w indicates the model parameters (e.g. weights in the Bayesian neural networks), which represent the protein binding energy matrix and the chemical potentials inferred from ChIP-based high-throughput data [1].
Based on the above-mentioned three weight priors, we applied the evidence approximation method [4] to determine the corresponding re-estimation formulas for both α and β, which can be used by Bayesian neural networks to fit the model (e.g. to learn the model parameters w from the data).
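As a rough illustration of how the three weight priors differ, the sketch below writes each regularizer E_{w} in a common textbook form: a quadratic penalty for the Gaussian prior, an absolute-value penalty for the Laplace prior, and a logarithmic penalty for the Cauchy prior. These forms, the function names, and the `scale` parameter are assumptions made for illustration only; they are not taken from the BayesPI source code and may differ from the paper's numbered equations in constants or parameterization.

```python
import numpy as np


def gaussian_penalty(w):
    """Quadratic penalty, 0.5 * sum(w^2) (assumed textbook form)."""
    w = np.asarray(w, dtype=float)
    return float(0.5 * np.sum(w ** 2))


def laplace_penalty(w):
    """Absolute-value penalty, sum(|w|) (assumed textbook form)."""
    w = np.asarray(w, dtype=float)
    return float(np.sum(np.abs(w)))


def cauchy_penalty(w, scale=1.0):
    """Logarithmic penalty, sum(log(1 + (w / scale)^2)).

    This is the negative log of a Cauchy density up to a constant;
    the `scale` parameter is a hypothetical name used for illustration.
    """
    w = np.asarray(w, dtype=float)
    return float(np.sum(np.log1p((w / scale) ** 2)))


# Hypothetical weight vector containing one small, one moderate, one large weight.
example_weights = [0.1, 1.0, 10.0]
for name, penalty in [("Gaussian", gaussian_penalty),
                      ("Laplace", laplace_penalty),
                      ("Cauchy", cauchy_penalty)]:
    print(name, penalty(example_weights))
```

In this form, the Gaussian penalty grows quadratically with a large weight, the Laplace penalty linearly, and the Cauchy penalty only logarithmically, so the heavier-tailed priors tolerate a few large weights more readily.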
Bayesian choice of α and β through the evidence approximation
Evidence approximation
including both the model architecture and the regularizing parameters [4], where Z_{w}(α) and Z_{D}(β) are the normalization factors given by Z_{w}(α) = ∫ dw exp(-αE_{w}) and Z_{D}(β) = ∫ dD exp(-βE_{D}), respectively. By maximizing the log evidence of equation (10), we can determine the re-estimation formulas for hyperparameters α and β according to the weight assumptions E_{w} in BayesPI.
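The normalization factor Z_{w}(α) = ∫ dw exp(-αE_{w}) can be checked numerically in the simplest setting. The sketch below assumes a single weight and the textbook regularizers E_{w} = w²/2 (Gaussian) and E_{w} = |w| (Laplace), for which the integral has the closed forms sqrt(2π/α) and 2/α; the helper names and the integration grid are illustrative choices, not part of BayesPI.

```python
import numpy as np


def trapezoid(y, x):
    """Plain trapezoidal rule on a grid x."""
    return float(np.sum((y[1:] + y[:-1]) * np.diff(x)) / 2.0)


def z_w_numeric(energy_fn, alpha, half_width=50.0, n_points=200001):
    """Approximate Z_w(alpha) = integral of exp(-alpha * E_w(w)) for one weight.

    The integration range [-half_width, half_width] is an assumption that is
    wide enough for the rapidly decaying integrands used below.
    """
    w = np.linspace(-half_width, half_width, n_points)
    return trapezoid(np.exp(-alpha * energy_fn(w)), w)


alpha = 2.0
z_gauss = z_w_numeric(lambda w: 0.5 * w ** 2, alpha)   # closed form: sqrt(2*pi/alpha)
z_laplace = z_w_numeric(np.abs, alpha)                 # closed form: 2/alpha
print(z_gauss, np.sqrt(2.0 * np.pi / alpha))
print(z_laplace, 2.0 / alpha)
```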
Gaussian prior
where λ_{q} are the eigenvalues of β∇∇E_{D}, and negative λ_{q} are omitted from the sum. Thus, for a Gaussian weight prior, we used equation (21) to update the hyperparameters α and β through equations (19) and (20).
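A minimal sketch of such an update is given below. It assumes the standard evidence-framework formulas of MacKay [4] for a Gaussian prior, namely γ = Σ_q λ_q/(λ_q + α), α = γ/(2E_{w}) and β = (N - γ)/(2E_{D}), with N the number of data points; whether these match the paper's equations (19) to (21) term by term is an assumption. The function name and its arguments are hypothetical and do not come from the BayesPI source code.

```python
import numpy as np


def reestimate_gaussian_hyperparameters(hessian_data, alpha, beta, e_w, e_d, n_data):
    """One evidence-approximation update of (alpha, beta) for a Gaussian prior.

    hessian_data : symmetric matrix, the Hessian of the data error E_D
    alpha, beta  : current hyperparameter values
    e_w, e_d     : current values of the regularizer E_w and data error E_D
    n_data       : number of data points N
    """
    hessian_data = np.asarray(hessian_data, dtype=float)
    # Eigenvalues of beta * Hessian(E_D); negative ones are omitted from the sum.
    lam = np.linalg.eigvalsh(beta * hessian_data)
    lam = lam[lam > 0.0]
    # gamma: effective number of well-determined parameters.
    gamma = float(np.sum(lam / (lam + alpha)))
    alpha_new = gamma / (2.0 * e_w)
    beta_new = (n_data - gamma) / (2.0 * e_d)
    return gamma, alpha_new, beta_new


# Hypothetical toy values, chosen only to exercise the function.
gamma, alpha_new, beta_new = reestimate_gaussian_hyperparameters(
    [[4.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, -0.5]],
    alpha=1.0, beta=1.0, e_w=0.65, e_d=5.0, n_data=21)
print(gamma, alpha_new, beta_new)
```

In practice the update would be alternated with re-fitting the weights w until α and β stabilize.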
Laplace prior
for the hyperparameters, when assuming a Laplace prior over the weights.
Cauchy prior
where λ_{q} are the eigenvalues of the data error Hessian β∇∇E_{D}. Thus, for a Cauchy prior, equation (35) can be used to compute hyperparameters α and β through equations (33) and (34). Detailed derivations of the hyperparameter update functions for the above three priors are available in [Additional file 1: Supplemental Methods].
Application of R-propagation algorithm
By following the above-mentioned R-back-propagation procedures, $R(\frac{\partial E}{\partial {w}_{q}})$ can be estimated, which is equivalent to computing the second derivative ∇∇E_{w} [14]. A detailed description of the application of the R-propagation algorithm is available in [Additional file 1: Supplemental Methods]. The source code of BayesPI2 is publicly available at http://folk.uio.no/junbaiw/bayesPI2.
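The quantity computed by the R-operator of Pearlmutter [14] is the product of the Hessian with a vector v, defined as the derivative of the gradient ∇E(w + rv) with respect to r at r = 0. The sketch below illustrates that definition with a central finite difference; this is only a numerical stand-in, since the R-propagation algorithm obtains the same product exactly through a modified back-propagation pass. The function names, the toy quadratic error, and the step size `eps` are assumptions for illustration.

```python
import numpy as np


def hessian_vector_product(grad_fn, w, v, eps=1e-5):
    """Approximate H(w) @ v as d/dr grad_fn(w + r * v) at r = 0.

    Uses a central finite difference; this is NOT the exact R-operator,
    only a numerical illustration of the quantity it computes.
    """
    w = np.asarray(w, dtype=float)
    v = np.asarray(v, dtype=float)
    return (grad_fn(w + eps * v) - grad_fn(w - eps * v)) / (2.0 * eps)


# Toy error function E(w) = 0.5 * w^T A w, whose gradient is A w and Hessian is A.
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])


def grad_quadratic(w):
    return A @ w


hv = hessian_vector_product(grad_quadratic, [0.5, -0.2], [1.0, -1.0])
print(hv)  # should be close to A @ [1, -1]
```

The appeal of a Hessian-vector product is that the full Hessian never has to be formed or stored, which matters when the number of weights is large.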
Multiple regularization constants α
For simplicity, we assumed that there is only one class of weights in the original BayesPI [1]. For example, the weights are modeled as coming from a single Gaussian prior (e.g. equation (3)). However, in a real study, weights may fall into multiple distinct groups [4]. Therefore, it is desirable to divide the weights into several classes c, with independent regularization constants α_{c}. In the new version of BayesPI, there are five types of assignment of the weight decay rate α for each of the three weight priors (e.g. Gaussian, Laplace, and Cauchy). The term αE_{w} in equation (1) is replaced by $\sum _{c}{\alpha}_{c}{E}_{w}^{c}$, in which c is the number of classes of the regularization constants α: 1) if c equals 1, then all the weights have the same regularization constant α; 2) if c equals 2, then the weights are divided into two groups, namely the weights in the hidden layer and the weights in the output layer; 3) if c equals 3, then there are two distinct weight classes in the hidden layer (e.g. weights from the motif energy matrix and the weight from the chemical potential), but only a single weight class in the output layer [1]; 4) if c equals 4, then there are two independent weight classes in both the hidden layer and the output layer; 5) if c is greater than 5, then each binding position of the motif energy matrix has its own regularization constant α_{c}, as does the chemical potential, and the two weights in the output layer each have their own regularization constant (e.g. if the TF motif length equals 8, then the regularization constant α has 11 classes).
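The five assignment schemes can be sketched as a function that labels every weight with its class index. The sketch assumes a weight layout that is not spelled out in the text: four energy-matrix weights per motif position (one per nucleotide), followed by one chemical potential weight and two output-layer weights. The function names, the `"per_position"` label for the fifth scheme, and the Gaussian form of E_{w}^{c} are likewise illustrative assumptions.

```python
import numpy as np


def assign_alpha_classes(motif_length, scheme):
    """Return one class index per weight for a given assignment scheme.

    Assumed weight order: 4 * motif_length energy-matrix weights,
    then 1 chemical potential weight, then 2 output-layer weights.
    scheme is 1, 2, 3, 4 or "per_position".
    """
    n_matrix = 4 * motif_length
    zeros = np.zeros(n_matrix, dtype=int)
    if scheme == 1:            # a single alpha for all weights
        matrix, potential, outputs = zeros, 0, [0, 0]
    elif scheme == 2:          # hidden layer vs output layer
        matrix, potential, outputs = zeros, 0, [1, 1]
    elif scheme == 3:          # energy matrix, chemical potential, output layer
        matrix, potential, outputs = zeros, 1, [2, 2]
    elif scheme == 4:          # as scheme 3, but each output weight separately
        matrix, potential, outputs = zeros, 1, [2, 3]
    elif scheme == "per_position":  # one alpha per motif position
        matrix = np.repeat(np.arange(motif_length), 4)
        potential = motif_length
        outputs = [motif_length + 1, motif_length + 2]
    else:
        raise ValueError("unknown scheme: %r" % (scheme,))
    return np.concatenate([matrix, [potential], outputs])


def multi_class_regularizer(weights, labels, alphas):
    """Compute sum_c alpha_c * E_w^c with E_w^c = 0.5 * sum(w^2) in class c."""
    weights = np.asarray(weights, dtype=float)
    return float(sum(alphas[int(c)] * 0.5 * np.sum(weights[labels == c] ** 2)
                     for c in np.unique(labels)))


labels = assign_alpha_classes(8, "per_position")
print(len(np.unique(labels)))  # number of alpha classes for a motif of length 8
```

With a motif of length 8, the per-position scheme yields 8 position classes plus one class for the chemical potential and two for the output weights, which agrees with the 11 classes quoted in the text.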
Motif similarity score and Microarray datasets
To assess the quality of the predicted motif binding sites, we used a published method (the motif similarity score [15]) to estimate the similarity between the predicted motif energy matrices and the corresponding consensus sequences from the SGD database [16]. A detailed description of these calculations can be found in the previous publication [1]. Synthetic ChIP-chip datasets and real ChIP-chip experiments for nine yeast transcription factors were adopted from earlier works [1, 7]. ChIP-Seq datasets for three human TFs (STAT1, NRSF, and CTCF) were obtained from Jothi et al. [9]. More information about the preprocessing of both ChIP-chip and ChIP-Seq datasets is available in [1].
Declarations
Acknowledgements
Junbai Wang is supported by the Norwegian Cancer Society, the cluster facilities of the University of Oslo and the NOTUR project.
Authors’ Affiliations
References
- Wang J: BayesPI - a new model to study protein-DNA interactions: a case study of condition-specific protein binding parameters for Yeast transcription factors. BMC Bioinformatics 2009, 10: 345. 10.1186/1471-2105-10-345
- Neal RM: Bayesian Learning for Neural Networks. PhD thesis. University of Toronto; 1994.
- Williams PM: Bayesian Regularization and Pruning Using a Laplace Prior. Neural Computation 1995, 7(1):117–143. 10.1162/neco.1995.7.1.117
- Mackay D: Bayesian Methods for Adaptive Models. PhD thesis. California Institute of Technology; 1991.
- Mackay DJC: Comparison of Approximate Methods for Handling Hyperparameters. Neural Computation 1999, 11(5):1035–1068. 10.1162/089976699300016331
- Chen CY, Tsai HK, Hsu CM, May Chen MJ, Hung HG, Huang GT, Li WH: Discovering gapped binding sites of yeast transcription factors. Proc Natl Acad Sci USA 2008, 105(7):2527–2532. 10.1073/pnas.0712188105
- Harbison CT, Gordon DB, Lee TI, Rinaldi NJ, Macisaac KD, Danford TW, Hannett NM, Tagne JB, Reynolds DB, Yoo J, et al.: Transcriptional regulatory code of a eukaryotic genome. Nature 2004, 431(7004):99–104. 10.1038/nature02800
- Foat BC, Morozov AV, Bussemaker HJ: Statistical mechanical modeling of genome-wide transcription factor occupancy data by MatrixREDUCE. Bioinformatics 2006, 22(14):e141–149. 10.1093/bioinformatics/btl223
- Jothi R, Cuddapah S, Barski A, Cui K, Zhao K: Genome-wide identification of in vivo protein-DNA binding sites from ChIP-Seq data. Nucleic Acids Res 2008, 36(16):5221–5231. 10.1093/nar/gkn488
- Schones DE, Zhao K: Genome-wide approaches to studying chromatin modifications. Nature Reviews Genetics 2008, 9(3):179–191. 10.1038/nrg2270
- Buck MJ, Lieb JD: ChIP-chip: considerations for the design, analysis, and application of genome-wide chromatin immunoprecipitation experiments. Genomics 2004, 83(3):349–360. 10.1016/j.ygeno.2003.11.004
- Qi Y, Rolfe A, MacIsaac KD, Gerber GK, Pokholok D, Zeitlinger J, Danford T, Dowell RD, Fraenkel E, Jaakkola TS, et al.: High-resolution computational models of genome binding events. Nature Biotechnology 2006, 24(8):963–970. 10.1038/nbt1233
- Wilkinson DJ: Bayesian methods in bioinformatics and computational systems biology. Briefings in Bioinformatics 2007, 8(2):109–116. 10.1093/bib/bbm007
- Pearlmutter BA: Fast exact multiplication by the Hessian. Neural Computation 1994, 6(1):147–160. 10.1162/neco.1994.6.1.147
- Tsai HK, Huang GT, Chou MY, Lu HH, Li WH: Method for identifying transcription factor binding sites in yeast. Bioinformatics 2006, 22(14):1675–1681. 10.1093/bioinformatics/btl160
- Cherry JM, Adler C, Ball C, Chervitz SA, Dwight SS, Hester ET, Jia Y, Juvik G, Roe T, Schroeder M, et al.: SGD: Saccharomyces Genome Database. Nucleic Acids Res 1998, 26(1):73–79. 10.1093/nar/26.1.73
- Gorner W, Durchschlag E, Martinez-Pastor MT, Estruch F, Ammerer G, Hamilton B, Ruis H, Schuller C: Nuclear localization of the C2H2 zinc finger protein Msn2p is regulated by stress and protein kinase A activity. Genes Dev 1998, 12(4):586–597. 10.1101/gad.12.4.586
- Lee J, Godon C, Lagniel G, Spector D, Garin J, Labarre J, Toledano MB: Yap1 and Skn7 control two specialized oxidative stress response regulons in yeast. J Biol Chem 1999, 274(23):16040–16046. 10.1074/jbc.274.23.16040
- Raitt DC, Johnson AL, Erkine AM, Makino K, Morgan B, Gross DS, Johnston LH: The Skn7 response regulator of Saccharomyces cerevisiae interacts with Hsf1 in vivo and is required for the induction of heat shock genes by oxidative stress. Mol Biol Cell 2000, 11(7):2335–2347.
- Deckert J, Perini R, Balasubramanian B, Zitomer RS: Multiple elements and auto-repression regulate Rox1, a repressor of hypoxic genes in Saccharomyces cerevisiae. Genetics 1995, 139(3):1149–1158.
- Mai B, Breeden L: Xbp1, a stress-induced transcriptional repressor of the Saccharomyces cerevisiae Swi4/Mbp1 family. Mol Cell Biol 1997, 17(11):6491–6501.
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.