Bayesian modeling of ChIPchip data using latent variables
 Mingqi Wu^{1},
 Faming Liang^{1}Email author and
 Yanan Tian^{2}
https://doi.org/10.1186/1471210510352
© Wu et al; licensee BioMed Central Ltd. 2009
Received: 30 December 2008
Accepted: 26 October 2009
Published: 26 October 2009
Abstract
Background
The ChIPchip technology has been used in a wide range of biomedical studies, such as identification of human transcription factor binding sites, investigation of DNA methylation, and investigation of histone modifications in animals and plants. Various methods have been proposed in the literature for analyzing the ChIPchip data, such as the sliding window methods, the hidden Markov modelbased methods, and Bayesian methods. Although, due to the integrated consideration of uncertainty of the models and model parameters, Bayesian methods can potentially work better than the other two classes of methods, the existing Bayesian methods do not perform satisfactorily. They usually require multiple replicates or some extra experimental information to parametrize the model, and long CPU time due to involving of MCMC simulations.
Results
In this paper, we propose a Bayesian latent model for the ChIPchip data. The new model mainly differs from the existing Bayesian models, such as the joint deconvolution model, the hierarchical gamma mixture model, and the Bayesian hierarchical model, in two respects. Firstly, it works on the difference between the averaged treatment and control samples. This enables the use of a simple model for the data, which avoids the probespecific effect and the sample (control/treatment) effect. As a consequence, this enables an efficient MCMC simulation of the posterior distribution of the model, and also makes the model more robust to the outliers. Secondly, it models the neighboring dependence of probes by introducing a latent indicator vector. A truncated Poisson prior distribution is assumed for the latent indicator variable, with the rationale being justified at length.
Conclusion
The Bayesian latent method is successfully applied to real and ten simulated datasets, with comparisons with some of the existing Bayesian methods, hidden Markov model methods, and sliding window methods. The numerical results indicate that the Bayesian latent method can outperform other methods, especially when the data contain outliers.
Background
The chromatin immunoprecipitation (ChIP) coupled with microarray (chip) analysis, provides the researchers an efficient way of mapping proteinDNA interactions across a whole genome. The ChIPchip technology has been used in a wide range of biomedical studies, such as identification of human transcription factor binding sites, investigation of DNA methylation, and investigation of histone modifications in animals and plants [1–4]. Data from ChIPchip experiments encompass DNAprotein interaction measurements on millions of short oligonucleotides (also known as probes) which often tile one or several chromosomes or even the whole genome. The data analysis consists of two steps: (1) identifying the bound regions where DNA and the protein are crosslinked in the experiments; and (2) identifying the binding sites through sequence analysis of the bound regions. The goal of this paper is to develop an effective method for the first step analysis.
Analysis of the ChIPchip data is very challenging, due to the large amount of probes and the small number of replicates. The existing methods in the literature can be roughly grouped into three categories, the sliding window methods [1, 5–7], the hidden Markov Model (HMM) methods [6, 8–10], and the Bayesian methods [11–13]. Other methods have been suggested, e.g., by Zheng [14], Huber [15] and Reiss [16], but are less common.
The sliding window methods are to test a hypothesis for each probe using the information from the probes within a certain genomic distance sliding window, and then try to correct for the multiple hypothesis tests. The test statistics used are varied. Cawley [1] used Wilcoxon's rank sum test, Keles [7] used a scan statistic which is the average of tstatistics within the sliding window, and Ji and Wong [6] used a scan statistic which is the average of empirical Bayesian tstatistics within the sliding window. Since each test uses information from neighboring probes, the tests are not independent, rendering a difficult adjustment in the multiple hypothesis testing step. We note that the power of the sliding window tests is usually low, especially for the tests for the regions where the probe density is low. This is because there will be only very limited neighboring information available for those tests. Since, in the ChIPchip experiments, the DNA samples hybridized to the microarrays are prepared by PCR, which is known to perform independently of the form of DNA, the far probes should have similar intensity patterns as long as they are of similar positions to their nearest bound regions. This provides a basis for us to devise powerful methods that make use of information from all probes.
The HMM methods have the potential to make use of all probe information, where the model parameters are estimated using all available data. However, in most of the existing implementations of HMMs, the model parameters are estimated in an ad hoc way. For example, Li [8] estimated the model parameters using previous results on Affymetrix SNPs arrays. An exception is tileHMM [10], where the model parameters are estimated using the BaumWelch and Viterbi training algorithms [17]. However, as pointed out by Humburg et al. [10], the BaumWelch algorithm tends to converge to a local maximum of the likelihood function, and the Viterbi training algorithm even fails to converge to a local maximum of the likelihood in some cases. This renders that the parameter estimates and thus followed inference often suboptimal to the problem.
Bayesian methods have also the potential to make use of all probe information. Like the HMM methods, the Bayesian methods estimate the model parameters using all available data. However, these methods usually require multiple replicates or some extra experimental information to parameterize the model. For example, the joint binding deconvolution model [11] requires one to know the DNA fragment lengths, measured separately for each sample via extrophoretic analysis; and the hierarchical gamma mixture model(HGMM) [12] requires one to first divide the data into genomic regions containing at most one bound region, but such information is, in general, unavailable. The Bayesian hierarchical model [13] models the probe intensities using essentially a mixture of normal distributions, and models the spatial structure of the probes using a Gaussian intrinsic autoregression model [18]. Gottardo [13] developed a software for the model, Bayesian analysis of ChIPchip (or BAC for short). Using BAC [13] does not need extra experimental information, but it is extremely slow, roughly 10 hours for a dataset with 300,000 probes on a personal computer. One reason for the slow speed is the use of MCMC simulations.
In this paper, we propose a Bayesian latent variable model for tiling array data. Our method differs from the existing Bayesian methods, such as the joint binding deconvolution model [11], the HGMM [12], and the Bayesian hierarchical model [13], in several respects. Firstly, it works on the difference between the averaged treatment and control samples. This enables the use of a simple model for the data, which avoids the probespecific effect and the sample (control/treatment) effect. As a consequence, this enables an efficient MCMC simulation of the posterior distribution of the model, and also makes the model rather robust to the outliers. Secondly, it models the neighboring dependence of probes by introducing a latent indicator vector. Thirdly, it does not require multiple replicates or extra experimental information. As described below, it can work on a single intensity measurement for the probes. The Bayesian latent model has been successfully applied to several real and ten simulated datasets, with comparisons with some of the existing Bayesian methods, hidden Markov model methods, and sliding window methods. The numerical results indicate that the Bayesian latent model can outperform the others, especially when the data contain outliers. Our method is also computationally efficient; it takes about 30 minutes for a dataset with 300,000 probes on a personal computer.
The remainder of this paper is organized as follows. In Section 2, we describe the Bayesian latent model and its MCMC implementation. In Section 3, we test the Bayesian latent model on real and simulated datasets with comparisons with tileHMM, BAC and some sliding window methods. In Section 4, we discuss possible extensions of our methods and provide an explanation why our method outperforms tileHMM and BAC via a detailed comparison of the models used by them. In Section 5, we conclude the paper.
Methods
where is the intensity measurement of probe i averaged over m_{ l }replicates.
The underlying assumption for the summary statistic in (1) is that the intensity measurements for each probes has a variance independent of its genomic position. The rationale is that the DNA samples used in the experiments are prepared by PCR, which is known to perform independently of the form of DNA, and that the amount of the DNA samples provides the main sources for the variation of probe intensities. We note that a similar assumption has also been made in other Bayesian software, e.g., tileHMM [10]. Otherwise, Y_{ i }can be adjusted by its standard error to a shrinkage tstatistic [19] or an empirical Bayes tstatistic [6], depending on the estimate of the standard error. Note that both the adjustments are toward the constant variance of probes. Even with the adjustments, the Bayesian latent model developed in this paper can still work reasonably well, as the normality assumption approximately holds for the modified tstatistics.
The Bayesian latent model
which makes the prior (5) a proper distribution. The rationale behind this prior can be explained as follows. Since the length of each bound region is very short comparing to the chromosome or the whole genome, it is reasonable to view each bound region as a single point, and thus, following the standard theory of Poisson process, the total number of bound regions can be modeled as a Poisson random variable. Conditioning on the total number of bound regions, as implied by (5), we put an equal prior probability on all possible configurations of z, i.e., assuming a noninformative prior for z. The prior (5) penalizes a large value of K, where the parameter λ represents the strength of penalty. We do not recommend to use a large value of λ, as the number of true bound regions is usually small and a large value of λ will lead to discovery of too many false bound regions. Our experience shows that a value of λ around 0.01 usually works well for the ChIPchip data. In this paper, we set λ = 0.01 in all simulations. The parameter K_{max} is usually set to a large number. We set K_{max} = 5000 in all simulations of this paper. As long as the value of K_{ max }has been reasonably large, increasing it further would have a negligible effect on simulations. Finally, we would like to point out that the bound region identification problem can also be viewed as a changepoint identification problem that has been widely studied in statistics. For the changepoint identification problem, the same truncated Poisson prior has been used for modeling the total number of changepoints by many authors, see, e.g., Phillips and Smith [20], Dension et al. [21], Liang and Wong [22], and Liang [23].
otherwise, the posterior is equal to 0.
Due to the design of ChIPchip experiments, it is obvious that the intensity measurements of the neighboring probes are positively dependent. To model this dependence, we use a latent indicator vector z. This makes our model different from the existing models, such as the joint binding deconvolution model [11], the HGMM [12], and the Bayesian hierarchical model [13]. Both the joint binding deconvolution model and the Bayesian hierarchical model model the mean of probe intensities through the Gaussian random field (GMF), although their formulations may not be in the standard form of the GMF. Like the Bayesian latent model, the HGMM models the mean of probe intensities by a piecewise constant function. The difference is that the HGMM requires one to first divide the data into genomic regions containing at most one bound regions, and thus it allows different nonbound regions to have different means. Considering the physical property of PCR, which performs independently of the form of DNA, allowing different nonbound regions to have different mean values may not be necessary.
MCMC simulations
To simulate from the posterior distribution (6), we used the MetropoliswithinGibbs sampler [24]; see Appendix for the details. Note that when a component of z is updated, the sum of square terms in the posterior density can be calculated in a recursive manner, and this simplifies the computation of the posterior density greatly.
Inference of bound regions
where i is the index of the probes, w is a prespecified halfwindow size, and m is the minimum number of probes belonging to the bound region. As explained in Gottardo [13], the purpose of introducing the joint posterior probability is to remove the false bound regions, which usually consists of only few isolated probes with large enrichment effects. We found that the choice w = 5 and m = 5 works well in practice. This choice of w is consistent with the moving window size used in other work, such as Ji and Wong [6], Keles [12], and Gottardo [13]. The choice of m is chosen for robustness to false bound regions. It also reflects our belief that a bound region should consist of at least five consecutive probes with large enrichment effects.
Note that estimation of ρ_{ i }is trivial based on the samples simulated from the posterior distribution. The value of ρ_{ i }depends on a lot of parameters, such as w, m and the hyperparameters of the model. However, we found that the orders of ρ_{ i }are rather robust to these parameters. This suggests us to treat ρ_{ i }as a conventional testing pvalue, and to control the false discovery rate (FDR) of the bound regions using a FDR control method, e.g., the empirical Bayes method [25] (Efron, 2004) or the stochastic approximationbased empirical Bayes method [26]. Both the methods allow for the dependence between testing statistics and an empirical determination of the density of the testing statistics.
Although a strict control of FDR is important to the detection of bound regions, it is not the focus of this paper. In this paper, we will follow other Bayesian methods, such as BAC, to simply set a cut off value of ρ_{ i }. We classify probe i as a probe in bound regions if ρ_{ i }≥ 0.5, and classify probe i as a probe in nonbound region otherwise. As we will see in the numerical examples, the joint posterior probability can lead to a good detection of true bound regions.
Results
The Estrogen Receptor data (ER data)
The ER data were generated by Carroll [27], which mapped the association of the estrogen receptor on chromosomes 21 and 22. Here we just used a subset of the data to illustrate how the Bayesian latent model works. The subset we used is available from the BAC software [28]. It consists of intensity measurements for 30001 probes under the treatment and control conditions with three replicates each. The same subset has been used by BAC for a demonstration purpose.
For comparison, BAC and tileHMM (available at [30]) were also applied to this dataset. Both BAC and tileHMM produced a probability measure for each probe, similar to ρ_{ i }, on how likely it belongs to a bound region. The results were shown in Figures 1(c) and 1(d), respectively. The comparison shows that all the three methods produced very similar results for this dataset. However, the results produced by the Bayesian latent model are neater; the joint posterior probabilities produced by it tend to be dichotomized, either close to 1 or close to 0. This gives the user a clear classification for the bound and nonbound regions. To provide some numerical evidence for this statement, we calculated the ratio #{i : P_{ i }> 0.5}/#{i : P_{ i }> 0.05}, where #{i : P_{ i }> a} denotes the number of probes with P_{ i }greater than a. Here P_{ i }refers to the joint posterior probability for the Bayesian latent model and BAC, and the conditional probability for tileHMM. The ratios resultant from the Bayesian latent model, BAC and tileHMM are 0.816, 0.615 and 0.674, respectively.
Later, we assessed the sensitivity of the Bayesian latent method to the values of the hyperparameters ν_{min} and λ with other parameters fixed, α = β = 0.05 and ν_{max} = max_{ i }y_{ i }. The cross settings {0.5s_{ y }, 1.0s_{ y }, 1.5s_{ y }, 2s_{ y }, 2.5s_{ y }, 3s_{ y }} × [0.0001, 0.1] for (ν_{min}, λ) were tried for this dataset. For each setting, the algorithm was run 5 times, and each run consisted of 11,000 iterations. To measure the similarity of the bound regions resultant from different settings of the hyperparameters, we propose to use the adjusted Rand index [31, 32]. The adjusted Rand index is usually used in the literature of clustering, which measures the degree of agreement between two partitions of the same set of observations even when the comparing partitions having different numbers of clusters. It is obvious that the problem of bound region identification can also be viewed as a clustering problem; where the genome was partitioned into a series of segments, nonbound or bound regions, and each of the segments forms a cluster.
Refer to Hubert and Arabie [32] for the theoretical justification of r.
Sensitivity analysis for the parameters w and m.
Adjusted Rand Index  m  

3  5  7  
w  2  0.987(0.006)     
5  0.987(0.006)  0.994(0.005)  0.839(0.020)  
7  0.991(0.005)  0.985(0.006)  0.834(0.010)  
10  0.994 (0.001)  0.987(0.007)  0.831(0.007) 
The robustness of the results with respect to changes of α, β and ν_{ max }are not studied in the paper. The reason is that ν_{max} is completely determined by the data, and the values of α and β we used form a vague prior for the variance σ^{2}.
p53 data
In a ChIPchip experiment, Cawley [1] mapped the binding sites of four human transcription factors Sp1, cMyc, p53FL, and p53DO1 on chromosomes 21 and 22. The experiment consisted of 6 treatment and 6 input control arrays, and the chromosomes spanned over three chips A, B and C. Refer to Cawley [1] for the details of the experiment. For the testing purpose, p53FL data on chips A, B and C were used in this paper, which contains 14 quantitative PCR verified regions. As in Cawley [1], the data were preprocessed by filtering out the local repeats, quantilenormalized [33], rescaled to have a median feature intensity of 1000 for the purpose of adjusting batch effect, and then logtransformed. Since the normalization is not the focus of this paper, we skipped the details.
The Bayesian latent method was first applied to the p53 data. The data on chip A, chip B, and chip C were analyzed separately. Each run consisted of 11,000 iterations. Diagnostic plot for the convergence of these runs indicates that they can converge within several hundreds of iterations, even the data on each chip consists of more than 300,000 probes. Accordingly, the first 1000 iterations were discarded for the burnin process, and the samples from other iterations are used for further analysis.
Computational results for the p53FL data with a cutoff of 0.5.
Chip A  Chip B  Chip C  p53  

Method  V(2)  Total  V(3)  Total  V(9)  Total  V(14)  Total 
Bayesian latent  2  15  2  28  8  27  12  70 (127) 
BAC  2  38  1  29  9  33  12  100 (1864) 
tileHMM  2  29708  3  1944  9  2144  14  33796 
Both the Bayesian latent method and BAC work well for this example. At a cutoff of 0.5, BAC identified 100 bound regions, which cover 12 out of 14 experimentally validated bound regions. The Bayesian latent method works even better. At the same cutoff, it only identified 70 bound regions, but which also cover 12 out of 14 experimentally validated bound regions. For further comparison of the Bayesian latent method and BAC, we relaxed the cutoff value and counted the total number of regions needed to cover all experimentally validated regions. We found that the Bayesian latent method only needs to increase the total number of regions to 127, while BAC needs to increase to 1864 regions. Note that the BAC and tileHMM's results reported here may be a little different from those reported by other authors, due to the difference of the normalization methods.
Simulated data
To have a careful assessment of the performance of the Bayesian latent model, we simulated 10 datasets based on the Sp1 data of Cawley's [1] experiment. Each dataset consists of 200,000 probes, two conditions (control and IPenriched), and six replicates under each condition. The probe genomic coordinates we used in simulations were the first 200,000 genomic positions used in the Sp1 data. Each dataset consisted of 996 bound probes, forming 50 bound regions. As in Gottardo [13], the bound regions were assumed to describe a peak with the intensity function given by A exp{4(g_{ i } C)^{2}/B^{2}}, where A is the amplitude of the peak, B controls the width of the peak, C represents the center of the peak, and g_{ i }is the genomic position of probe i. We also followed Gottard [13] to generate the centers of the bound regions randomly across the set of possible coordinates while imposing a separation of at least 3000 bps between peaks; and to generate the values of parameter B uniformly between 600 and 1000 bps. The values of parameter A were generated uniformly between 3 and 5. The variance of the probe intensity was estimated from the Sp1 data.
Computational results for the simulated datasets.
Method  Total  ND  FD  r  pvalue 

Bayesian Latent  50.5 (0.58)  2.3 (0.33)  2.8 (0.57)  0.9545 (0.0080)   
tileHMM  48 (0.77)  4.2 (0.57)  2.2 (0.55)  0.9250 (0.0107)  0.02 
BAC  2934.7 (6.60)  0 (0)  2884.7 (6.6)  0.0609 (0.0003)  0.00 
Wilcox  56.1 (0.95)  3.9 (0.48)  6.4 (0.62)  0.9221 (0.0088)  0.007 
tscan  78.9 (2.11)  3.1 (0.31)  27.6 (1.71)  0.9047 (0.0089)  0.0003 
EB tScan  71.5 (1.52)  3.0 (0.39)  20.9 (1.38)  0.9176 (0.0068)  0.001 
For a thorough comparison, we also applied the sliding window methods, including the Wilcoxon rank sum test method [1], tscan statistic [7] and empirical Bayesian tscan statistic [6], to the 10 datasets. For the testing purpose, we identified the most significant 996 probes, which is the same as the true number of bound probes, as the bound probes for each of the datasets and each of the sliding window methods. We note that this cutoff number should be determined by a multiple hypothesis test in practice, and this choice makes the comparison a little favorly biased toward the sliding window methods. The results were summarized in the lower panel of Table 3, which indicate that the Bayesian latent model also outperforms the sliding window methods.
Discussion
The Bayesian latent model can be generalized in a few ways. Firstly, it can be generalized to allow different bound regions to have different variances. This generalization has been implemented by us. The numerical results are very similar to those reported in the paper.
Secondly, it can be generalized to work on the multiple replicates directly. This can be simply done by modifying (2) to multivariate normals. This generalization will certainly slow down the simulations, but the results may not be improved significantly. The reason is that under the assumption of constant variances for probe intensities, the statistic (1) is sufficient for the mean intensity of probes, while the latter has been designed in the experiment as the main measure for differentiating bound and nonbound regions.
The reason why the Bayesian latent method outperforms tileHMM and BAC can be explained as follows, through a detailed comparison of the models used by them. TileHMM implemented a standard twostate hidden Markov model, with the emission distribution of state S_{ i }, i = 1, 2, being modeled as a tdistribution. TileHMM and the Bayesian latent model are mainly different in two respects.

TileHMM is a nonBayesian method, where maximum likelihood estimates are used for all model parameters and inference for the bound regions are based on the conditional probability of the hidden states. TileHMM is trained using the BaumWelch algorithm and the Viterbi algorithm. It is known that the BaumWelch algorithm is an EM algorithm implemented in the context of HMM, and that it tends to converge to a local maximum of the likelihood function. The Viterbi algorithm provides a fast alternative to the BaumWelch algorithm, but may not converge to a local maximum. The Bayesian latent method is a Bayesian method, where inference for bound regions is based on the posterior distribution of the latent variable. The posterior distribution is simulated using the MetropoliswithinGibbs sampler, which is known to converge to its target distribution when the number of iterations becomes large.

TileHMM models all bound regions to have the same mean value, while the Bayesian latent model allows different bound regions to have different mean values. Our model fits the real data better.
The mixed performance of tileHMM on the simulated and real datasets indicates that the inferiority of tileHMM is mainly due to its training algorithm. In addition, as indicated by our simulated examples, tileHMM tends to misidentify the bound regions with relatively low probe intensities, because it models all bound regions to have the same mean value.
where ∂_{ p }corresponds to the probes p' immediately adjacent to p, n_{ p }is the cardinality of ∂_{ p }, κ is a smoothing parameter, and n is the number of neighboring probes used. The model is trained using a MCMC algorithm.
The main difference between the BAC and the Bayesian latent methods is that BAC models the control and treatment samples jointly, while the Bayesian latent method models the difference between the averaged treatment and control samples. Since BAC models the treatment and control samples jointly, it has to include the probespecific effect in the model and assume a complicated structure for the random error, assuming the variance depends on both the probe and the type of samples (control or treatment). By working on the difference between the averaged treatment and control samples, the Bayesian latent method eliminates the probe effect in the model and the dependence of the random error on the probe and the type of samples. This simplifies the model greatly and enables an efficient MCMC simulation from the the posterior distribution. In addition, due to the complicated structure of the model, BAC includes too many parameters, and this makes the model potentially overfitted, especially when the number of replicates is small. This explains why BAC always tends to identify too more bound regions than does the Bayesian latent model. On the other hand, the simplicity of the Bayesian latent model makes it rather robust to outlying probes. As indicated by our examples, it work well for all examples studied in this paper.
Conclusion
We have proposed a Bayesian latent model for the ChIPchip experiments. The new model mainly differs from the existing Bayesian models, such as the joint deconvolution model, the hierarchical gamma mixture model, and the Bayesian hierarchical model, in two respects. Firstly, it works on the difference between the averaged treatment and control samples. This enables the use of a simple model for the data, which avoids the probespecific effect and the sample (control/treatment) effect. As a consequence, this enables an efficient MCMC simulation of the posterior distribution, and also makes the model fairly robust to the outliers. Secondly, it models the neighboring dependence of probes by introducing a latent indicator vector. A truncated Poisson prior distribution is assumed for the latent indicator variable, with the rationale being justified at length.
The Bayesian latent model has been successfully applied to the ER, p53, and some simulated datasets, with comparisons with BAC, tileHMM, and some sliding window methods. The numerical results indicate that the Bayesian latent model can outperform others, especially when the dataset contains outlying probes.
Availability and requirements
An R software package called LatentChIP, which implements the Bayesian latent model under linux operating system, is available at the author's webpage [35].
Appendix
 (a)
 (b)
When a component of z is updated in step (b), the sum of square terms in the posterior density function can be calculated in a recursive manner, i.e., only the terms related to z_{ i }need to be recalculated.
Declarations
Acknowledgements
Liang's research was supported in part by the grant (DMS0607755) made by the National Science Foundation and the award (KUSC101604) made by King Abdullah University of Science and Technology (KAUST). Tian's research is supported in part by ESO9859. The authors thank the editor, the associate editor, and the referee for their comments which have led to significant improvement of this paper.
Authors’ Affiliations
References
 Cawley S, Bekiranov S, Ng HH, Kapranov P, Sekinger EA, Kampa D, Piccolboni A, Sementchenko V, Cheng J, Williams AJ, Wheeler R, Wong B, Drenkow J, Yamanaka M, Patel S, Brubaker S, Tammana H, Helt G, Struhl K, R GT: Unbiased mapping of transcription factor binding sites along human chromosomes 21 and 22 points to widespread regulation of noncoding RNAs. Cell 2004, 116: 499–509. 10.1016/S00928674(04)001278View ArticlePubMedGoogle Scholar
 Zhang X, Yazakij J, Sundaresan A, Cokus S, Chan S, Chen H, Henderson IR, Shinn P, Pellegrini M, Jacobsen SE, Ecker JR: Genomewide highresolution mapping and functional analysis of DNA methylation in arabidopsis. Cell 2006, 126: 1189–1201. 10.1016/j.cell.2006.08.003View ArticlePubMedGoogle Scholar
 Bernstein BE, Kamal M, LindbladToh K, Bekiranov S, Bailey D, Huebert DJ, McMahon S, Karlsson EK, III EJK, Gingeras TR, Schreiber SL, Lander ES: Genomic maps and comparative analysis of histone modifications in human and mouse. Cell 2005, 120: 169–181. 10.1016/j.cell.2005.01.001View ArticlePubMedGoogle Scholar
 Zhang X, Clarenz O, Cokus S, Bernatavichute YV, Goodrich J, Jacobsen SE: WholeGenome analysis of histone H3 lysine 27 trimethylation in arabidopsis. PLoS Biol 2007, 5(5):e129. 10.1371/journal.pbio.0050129PubMed CentralView ArticlePubMedGoogle Scholar
 Bertone P, Stolc V, Royce TE, Rozowsky JS, Urban AE, Zhu X, Rinn JL, Tongprasit W, Samanta M, Weissman S, gerstein M, Snyder M: Global identification of human transcribed sequences with genome tiling arrays. Science 2004, 306(5750):2242–2246. 10.1126/science.1103388View ArticlePubMedGoogle Scholar
 Ji H, Wong WH: TileMap: create chromosomal map of tiling array hybridizations. Bioinformatics 2005, 21(18):3629–3636. 10.1093/bioinformatics/bti593View ArticlePubMedGoogle Scholar
 Keles S, Laan MJ, Dudoit S, Cawley SE: Multiple testing methods for ChIPchip high density oligonucleotide array data. Journal of Computational Biology 2006, 13(3):579–613. 10.1089/cmb.2006.13.579View ArticlePubMedGoogle Scholar
 Li W, Meyer CA, Liu XS: A hidden markov model for analayzing ChIPchip experiments on genome tiling arrays and its application to p53 binding sequences. Bioinformatics 2005, 21(Suppl 1):274–282. 10.1093/bioinformatics/bti1046View ArticleGoogle Scholar
 Munch K, Gardner PP, Arctander P, Krogh A: A hidden markov model approach for determining expression from genomic tiling microarrays. BMC Bioinformatics 2006, 7: 239. 10.1186/147121057239PubMed CentralView ArticlePubMedGoogle Scholar
 Humburg P, Bulger D, Stone G: Parameter estimation for robust HMM analysis of ChIPchip data. BMC Bioinformatics 2008, 9: 343. 10.1186/147121059343PubMed CentralView ArticlePubMedGoogle Scholar
 Qi Y, Rolfe A, MacIsaac KD, Gerber GK, Pokholok D, Zeitlinger J, Danford T, Dowell RD, Fraenkel E, Jaakkola TS, Young RA, Gifford DK: HighResolution computational models of genome binding events. Nature Biotechnology 2006, 24(8):963–970. 10.1038/nbt1233View ArticlePubMedGoogle Scholar
 Keles S: Mixture modeling for genomewide localization of transcription factors. Biometrics 2007, 63: 10–21. 10.1111/j.15410420.2005.00659.xView ArticlePubMedGoogle Scholar
 Gottardo R, Li W, Johnson WE, Liu XS: A flexible and powerful bayesian hierarchical model for ChIPchip experiments. Biometrics 2008, 64: 468–478. 10.1111/j.15410420.2007.00899.xView ArticlePubMedGoogle Scholar
 Zheng M, Barrera LO, Ren B, Wu YN: ChIPchip: data, model, and analysis. Biometrics 2007, 63(3):787–796. 10.1111/j.15410420.2007.00768.xView ArticlePubMedGoogle Scholar
 Huber W, Toedling J, Steinmetz LM: Transcript mapping with highdensity oligonucleotide tiling arrays. Bioinformatics 2006, 22(16):1963–1970. 10.1093/bioinformatics/btl289View ArticlePubMedGoogle Scholar
 Reiss DJ, Facciotti MT, Baliga NS: Modelbased deconvolution of genomewide DNA binding. Bioinformatics 2008, 24(3):396–403. 10.1093/bioinformatics/btm592View ArticlePubMedGoogle Scholar
 Rabiner LR: A tutorial on hidden markov models and selected applications in speech recognition. Proceedings of the IEEE 1989, 77: 257–286. 10.1109/5.18626View ArticleGoogle Scholar
 Besag J, Kooperberg C: On conditional and intrinsic autoregrssions. Biometrika 1995, 82: 733–746.Google Scholar
 OpgenRhein R, Strimmer K: Accurate ranking of differentially expressed genes by a distributionfree shrinkage approach. Statistical Applications in Genetics and Molecular Biology 2007., 6: Article 9 Article 9Google Scholar
 Phillips DB, Smith AFM: Bayesian model comparison via jump diffusions. In Markov chain Monte Carlo in Practice. Edited by: Gilks WR, Richardson S, Spiegelhalter DJ. Chapman and Hall; 1996:215–239.Google Scholar
 Denison DG, Mallick BK, Smith AFM: Automatic Bayesian curve fitting. J Royal Statist Soc B 1998, 60: 333–350. 10.1111/14679868.00128View ArticleGoogle Scholar
 Liang F, Wong WH: Evolutionary Monte Carlo sampling: applications to C_{ p }model sampling and changepoint problem. Statistica Sinica 2000, 10: 317–342.Google Scholar
 Liang F: Improving SAMC using smoothing methods: theory and applications to bayesian model selection problems. The Annals of Statistics 2009, 37: 2626–2654. 10.1214/07AOS577View ArticleGoogle Scholar
 Müller P: A generic approach to posterior integration and gibbs sampling. In Technical Report. Volume 09. Purdue University, West Lafayette, Indiana; 1991.Google Scholar
 Efron B: Largescale simultaneous hypothesis testing:the choice of a null hypothesis. Journal of the American Statistical Association 2004, 99: 96–104. 10.1198/016214504000000089View ArticleGoogle Scholar
 Liang F, Zhang J: Estimation the false discovery rate using the stochastic approximation algorithm. Biometrika 2008, 95(4):961–977. 10.1093/biomet/asn036View ArticleGoogle Scholar
 Carroll JS, Liu XS, Brodsky AS, Li W, Meyer CA, Szary AJ, Shao W, Hestermann EV, Geistlinger TR, Fox EA, Silver PA, Brown M: Chromosomewide mapping of estrogen receptor binding reveals longrange regulation requiring the forkhead protein foxal. Cell 2005, 122: 33–43. 10.1016/j.cell.2005.05.008View ArticlePubMedGoogle Scholar
 BAC software[http://www.bioconductor.org/packages/2.2/bioc]
 Gelman A, Rubin DB: Inference from iterative simulation using multiple sequences (with discussion). Statistical Science 1992, 7: 457–511. 10.1214/ss/1177011136View ArticleGoogle Scholar
 BAC and tileHMM software[http://cran.rproject.org/web/packages]
 Rand WM: Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association 1971, 66: 846–850. 10.2307/2284239View ArticleGoogle Scholar
 Hubert L, Arabie P: Comparing partitions. Journal of Classification 1985, 2: 193–218. 10.1007/BF01908075View ArticleGoogle Scholar
 Bolstad BM, Irizarry RA, Astrand M, Speed TP: A comparison of normalization methods for high density ologonucleotide array data based on variance and bias. Bioinformatics 2003, 19: 185–193. 10.1093/bioinformatics/19.2.185View ArticlePubMedGoogle Scholar
 Bradley A: The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognition 1997, 30: 1145–1159. 10.1016/S00313203(96)001422View ArticleGoogle Scholar
 R package: LatentChIP[http://www.stat.tamu.edu/~fliang]
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.