 Methodology article
 Open Access
 Published:
Bayesian modeling of ChIPchip data using latent variables
BMC Bioinformatics volume 10, Article number: 352 (2009)
Abstract
Background
The ChIPchip technology has been used in a wide range of biomedical studies, such as identification of human transcription factor binding sites, investigation of DNA methylation, and investigation of histone modifications in animals and plants. Various methods have been proposed in the literature for analyzing the ChIPchip data, such as the sliding window methods, the hidden Markov modelbased methods, and Bayesian methods. Although, due to the integrated consideration of uncertainty of the models and model parameters, Bayesian methods can potentially work better than the other two classes of methods, the existing Bayesian methods do not perform satisfactorily. They usually require multiple replicates or some extra experimental information to parametrize the model, and long CPU time due to involving of MCMC simulations.
Results
In this paper, we propose a Bayesian latent model for the ChIPchip data. The new model mainly differs from the existing Bayesian models, such as the joint deconvolution model, the hierarchical gamma mixture model, and the Bayesian hierarchical model, in two respects. Firstly, it works on the difference between the averaged treatment and control samples. This enables the use of a simple model for the data, which avoids the probespecific effect and the sample (control/treatment) effect. As a consequence, this enables an efficient MCMC simulation of the posterior distribution of the model, and also makes the model more robust to the outliers. Secondly, it models the neighboring dependence of probes by introducing a latent indicator vector. A truncated Poisson prior distribution is assumed for the latent indicator variable, with the rationale being justified at length.
Conclusion
The Bayesian latent method is successfully applied to real and ten simulated datasets, with comparisons with some of the existing Bayesian methods, hidden Markov model methods, and sliding window methods. The numerical results indicate that the Bayesian latent method can outperform other methods, especially when the data contain outliers.
Background
The chromatin immunoprecipitation (ChIP) coupled with microarray (chip) analysis, provides the researchers an efficient way of mapping proteinDNA interactions across a whole genome. The ChIPchip technology has been used in a wide range of biomedical studies, such as identification of human transcription factor binding sites, investigation of DNA methylation, and investigation of histone modifications in animals and plants [1–4]. Data from ChIPchip experiments encompass DNAprotein interaction measurements on millions of short oligonucleotides (also known as probes) which often tile one or several chromosomes or even the whole genome. The data analysis consists of two steps: (1) identifying the bound regions where DNA and the protein are crosslinked in the experiments; and (2) identifying the binding sites through sequence analysis of the bound regions. The goal of this paper is to develop an effective method for the first step analysis.
Analysis of the ChIPchip data is very challenging, due to the large amount of probes and the small number of replicates. The existing methods in the literature can be roughly grouped into three categories, the sliding window methods [1, 5–7], the hidden Markov Model (HMM) methods [6, 8–10], and the Bayesian methods [11–13]. Other methods have been suggested, e.g., by Zheng [14], Huber [15] and Reiss [16], but are less common.
The sliding window methods are to test a hypothesis for each probe using the information from the probes within a certain genomic distance sliding window, and then try to correct for the multiple hypothesis tests. The test statistics used are varied. Cawley [1] used Wilcoxon's rank sum test, Keles [7] used a scan statistic which is the average of tstatistics within the sliding window, and Ji and Wong [6] used a scan statistic which is the average of empirical Bayesian tstatistics within the sliding window. Since each test uses information from neighboring probes, the tests are not independent, rendering a difficult adjustment in the multiple hypothesis testing step. We note that the power of the sliding window tests is usually low, especially for the tests for the regions where the probe density is low. This is because there will be only very limited neighboring information available for those tests. Since, in the ChIPchip experiments, the DNA samples hybridized to the microarrays are prepared by PCR, which is known to perform independently of the form of DNA, the far probes should have similar intensity patterns as long as they are of similar positions to their nearest bound regions. This provides a basis for us to devise powerful methods that make use of information from all probes.
The HMM methods have the potential to make use of all probe information, where the model parameters are estimated using all available data. However, in most of the existing implementations of HMMs, the model parameters are estimated in an ad hoc way. For example, Li [8] estimated the model parameters using previous results on Affymetrix SNPs arrays. An exception is tileHMM [10], where the model parameters are estimated using the BaumWelch and Viterbi training algorithms [17]. However, as pointed out by Humburg et al. [10], the BaumWelch algorithm tends to converge to a local maximum of the likelihood function, and the Viterbi training algorithm even fails to converge to a local maximum of the likelihood in some cases. This renders that the parameter estimates and thus followed inference often suboptimal to the problem.
Bayesian methods have also the potential to make use of all probe information. Like the HMM methods, the Bayesian methods estimate the model parameters using all available data. However, these methods usually require multiple replicates or some extra experimental information to parameterize the model. For example, the joint binding deconvolution model [11] requires one to know the DNA fragment lengths, measured separately for each sample via extrophoretic analysis; and the hierarchical gamma mixture model(HGMM) [12] requires one to first divide the data into genomic regions containing at most one bound region, but such information is, in general, unavailable. The Bayesian hierarchical model [13] models the probe intensities using essentially a mixture of normal distributions, and models the spatial structure of the probes using a Gaussian intrinsic autoregression model [18]. Gottardo [13] developed a software for the model, Bayesian analysis of ChIPchip (or BAC for short). Using BAC [13] does not need extra experimental information, but it is extremely slow, roughly 10 hours for a dataset with 300,000 probes on a personal computer. One reason for the slow speed is the use of MCMC simulations.
In this paper, we propose a Bayesian latent variable model for tiling array data. Our method differs from the existing Bayesian methods, such as the joint binding deconvolution model [11], the HGMM [12], and the Bayesian hierarchical model [13], in several respects. Firstly, it works on the difference between the averaged treatment and control samples. This enables the use of a simple model for the data, which avoids the probespecific effect and the sample (control/treatment) effect. As a consequence, this enables an efficient MCMC simulation of the posterior distribution of the model, and also makes the model rather robust to the outliers. Secondly, it models the neighboring dependence of probes by introducing a latent indicator vector. Thirdly, it does not require multiple replicates or extra experimental information. As described below, it can work on a single intensity measurement for the probes. The Bayesian latent model has been successfully applied to several real and ten simulated datasets, with comparisons with some of the existing Bayesian methods, hidden Markov model methods, and sliding window methods. The numerical results indicate that the Bayesian latent model can outperform the others, especially when the data contain outliers. Our method is also computationally efficient; it takes about 30 minutes for a dataset with 300,000 probes on a personal computer.
The remainder of this paper is organized as follows. In Section 2, we describe the Bayesian latent model and its MCMC implementation. In Section 3, we test the Bayesian latent model on real and simulated datasets with comparisons with tileHMM, BAC and some sliding window methods. In Section 4, we discuss possible extensions of our methods and provide an explanation why our method outperforms tileHMM and BAC via a detailed comparison of the models used by them. In Section 5, we conclude the paper.
Methods
Consider a ChIPchip experiment with two conditions, treatment and control. Let X_{1} and X_{2} denote, respectively, the samples measured under the treatment and control conditions. Each sample has m_{ l }, l = 1, 2, replicates providing measurements for n genomic locations along a chromosome or the genome. Suppose that these samples have been normalized and logtransformed. In this paper, we summarize the measurements for each probe by
where is the intensity measurement of probe i averaged over m_{ l }replicates.
The underlying assumption for the summary statistic in (1) is that the intensity measurements for each probes has a variance independent of its genomic position. The rationale is that the DNA samples used in the experiments are prepared by PCR, which is known to perform independently of the form of DNA, and that the amount of the DNA samples provides the main sources for the variation of probe intensities. We note that a similar assumption has also been made in other Bayesian software, e.g., tileHMM [10]. Otherwise, Y_{ i }can be adjusted by its standard error to a shrinkage tstatistic [19] or an empirical Bayes tstatistic [6], depending on the estimate of the standard error. Note that both the adjustments are toward the constant variance of probes. Even with the adjustments, the Bayesian latent model developed in this paper can still work reasonably well, as the normality assumption approximately holds for the modified tstatistics.
The Bayesian latent model
Suppose that the data consists of a total of K bound regions, and that region k consists of n_{ k }(k = 1,...,K) consecutive probes. For convenience, we call all the nonbound regions by region 0 and denote by n_{0}, the total number of probes contained in all the nonbound regions, although the probes in which may be nonconsecutive. Thus, we have . Let z= (z_{1},...,z_{ n }) be a latent binary vector associated with the probes, where z_{ i }= 1 indicates that probe i belongs to a bound region and 0 otherwise. Given z, we can reindex (y_{1},...,y_{ n }), a realization of (Y_{1},...,Y_{ n }), by y_{ kj }, k = 0,...,K, j = 1,...,n_{ k }. Then y_{ kj }can be modeled as follows,
where μ_{0} is the overall mean, which models the difference of sample effects (between the treatment samples and the control samples); ν_{0} = 0 and ν_{ k }> 0, k = 1,...,K accounts for the difference of probe intensities in different bound regions; ϵ_{ kj }s are random errors independently and identically distributed as N(0, σ^{2}). In words, model (2) assumes that, conditioning on the latent vector z, y_{ kj }s are mutually independent and also identically distributed within the same bound region. We are aware that for the tiling array data, the probe intensities tend to form a peak around the true binding site. Since, given z, the order of probes is meaningless to us, the model (2) is appropriate if ignoring the order of the probes. We note that a similar assumption has also been used in the HGMM and HMM methods. Conditioning on z, the likelihood of the model can be written as
To conduct a Bayesian analysis for the model, we specify the following prior distributions for the model parameters:
where IG(·,·) denotes an inverse Gamma distribution, U(·,·) denotes a uniform distribution, and α, β, ν_{ min }, ν_{ max }are hyperparameters. In this paper, we set α = β = 0.05, which form a vague prior for σ^{2}; and set ν_{ min }= 2s_{ y }and ν_{ max }= max_{ i }y_{ i }, where s_{ y }is the sample standard error of y_{ i }. Different values of ν_{ min }, e.g., s_{ y }and 1.5s_{ y }, have also been tried in our simulations, and the results are similar. The sensitivity issue of the Bayesian latent model to the hyperparameters will be further discussed in Section 3. In addition, we assume that the latent vector z follows a truncated Poisson distribution,
where K, denoting the total number of bound regions specified by z, and is thus a function of z; λ is a hyperparameter; K_{max} is the largest number of bounded regions allowed by the model; and
which makes the prior (5) a proper distribution. The rationale behind this prior can be explained as follows. Since the length of each bound region is very short comparing to the chromosome or the whole genome, it is reasonable to view each bound region as a single point, and thus, following the standard theory of Poisson process, the total number of bound regions can be modeled as a Poisson random variable. Conditioning on the total number of bound regions, as implied by (5), we put an equal prior probability on all possible configurations of z, i.e., assuming a noninformative prior for z. The prior (5) penalizes a large value of K, where the parameter λ represents the strength of penalty. We do not recommend to use a large value of λ, as the number of true bound regions is usually small and a large value of λ will lead to discovery of too many false bound regions. Our experience shows that a value of λ around 0.01 usually works well for the ChIPchip data. In this paper, we set λ = 0.01 in all simulations. The parameter K_{max} is usually set to a large number. We set K_{max} = 5000 in all simulations of this paper. As long as the value of K_{ max }has been reasonably large, increasing it further would have a negligible effect on simulations. Finally, we would like to point out that the bound region identification problem can also be viewed as a changepoint identification problem that has been widely studied in statistics. For the changepoint identification problem, the same truncated Poisson prior has been used for modeling the total number of changepoints by many authors, see, e.g., Phillips and Smith [20], Dension et al. [21], Liang and Wong [22], and Liang [23].
If ν_{1},...,ν_{ K }∈ (ν_{ min }, ν_{ max }), combining the likelihood and prior distributions, integrating out σ^{2}, and taking the logarithm, we get the following logposterior density function
otherwise, the posterior is equal to 0.
Due to the design of ChIPchip experiments, it is obvious that the intensity measurements of the neighboring probes are positively dependent. To model this dependence, we use a latent indicator vector z. This makes our model different from the existing models, such as the joint binding deconvolution model [11], the HGMM [12], and the Bayesian hierarchical model [13]. Both the joint binding deconvolution model and the Bayesian hierarchical model model the mean of probe intensities through the Gaussian random field (GMF), although their formulations may not be in the standard form of the GMF. Like the Bayesian latent model, the HGMM models the mean of probe intensities by a piecewise constant function. The difference is that the HGMM requires one to first divide the data into genomic regions containing at most one bound regions, and thus it allows different nonbound regions to have different means. Considering the physical property of PCR, which performs independently of the form of DNA, allowing different nonbound regions to have different mean values may not be necessary.
MCMC simulations
To simulate from the posterior distribution (6), we used the MetropoliswithinGibbs sampler [24]; see Appendix for the details. Note that when a component of z is updated, the sum of square terms in the posterior density can be calculated in a recursive manner, and this simplifies the computation of the posterior density greatly.
Inference of bound regions
Let p_{ i }= P(z_{ i }= 1y) be the marginal posterior probability that probe i belongs to a bound region. Since the bound regions are expected to consist of several consecutive probes with positive IPenrichment effects, the regions which consists of several consecutive probes with high marginal posterior probabilities are likely to be bound regions. To identify such regions, we follow Gottardo [13] to consider the joint posterior probability
where i is the index of the probes, w is a prespecified halfwindow size, and m is the minimum number of probes belonging to the bound region. As explained in Gottardo [13], the purpose of introducing the joint posterior probability is to remove the false bound regions, which usually consists of only few isolated probes with large enrichment effects. We found that the choice w = 5 and m = 5 works well in practice. This choice of w is consistent with the moving window size used in other work, such as Ji and Wong [6], Keles [12], and Gottardo [13]. The choice of m is chosen for robustness to false bound regions. It also reflects our belief that a bound region should consist of at least five consecutive probes with large enrichment effects.
Note that estimation of ρ_{ i }is trivial based on the samples simulated from the posterior distribution. The value of ρ_{ i }depends on a lot of parameters, such as w, m and the hyperparameters of the model. However, we found that the orders of ρ_{ i }are rather robust to these parameters. This suggests us to treat ρ_{ i }as a conventional testing pvalue, and to control the false discovery rate (FDR) of the bound regions using a FDR control method, e.g., the empirical Bayes method [25] (Efron, 2004) or the stochastic approximationbased empirical Bayes method [26]. Both the methods allow for the dependence between testing statistics and an empirical determination of the density of the testing statistics.
Although a strict control of FDR is important to the detection of bound regions, it is not the focus of this paper. In this paper, we will follow other Bayesian methods, such as BAC, to simply set a cut off value of ρ_{ i }. We classify probe i as a probe in bound regions if ρ_{ i }≥ 0.5, and classify probe i as a probe in nonbound region otherwise. As we will see in the numerical examples, the joint posterior probability can lead to a good detection of true bound regions.
Results
The Estrogen Receptor data (ER data)
The ER data were generated by Carroll [27], which mapped the association of the estrogen receptor on chromosomes 21 and 22. Here we just used a subset of the data to illustrate how the Bayesian latent model works. The subset we used is available from the BAC software [28]. It consists of intensity measurements for 30001 probes under the treatment and control conditions with three replicates each. The same subset has been used by BAC for a demonstration purpose.
The Bayesian latent model was first applied to the dataset. The algorithm was run 5 times. Each run consisted of 11000 iterations, and cost about 4.4 minutes CPU time on a personal computer (Intel Xeon 2.80 GHz, 1 G memory, Linux operating system). All computations of this paper were done on this computer. Based on the GelmanRubin diagnostic plot [29] [See Additional file 1 and 2], we discarded the first 1000 iterations for the burnin process, and used the remaining 10,000 iterations for further inference. Figure 1(b) shows the estimates of the joint posterior probabilities resulted from one run.
For comparison, BAC and tileHMM (available at [30]) were also applied to this dataset. Both BAC and tileHMM produced a probability measure for each probe, similar to ρ_{ i }, on how likely it belongs to a bound region. The results were shown in Figures 1(c) and 1(d), respectively. The comparison shows that all the three methods produced very similar results for this dataset. However, the results produced by the Bayesian latent model are neater; the joint posterior probabilities produced by it tend to be dichotomized, either close to 1 or close to 0. This gives the user a clear classification for the bound and nonbound regions. To provide some numerical evidence for this statement, we calculated the ratio #{i : P_{ i }> 0.5}/#{i : P_{ i }> 0.05}, where #{i : P_{ i }> a} denotes the number of probes with P_{ i }greater than a. Here P_{ i }refers to the joint posterior probability for the Bayesian latent model and BAC, and the conditional probability for tileHMM. The ratios resultant from the Bayesian latent model, BAC and tileHMM are 0.816, 0.615 and 0.674, respectively.
Later, we assessed the sensitivity of the Bayesian latent method to the values of the hyperparameters ν_{min} and λ with other parameters fixed, α = β = 0.05 and ν_{max} = max_{ i }y_{ i }. The cross settings {0.5s_{ y }, 1.0s_{ y }, 1.5s_{ y }, 2s_{ y }, 2.5s_{ y }, 3s_{ y }} × [0.0001, 0.1] for (ν_{min}, λ) were tried for this dataset. For each setting, the algorithm was run 5 times, and each run consisted of 11,000 iterations. To measure the similarity of the bound regions resultant from different settings of the hyperparameters, we propose to use the adjusted Rand index [31, 32]. The adjusted Rand index is usually used in the literature of clustering, which measures the degree of agreement between two partitions of the same set of observations even when the comparing partitions having different numbers of clusters. It is obvious that the problem of bound region identification can also be viewed as a clustering problem; where the genome was partitioned into a series of segments, nonbound or bound regions, and each of the segments forms a cluster.
The adjusted Rand index is defined as follows. Let Ω denote a set of n observations, let C = {c_{1},...,c_{ s }} and represent two partitions of Ω, let n_{ ij }be the number of observations that are in both cluster c_{ i }and cluster , let n_{ i }. be the number of observations in cluster c_{ i }, and let n._{ j }be the number of observations in cluster . The adjusted Rand index is
A higher value of r means a higher correspondence between the two partitions. When the two partitions are identical, r is 1. When a partition is random, the expectation of r is 0. Under the generalized hypergeometric model, it can be shown [32] that
Refer to Hubert and Arabie [32] for the theoretical justification of r.
In calculations of the adjusted Rand indices for the sensitivity experiments, we used the result shown in Figure 1(b) as the standard; that is, if a partition is identical to that partition, r will be 1. The results are summarized in Figure 2, where the adjusted R and index is plotted as a function of log(λ) for different setting of ν_{min}. Figure 2 shows that, for each value of ν_{min}, the adjusted R and index varies between 0.9 and 1.0 as λ runs from 0.0001 to 0.1. This indicates that the performance of the Bayesian latent model is rather robust to the choices of ν_{min} and λ.
Finally, we examined the robustness of the Bayesian latent model to different choice of w and m with other parameters fixed at α = β = 0.05, λ = 0.01, and ν_{ min }= 2s_{ y }. The cross settings {2, 5, 7, 10} × {3, 5, 7} for (w, m) were tried for this dataset. Again, the adjusted R and index is used as the similarity criterion and the result shown in Figure 1(b) as the standard. The results were summarized in Table 1, which indicates, for this dataset, the Bayesian latent model is quite robust to the choices of w and m. In practice, to achieve robustness to outlying probes, we suggest to avoid choosing a small m. In all the following simulations, we set m = 5.
The robustness of the results with respect to changes of α, β and ν_{ max }are not studied in the paper. The reason is that ν_{max} is completely determined by the data, and the values of α and β we used form a vague prior for the variance σ^{2}.
p53 data
In a ChIPchip experiment, Cawley [1] mapped the binding sites of four human transcription factors Sp1, cMyc, p53FL, and p53DO1 on chromosomes 21 and 22. The experiment consisted of 6 treatment and 6 input control arrays, and the chromosomes spanned over three chips A, B and C. Refer to Cawley [1] for the details of the experiment. For the testing purpose, p53FL data on chips A, B and C were used in this paper, which contains 14 quantitative PCR verified regions. As in Cawley [1], the data were preprocessed by filtering out the local repeats, quantilenormalized [33], rescaled to have a median feature intensity of 1000 for the purpose of adjusting batch effect, and then logtransformed. Since the normalization is not the focus of this paper, we skipped the details.
The Bayesian latent method was first applied to the p53 data. The data on chip A, chip B, and chip C were analyzed separately. Each run consisted of 11,000 iterations. Diagnostic plot for the convergence of these runs indicates that they can converge within several hundreds of iterations, even the data on each chip consists of more than 300,000 probes. Accordingly, the first 1000 iterations were discarded for the burnin process, and the samples from other iterations are used for further analysis.
For comparison, BAC and tileHMM were also applied to this example. Given the posterior probabilities, a cutoff of 0.5 was used for all methods to detect bound regions. All resultant bound regions having less than 3 probes or 100 bps were considered to be spurious and removed, and those regions separated by 500 bps or less were merged together to form a predicted bound regions following the approach taken by Cawley [1]. The results were summarized in Table 2. Although tileHMM detected all the 14 validated regions, it essentially fails for this example. It identified a total of 33796 bound regions, which should contain too many false bound regions. We suspect that the failure of tileHMM for this example is due to its training algorithm; it is very likely that tileHMM converged to a local maximum of the likelihood function. This have been noted by Humburg et al. [10], tileHMM may converge to a local maximum of the likelihood function with either the BaumWelch algorithm or the Viterbi training algorithm, rendering an ineffective inference for the model.
Both the Bayesian latent method and BAC work well for this example. At a cutoff of 0.5, BAC identified 100 bound regions, which cover 12 out of 14 experimentally validated bound regions. The Bayesian latent method works even better. At the same cutoff, it only identified 70 bound regions, but which also cover 12 out of 14 experimentally validated bound regions. For further comparison of the Bayesian latent method and BAC, we relaxed the cutoff value and counted the total number of regions needed to cover all experimentally validated regions. We found that the Bayesian latent method only needs to increase the total number of regions to 127, while BAC needs to increase to 1864 regions. Note that the BAC and tileHMM's results reported here may be a little different from those reported by other authors, due to the difference of the normalization methods.
Simulated data
To have a careful assessment of the performance of the Bayesian latent model, we simulated 10 datasets based on the Sp1 data of Cawley's [1] experiment. Each dataset consists of 200,000 probes, two conditions (control and IPenriched), and six replicates under each condition. The probe genomic coordinates we used in simulations were the first 200,000 genomic positions used in the Sp1 data. Each dataset consisted of 996 bound probes, forming 50 bound regions. As in Gottardo [13], the bound regions were assumed to describe a peak with the intensity function given by A exp{4(g_{ i } C)^{2}/B^{2}}, where A is the amplitude of the peak, B controls the width of the peak, C represents the center of the peak, and g_{ i }is the genomic position of probe i. We also followed Gottard [13] to generate the centers of the bound regions randomly across the set of possible coordinates while imposing a separation of at least 3000 bps between peaks; and to generate the values of parameter B uniformly between 600 and 1000 bps. The values of parameter A were generated uniformly between 3 and 5. The variance of the probe intensity was estimated from the Sp1 data.
Firstly, the performance of different models is assessed using the area under the receiving operating characteristic (ROC) curve [34] and the error rate. The former is a standard measure for the performance of a multiple hypothesis testing method, which shows the true positive discovery rate (sensitivity) against the false positive discovery rate (1  specificity) at probe level. The later is a standard measure for the performance of a classification method, which shows the proportion of totally incorrect probe calls, including both false positives and false negatives, against different cutoff values. All the three methods, Bayesian latent method, BAC and tileHMM, were applied to the 10 full datasets. The averaged ROC curve and error rate across a range of cutoffs are obtained and plotted in Figure 3. As indicated by Figure 3(a), the Bayesian latent method and tileHMM have very similar performances on these datasets, and both are much better than BAC. By further examining the plot on the right, which provides a closer view of the area enclosed by the dotted line and axis on the left, it is easy to see that the Bayesian latent method is better than tileHMM for this example. Next, we checked the error rate for each model. The results were shown in Figure 3(b). Again, the Bayesian latent method and tileHMM perform very well and both are much better than BAC. From the right plot of Figure 3(b), we can see that the optimal cutoff for tileHMM is close to 0.3, while it is close to 0.5 for the Bayesian latent method. Figure 3(b) also suggest that both the Bayesian latent method and tileHMM are robust to the choice of cutoff values, ranging from 0.2 to 0.8, while BAC is not.
Later, based on the true bound regions which are known for these 10 simulate datasets, we use the adjusted Rand index r to assess the quality of the results produced by the above three algorithms. In addition, we calculated pvalues of the twosample ttests, H_{0}: r_{ BL }= r_{ O }vs H_{1}: r_{ BL }> r_{ O }, where r_{ BL }denotes the rvalue produced by the Bayesian latent model, and r_{ O }denotes the rvalue produced by the other method. The results were summarized in Table 3. The tests indicate that the Bayesian latent model can lead to more accurate identifications of true bound regions than BAC and tileHMM.
For a thorough comparison, we also applied the sliding window methods, including the Wilcoxon rank sum test method [1], tscan statistic [7] and empirical Bayesian tscan statistic [6], to the 10 datasets. For the testing purpose, we identified the most significant 996 probes, which is the same as the true number of bound probes, as the bound probes for each of the datasets and each of the sliding window methods. We note that this cutoff number should be determined by a multiple hypothesis test in practice, and this choice makes the comparison a little favorly biased toward the sliding window methods. The results were summarized in the lower panel of Table 3, which indicate that the Bayesian latent model also outperforms the sliding window methods.
Discussion
The Bayesian latent model can be generalized in a few ways. Firstly, it can be generalized to allow different bound regions to have different variances. This generalization has been implemented by us. The numerical results are very similar to those reported in the paper.
Secondly, it can be generalized to work on the multiple replicates directly. This can be simply done by modifying (2) to multivariate normals. This generalization will certainly slow down the simulations, but the results may not be improved significantly. The reason is that under the assumption of constant variances for probe intensities, the statistic (1) is sufficient for the mean intensity of probes, while the latter has been designed in the experiment as the main measure for differentiating bound and nonbound regions.
The reason why the Bayesian latent method outperforms tileHMM and BAC can be explained as follows, through a detailed comparison of the models used by them. TileHMM implemented a standard twostate hidden Markov model, with the emission distribution of state S_{ i }, i = 1, 2, being modeled as a tdistribution. TileHMM and the Bayesian latent model are mainly different in two respects.

TileHMM is a nonBayesian method, where maximum likelihood estimates are used for all model parameters and inference for the bound regions are based on the conditional probability of the hidden states. TileHMM is trained using the BaumWelch algorithm and the Viterbi algorithm. It is known that the BaumWelch algorithm is an EM algorithm implemented in the context of HMM, and that it tends to converge to a local maximum of the likelihood function. The Viterbi algorithm provides a fast alternative to the BaumWelch algorithm, but may not converge to a local maximum. The Bayesian latent method is a Bayesian method, where inference for bound regions is based on the posterior distribution of the latent variable. The posterior distribution is simulated using the MetropoliswithinGibbs sampler, which is known to converge to its target distribution when the number of iterations becomes large.

TileHMM models all bound regions to have the same mean value, while the Bayesian latent model allows different bound regions to have different mean values. Our model fits the real data better.
The mixed performance of tileHMM on the simulated and real datasets indicates that the inferiority of tileHMM is mainly due to its training algorithm. In addition, as indicated by our simulated examples, tileHMM tends to misidentify the bound regions with relatively low probe intensities, because it models all bound regions to have the same mean value.
BAC models the probe intensity using a mixedeffect model:
where c = 1 denotes the control sample, c = 2 denotes the treatment sample, r is the index of replicates; μ_{ p }is a random probe effect distributed as ; γ_{ cp }is the probe enrichment effect with γ_{1p}= 0; and ϵ_{ cpr }is the random error distributed as . The authors further modeled the probe enrichment effect by a mixture of a point mass at zero and a truncated Gaussian distribution, i.e.,
where TN_{+}() denotes a truncated Gaussian distribution truncated at zero, and w_{ p }is the a priori proportion of probes belonging to nonbound regions. The a priori proportion depends on a latent Markov random field prior θ= {θ_{ p }, 1 ≤ p ≤ P }, through a logistic transformation
and a Gaussian intrinsic autoregressive model [18] for θ,
where ∂_{ p }corresponds to the probes p' immediately adjacent to p, n_{ p }is the cardinality of ∂_{ p }, κ is a smoothing parameter, and n is the number of neighboring probes used. The model is trained using a MCMC algorithm.
The main difference between the BAC and the Bayesian latent methods is that BAC models the control and treatment samples jointly, while the Bayesian latent method models the difference between the averaged treatment and control samples. Since BAC models the treatment and control samples jointly, it has to include the probespecific effect in the model and assume a complicated structure for the random error, assuming the variance depends on both the probe and the type of samples (control or treatment). By working on the difference between the averaged treatment and control samples, the Bayesian latent method eliminates the probe effect in the model and the dependence of the random error on the probe and the type of samples. This simplifies the model greatly and enables an efficient MCMC simulation from the the posterior distribution. In addition, due to the complicated structure of the model, BAC includes too many parameters, and this makes the model potentially overfitted, especially when the number of replicates is small. This explains why BAC always tends to identify too more bound regions than does the Bayesian latent model. On the other hand, the simplicity of the Bayesian latent model makes it rather robust to outlying probes. As indicated by our examples, it work well for all examples studied in this paper.
Conclusion
We have proposed a Bayesian latent model for the ChIPchip experiments. The new model mainly differs from the existing Bayesian models, such as the joint deconvolution model, the hierarchical gamma mixture model, and the Bayesian hierarchical model, in two respects. Firstly, it works on the difference between the averaged treatment and control samples. This enables the use of a simple model for the data, which avoids the probespecific effect and the sample (control/treatment) effect. As a consequence, this enables an efficient MCMC simulation of the posterior distribution, and also makes the model fairly robust to the outliers. Secondly, it models the neighboring dependence of probes by introducing a latent indicator vector. A truncated Poisson prior distribution is assumed for the latent indicator variable, with the rationale being justified at length.
The Bayesian latent model has been successfully applied to the ER, p53, and some simulated datasets, with comparisons with BAC, tileHMM, and some sliding window methods. The numerical results indicate that the Bayesian latent model can outperform others, especially when the dataset contains outlying probes.
Availability and requirements
An R software package called LatentChIP, which implements the Bayesian latent model under linux operating system, is available at the author's webpage [35].
Appendix
The scheme for simulating samples from the posterior distribution:

(a)
Conditioned on z^{(t)}, updating
using the MetropolisHastings (MH) algorithm, where t indexes the number of iteration cycles.

(b)
Conditioned on
, updating each component of z^{(t)}according to the following rule: Given
: change
to
using the MH algorithm.
When a component of z is updated in step (b), the sum of square terms in the posterior density function can be calculated in a recursive manner, i.e., only the terms related to z_{ i }need to be recalculated.
References
 1.
Cawley S, Bekiranov S, Ng HH, Kapranov P, Sekinger EA, Kampa D, Piccolboni A, Sementchenko V, Cheng J, Williams AJ, Wheeler R, Wong B, Drenkow J, Yamanaka M, Patel S, Brubaker S, Tammana H, Helt G, Struhl K, R GT: Unbiased mapping of transcription factor binding sites along human chromosomes 21 and 22 points to widespread regulation of noncoding RNAs. Cell 2004, 116: 499–509. 10.1016/S00928674(04)001278
 2.
Zhang X, Yazakij J, Sundaresan A, Cokus S, Chan S, Chen H, Henderson IR, Shinn P, Pellegrini M, Jacobsen SE, Ecker JR: Genomewide highresolution mapping and functional analysis of DNA methylation in arabidopsis. Cell 2006, 126: 1189–1201. 10.1016/j.cell.2006.08.003
 3.
Bernstein BE, Kamal M, LindbladToh K, Bekiranov S, Bailey D, Huebert DJ, McMahon S, Karlsson EK, III EJK, Gingeras TR, Schreiber SL, Lander ES: Genomic maps and comparative analysis of histone modifications in human and mouse. Cell 2005, 120: 169–181. 10.1016/j.cell.2005.01.001
 4.
Zhang X, Clarenz O, Cokus S, Bernatavichute YV, Goodrich J, Jacobsen SE: WholeGenome analysis of histone H3 lysine 27 trimethylation in arabidopsis. PLoS Biol 2007, 5(5):e129. 10.1371/journal.pbio.0050129
 5.
Bertone P, Stolc V, Royce TE, Rozowsky JS, Urban AE, Zhu X, Rinn JL, Tongprasit W, Samanta M, Weissman S, gerstein M, Snyder M: Global identification of human transcribed sequences with genome tiling arrays. Science 2004, 306(5750):2242–2246. 10.1126/science.1103388
 6.
Ji H, Wong WH: TileMap: create chromosomal map of tiling array hybridizations. Bioinformatics 2005, 21(18):3629–3636. 10.1093/bioinformatics/bti593
 7.
Keles S, Laan MJ, Dudoit S, Cawley SE: Multiple testing methods for ChIPchip high density oligonucleotide array data. Journal of Computational Biology 2006, 13(3):579–613. 10.1089/cmb.2006.13.579
 8.
Li W, Meyer CA, Liu XS: A hidden markov model for analayzing ChIPchip experiments on genome tiling arrays and its application to p53 binding sequences. Bioinformatics 2005, 21(Suppl 1):274–282. 10.1093/bioinformatics/bti1046
 9.
Munch K, Gardner PP, Arctander P, Krogh A: A hidden markov model approach for determining expression from genomic tiling microarrays. BMC Bioinformatics 2006, 7: 239. 10.1186/147121057239
 10.
Humburg P, Bulger D, Stone G: Parameter estimation for robust HMM analysis of ChIPchip data. BMC Bioinformatics 2008, 9: 343. 10.1186/147121059343
 11.
Qi Y, Rolfe A, MacIsaac KD, Gerber GK, Pokholok D, Zeitlinger J, Danford T, Dowell RD, Fraenkel E, Jaakkola TS, Young RA, Gifford DK: HighResolution computational models of genome binding events. Nature Biotechnology 2006, 24(8):963–970. 10.1038/nbt1233
 12.
Keles S: Mixture modeling for genomewide localization of transcription factors. Biometrics 2007, 63: 10–21. 10.1111/j.15410420.2005.00659.x
 13.
Gottardo R, Li W, Johnson WE, Liu XS: A flexible and powerful bayesian hierarchical model for ChIPchip experiments. Biometrics 2008, 64: 468–478. 10.1111/j.15410420.2007.00899.x
 14.
Zheng M, Barrera LO, Ren B, Wu YN: ChIPchip: data, model, and analysis. Biometrics 2007, 63(3):787–796. 10.1111/j.15410420.2007.00768.x
 15.
Huber W, Toedling J, Steinmetz LM: Transcript mapping with highdensity oligonucleotide tiling arrays. Bioinformatics 2006, 22(16):1963–1970. 10.1093/bioinformatics/btl289
 16.
Reiss DJ, Facciotti MT, Baliga NS: Modelbased deconvolution of genomewide DNA binding. Bioinformatics 2008, 24(3):396–403. 10.1093/bioinformatics/btm592
 17.
Rabiner LR: A tutorial on hidden markov models and selected applications in speech recognition. Proceedings of the IEEE 1989, 77: 257–286. 10.1109/5.18626
 18.
Besag J, Kooperberg C: On conditional and intrinsic autoregrssions. Biometrika 1995, 82: 733–746.
 19.
OpgenRhein R, Strimmer K: Accurate ranking of differentially expressed genes by a distributionfree shrinkage approach. Statistical Applications in Genetics and Molecular Biology 2007., 6: Article 9 Article 9
 20.
Phillips DB, Smith AFM: Bayesian model comparison via jump diffusions. In Markov chain Monte Carlo in Practice. Edited by: Gilks WR, Richardson S, Spiegelhalter DJ. Chapman and Hall; 1996:215–239.
 21.
Denison DG, Mallick BK, Smith AFM: Automatic Bayesian curve fitting. J Royal Statist Soc B 1998, 60: 333–350. 10.1111/14679868.00128
 22.
Liang F, Wong WH: Evolutionary Monte Carlo sampling: applications to C_{ p }model sampling and changepoint problem. Statistica Sinica 2000, 10: 317–342.
 23.
Liang F: Improving SAMC using smoothing methods: theory and applications to bayesian model selection problems. The Annals of Statistics 2009, 37: 2626–2654. 10.1214/07AOS577
 24.
Müller P: A generic approach to posterior integration and gibbs sampling. In Technical Report. Volume 09. Purdue University, West Lafayette, Indiana; 1991.
 25.
Efron B: Largescale simultaneous hypothesis testing:the choice of a null hypothesis. Journal of the American Statistical Association 2004, 99: 96–104. 10.1198/016214504000000089
 26.
Liang F, Zhang J: Estimation the false discovery rate using the stochastic approximation algorithm. Biometrika 2008, 95(4):961–977. 10.1093/biomet/asn036
 27.
Carroll JS, Liu XS, Brodsky AS, Li W, Meyer CA, Szary AJ, Shao W, Hestermann EV, Geistlinger TR, Fox EA, Silver PA, Brown M: Chromosomewide mapping of estrogen receptor binding reveals longrange regulation requiring the forkhead protein foxal. Cell 2005, 122: 33–43. 10.1016/j.cell.2005.05.008
 28.
BAC software[http://www.bioconductor.org/packages/2.2/bioc]
 29.
Gelman A, Rubin DB: Inference from iterative simulation using multiple sequences (with discussion). Statistical Science 1992, 7: 457–511. 10.1214/ss/1177011136
 30.
BAC and tileHMM software[http://cran.rproject.org/web/packages]
 31.
Rand WM: Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association 1971, 66: 846–850. 10.2307/2284239
 32.
Hubert L, Arabie P: Comparing partitions. Journal of Classification 1985, 2: 193–218. 10.1007/BF01908075
 33.
Bolstad BM, Irizarry RA, Astrand M, Speed TP: A comparison of normalization methods for high density ologonucleotide array data based on variance and bias. Bioinformatics 2003, 19: 185–193. 10.1093/bioinformatics/19.2.185
 34.
Bradley A: The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognition 1997, 30: 1145–1159. 10.1016/S00313203(96)001422
 35.
R package: LatentChIP[http://www.stat.tamu.edu/~fliang]
Acknowledgements
Liang's research was supported in part by the grant (DMS0607755) made by the National Science Foundation and the award (KUSC101604) made by King Abdullah University of Science and Technology (KAUST). Tian's research is supported in part by ESO9859. The authors thank the editor, the associate editor, and the referee for their comments which have led to significant improvement of this paper.
Author information
Affiliations
Corresponding author
Additional information
Authors' contributions
MW developed the R software package, analyzed the data, and drafted the paper. FL conceived this work, developed the model, and drafted the paper. YT participated in the discussion of ChIPchip technology. All authors read and approved the final manuscript.
Electronic supplementary material
Convergence study of Bayesian latent model
Additional file 1: . This file provides a convergence study of Bayesian latent model for the ER dataset. (PDF 45 KB)
Comparison results for simulated data
Additional file 2: . This file provides a comparison study for BAC, tileHMM and the latent model on simulated data. (PDF 228 KB)
Authors’ original submitted files for images
Below are the links to the authors’ original submitted files for images.
Rights and permissions
About this article
Cite this article
Wu, M., Liang, F. & Tian, Y. Bayesian modeling of ChIPchip data using latent variables. BMC Bioinformatics 10, 352 (2009). https://doi.org/10.1186/1471210510352
Received:
Accepted:
Published:
Keywords
 Posterior Distribution
 Outlying Probe
 Probe Intensity
 Bayesian Hierarchical Model
 Adjusted Rand Index