MultiChIPmixHMM: an R package for ChIP-chip data analysis modeling spatial dependencies and multiple replicates
 Caroline Bérard^{1, 2, 6},
 Michael Seifert^{8, 9},
 Tristan Mary-Huard^{1, 2, 7} and
 Marie-Laure Martin-Magniette^{1, 2, 3, 4, 5}
DOI: 10.1186/1471-2105-14-271
© Bérard et al.; licensee BioMed Central Ltd. 2013
Received: 12 March 2013
Accepted: 21 August 2013
Published: 9 September 2013
Abstract
Background
Chromatin immunoprecipitation coupled with hybridization to a tiling array (ChIP-chip) is a cost-effective and routinely used method to identify protein-DNA interactions or chromatin/histone modifications. The robust identification of ChIP-enriched regions is frequently complicated by noisy measurements. This identification can be improved by accounting for dependencies between adjacent probes on chromosomes and by modeling biological replicates.
Results
MultiChIPmixHMM is a user-friendly R package for analysing ChIP-chip data. It models spatial dependencies between directly adjacent probes on a chromosome and enables a simultaneous analysis of replicates. It is based on a linear regression mixture model, designed to perform a joint modeling of immunoprecipitated and input measurements.
Conclusion
We show the utility of MultiChIPmixHMM by analyzing histone modifications of Arabidopsis thaliana. MultiChIPmixHMM is implemented in R, includes functions written in C, and is freely available from the CRAN web site: http://cran.r-project.org.
Background
Chromatin immunoprecipitation coupled with hybridization to a tiling array (ChIP-chip) is a cost-effective and routinely used method for identifying target genes of transcription factors, for analyzing histone modifications or for studying the methylome on a genome-wide scale [1]. In a ChIP-chip experiment, a chromatin immunoprecipitation sample (IP) is compared against a reference sample of genomic DNA (Input). In recent years, different methods for the identification of ChIP-enriched regions have been developed. Among them, [2] proposed a linear regression mixture model named ChIPmix, designed to perform a joint modeling of IP and Input measurements. This two-component mixture model discriminates the population of enriched probes from non-enriched ones. Over the last years, ChIPmix has successfully been applied to the identification of methylated gene promoters, histone modifications or transcription factor target genes (e.g. [3-7]). However, ChIPmix has two important limitations: it does not model spatial dependencies between adjacent probes on chromosomes, and it does not handle the joint analysis of multiple biological replicates.
Here, we present MultiChIPmixHMM for ChIP-chip analyses, which models spatial dependencies and analyzes replicates simultaneously to further improve the identification of enriched probes. We demonstrate improved performance of MultiChIPmixHMM compared to ChIPmix for the target identification of the chromatin mark H3K27me3 of the model plant Arabidopsis thaliana.
Implementation
MultiChIPmixHMM is based on a two-state hidden Markov model (HMM). Each probe t has a hidden state $z_t$ (non-enriched or enriched) and, given this state, the IP signal is modeled as a Gaussian linear regression on the Input signal $x_{tr}$, with specific mean $a_{z_t r} + b_{z_t r} x_{tr}$ and variance $\sigma_r^2$ for each replicate r∈{1,…,R}. Dependencies between adjacent genomic probes t and t+1 are modeled by a first-order Markov chain, in which the next state $z_{t+1}$ depends on the predecessor state $z_t$. All parameters of the HMM are estimated using the Baum-Welch algorithm [8], a special case of the EM algorithm [9]. To obtain relevant initial values of the emission distribution parameters (slopes and intercepts of the regressions), we applied a Principal Component Analysis to each biological replicate and used the first axis to derive the intercept and slope of the regression. All initial transition parameters are set to 0.5, which reflects the typical case where no biological information is available. We observed on simulations that alternative choices for the transition matrix initialization lead to similar results (not shown).
Identification of enriched probes is based on conditional probabilities. A probe is declared enriched if its enriched conditional probability (the state-posterior probability of the enriched state) is higher than 1−α, where α is chosen by the user. This strategy has been shown to control the proportion of misclassifications in mixture models [10].
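The classification rule can be illustrated with a minimal Python sketch. It is not the package's R/C code: it assumes the HMM parameters are already known (the Baum-Welch estimation step is omitted), it works in plain probability space for brevity, and all function and argument names are hypothetical. It computes the state-posterior probability of the enriched state with a scaled forward-backward pass, using the product over replicates of Gaussian regression densities as emission, and then applies the 1−α threshold.

```python
import math


def gaussian_pdf(y, mean, var):
    """Density of N(mean, var) evaluated at y."""
    return math.exp(-(y - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def emission(x_t, y_t, a, b, var, state):
    """Product over replicates r of N(y_tr; a[state][r] + b[state][r] * x_tr, var[r])."""
    p = 1.0
    for r in range(len(var)):
        p *= gaussian_pdf(y_t[r], a[state][r] + b[state][r] * x_t[r], var[r])
    return p


def enriched_posteriors(x, y, a, b, var, trans, init):
    """Return P(z_t = enriched | all probes) for each probe t.

    x[t][r], y[t][r]: Input and IP signals of probe t in replicate r.
    State 0 is non-enriched, state 1 is enriched (labeling is an assumption).
    """
    n = len(x)
    e = [[emission(x[t], y[t], a, b, var, k) for k in (0, 1)] for t in range(n)]

    # Forward pass, rescaled at each probe so the values sum to one.
    fwd, prev = [], None
    for t in range(n):
        if t == 0:
            f = [init[k] * e[0][k] for k in (0, 1)]
        else:
            f = [e[t][k] * sum(prev[j] * trans[j][k] for j in (0, 1)) for k in (0, 1)]
        s = sum(f)
        prev = [v / s for v in f]
        fwd.append(prev)

    # Backward pass, also rescaled; the scaling cancels in the posterior.
    bwd = [[1.0, 1.0] for _ in range(n)]
    for t in range(n - 2, -1, -1):
        v = [sum(trans[k][j] * e[t + 1][j] * bwd[t + 1][j] for j in (0, 1)) for k in (0, 1)]
        s = sum(v)
        bwd[t] = [w / s for w in v]

    post = []
    for t in range(n):
        g0, g1 = fwd[t][0] * bwd[t][0], fwd[t][1] * bwd[t][1]
        post.append(g1 / (g0 + g1))
    return post


def call_enriched(post, alpha):
    """Declare a probe enriched when its posterior exceeds 1 - alpha."""
    return [p > 1.0 - alpha for p in post]


# Toy usage with made-up parameters and two replicates.
a = [[0.0, 0.0], [2.0, 2.0]]        # intercepts a[state][replicate]
b = [[0.5, 0.5], [1.0, 1.0]]        # slopes b[state][replicate]
var = [0.5, 0.5]                    # one variance per replicate
trans = [[0.97, 0.03], [0.1, 0.9]]
x = [[10.0, 10.0]] * 6
y = [[5.0, 5.0]] * 3 + [[12.0, 12.0]] * 3
post = enriched_posteriors(x, y, a, b, var, trans, init=[0.5, 0.5])
print(call_enriched(post, alpha=0.01))
```

A production implementation would typically work with log-probabilities to avoid numerical underflow on long chromosomes.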
Results and discussion
Simulations
In this section, we first compare ChIPmix, MultiChIPmixHMM and TileHMM [11], which is a method based on an HMM to analyze the log-ratios (IP over Input). TileHMM can also handle multiple replicates. We simulated data according to a two-state HMM with state-specific Gaussian emission distributions modeling immunoprecipitated signals as a linear regression of reference input signals. We considered two test scenarios: (i) well-separated non-enriched and enriched probes (slope parameters 0.6 and 0.99) and (ii) overlapping populations of non-enriched and enriched probes (slope parameters 0.5 and 0.65). Two biological replicates are simulated for each scenario. The transition matrix is set to $\begin{pmatrix} 0.97 & 0.03 \\ 0.1 & 0.9 \end{pmatrix}$ and the variances are set to 0.7 for the first replicate and 0.75 for the second. We used the corresponding method-specific conditional probabilities for probes to be enriched to display ROC curves. For ChIPmix, which returns a set of probe conditional probabilities per replicate, we summarized the results by taking either the minimal or the maximal conditional probability over the two replicates.
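A simulation of this kind can be sketched in a few lines of Python. The transition matrix, slopes (scenario 1) and variances are taken from the text. Everything else is an assumption made for illustration: the intercepts (set to 0), the distribution of the Input signal, the choice of the non-enriched state as starting state and as first row of the transition matrix, the use of the same slope in both replicates, and the function name `simulate_chip_hmm`.

```python
import math
import random


def simulate_chip_hmm(n_probes, trans, intercepts, slopes, variances, seed=0):
    """Simulate hidden states, Input signals and IP signals for several replicates.

    State 0 = non-enriched, state 1 = enriched (assumed labeling).
    Returns (states, inputs, ips) with inputs[t][r] and ips[t][r].
    """
    rng = random.Random(seed)
    n_rep = len(variances)
    states, inputs, ips = [], [], []
    z = 0  # assumption: the chain starts in the non-enriched state
    for t in range(n_probes):
        if t > 0:
            # First-order Markov chain: the next state depends only on the current one.
            z = 1 if rng.random() < trans[z][1] else 0
        # Assumption: Input signals are drawn from N(10, 1.5^2), independently per replicate.
        x_t = [rng.gauss(10.0, 1.5) for _ in range(n_rep)]
        # IP signal = state-specific linear regression on Input + replicate-specific noise.
        y_t = [
            intercepts[z] + slopes[z] * x_t[r] + rng.gauss(0.0, math.sqrt(variances[r]))
            for r in range(n_rep)
        ]
        states.append(z)
        inputs.append(x_t)
        ips.append(y_t)
    return states, inputs, ips


TRANS = [[0.97, 0.03], [0.1, 0.9]]
states, inputs, ips = simulate_chip_hmm(
    1000, TRANS, intercepts=[0.0, 0.0], slopes=[0.6, 0.99], variances=[0.7, 0.75], seed=1
)
print("fraction of enriched probes:", sum(states) / len(states))
```

Note that the text gives variances, so the noise standard deviation passed to the Gaussian sampler is the square root of each value.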
Comparison of ChIPmix, MultiChIPmixHMM and TileHMM after classification
Scenario 1, classification with α = 0.01

| Method | False positive rate | True positive rate |
|---|---|---|
| ChIPmix Union | 3.79e-04 | 0.32 |
| ChIPmix Intersection | 0 | 0.1 |
| MultiChIPmixHMM | 1.12e-04 | 0.42 |
| TileHMM | 0.13 | 0.83 |
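The rates reported in such a table can be computed from posterior probabilities and the simulated true states as in the following Python sketch. It assumes that "Union" means a probe is declared enriched in at least one replicate (maximal conditional probability above 1−α) and "Intersection" means it is declared enriched in both (minimal conditional probability above 1−α). The helper names and the toy numbers are hypothetical and are not the paper's data.

```python
def combine_replicates(post_by_rep, how):
    """Summarize per-replicate posteriors: 'union' takes the max, 'intersection' the min."""
    pick = max if how == "union" else min
    return [pick(values) for values in zip(*post_by_rep)]


def classification_rates(post, truth, alpha):
    """Return (false positive rate, true positive rate) at the 1 - alpha threshold.

    truth[t] is 1 for a truly enriched probe and 0 otherwise.
    """
    calls = [p > 1.0 - alpha for p in post]
    n_pos = sum(truth)
    n_neg = len(truth) - n_pos
    fp = sum(1 for c, z in zip(calls, truth) if c and z == 0)
    tp = sum(1 for c, z in zip(calls, truth) if c and z == 1)
    return fp / n_neg, tp / n_pos


# Toy example with four probes and two replicates.
truth = [0, 0, 1, 1]
rep1 = [0.2, 0.995, 0.999, 0.5]
rep2 = [0.1, 0.3, 0.995, 0.999]
for how in ("union", "intersection"):
    post = combine_replicates([rep1, rep2], how)
    print(how, classification_rates(post, truth, alpha=0.01))
```

As in the table, the intersection rule is the most conservative: it lowers the false positive rate at the cost of a lower true positive rate.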
Arabidopsis dataset analysis
To illustrate the benefit of using MultiChIPmixHMM compared to standard ChIPmix, we use a normalized ChIP-chip data set of the model plant Arabidopsis thaliana by [6] to compare the identification of genomic regions marked by histone H3 trimethylated at lysine 27 (H3K27me3). We applied both methods to analyze the two biological replicates and identified probes enriched in H3K27me3 using a stringent cutoff of 1−α=0.99. Since ChIPmix does not handle multiple replicates, both replicates were analyzed separately and only probes declared as enriched in both replicates were finally considered as enriched (considering probes declared enriched for at least one of the replicates leads to similar results).
To further validate these findings, we use known H3K27me3 target genes based on independent prior studies by [12] and by [13]. Among the 311 genes found by both studies, 298 were commonly identified by ChIPmix and MultiChIPmixHMM. Additionally, MultiChIPmixHMM identifies 11 genes exclusively, which have already been identified as target genes in at least one of the two studies. Importantly, this increase in detection power comes without additional computational time, because the main algorithm of MultiChIPmixHMM is implemented in C.
Conclusions
The R package MultiChIPmixHMM implements a linear regression mixture model to analyse ChIP-chip data. To provide a more accurate identification of enriched probes, it takes into account spatial dependencies between directly adjacent probes and enables a simultaneous analysis of replicates. The benefits of MultiChIPmixHMM have been shown by analyzing both simulated and real datasets, and by comparison with competing software.
Availability and requirements
MultiChIPmixHMM is publicly available as an R package from CRAN [14]. Two functions are implemented and refer to the models described above. To distinguish between the model and the function, the first letter of the function name is lower case: (i) multiChIPmixHMM for modeling spatial dependencies and multiple replicates and (ii) multiChIPmix to model multiple replicates ignoring spatial dependencies between probes. Both functions take as input a vector of filenames (one biological replicate per file), and produce as output a file containing the enriched conditional probability and status of each probe.

Project name: MultiChIPmixHMM

Project home page: http://cran.r-project.org/web/packages/MultiChIPmixHMM/index.html

Operating system(s): platform independent

Programming language: R and C

Other requirements: None

License: GNU GENERAL PUBLIC LICENSE

Any restrictions to use by non-academics: it is available for free download.
Declarations
Acknowledgements
This work was funded by MIA, BAP and MICA departments of INRA, and supported by the TAG ANR/Genoplante project. MS was supported by the DAAD PROCOPE (grant 50748812).
References
1. Buck M, Lieb J: ChIP-chip: considerations for the design, analysis, and application of genome-wide chromatin immunoprecipitation experiments. Genomics. 2004, 83 (3): 349-360. 10.1016/j.ygeno.2003.11.004.
2. Martin-Magniette M, Mary-Huard T, Bérard C, Robin S: ChIPmix: mixture model of regressions for two-color ChIP-chip analysis. Bioinformatics. 2008, 24: i181-i186. 10.1093/bioinformatics/btn280.
3. Kubo A, Suzuki N, et al: Genomic cis-regulatory networks in the early Ciona intestinalis embryo. Development. 2010, 137: 1613-1623. 10.1242/dev.046789.
4. Long T, Tsukagoshi H, et al: The bHLH transcription factor POPEYE regulates response to Iron deficiency in Arabidopsis roots. Plant Cell. 2010, 22: 2219-2236. 10.1105/tpc.110.074096.
5. Moghaddam A, Roudier F, Seifert M, Bérard C, et al: Additive inheritance of histone modifications in Arabidopsis thaliana intra-specific hybrids. Plant J. 2011, 67 (4): 691-700. 10.1111/j.1365-313X.2011.04628.x.
6. Roudier F, Ahmed I, Bérard C, Sarazin A, Mary-Huard T, et al: Integrative epigenomic mapping defines four main chromatin states in Arabidopsis. EMBO J. 2011, 30: 1928-1938. 10.1038/emboj.2011.103.
7. Seifert M, et al: MeDIP-HMM: Genome-wide identification of distinct DNA methylation states from high-density tiling arrays. Bioinformatics. 2012, 28 (22): 2930-2939. 10.1093/bioinformatics/bts562.
8. Rabiner L: A tutorial on hidden Markov models and selected applications in speech recognition. Proc IEEE. 1989, 77: 257-286. 10.1109/5.18626.
9. Dempster AP, Laird NM, Rubin DB: Maximum likelihood from incomplete data via the EM algorithm. J R Stat Soc, Series B. 1977, 39: 1-38.
10. Mary-Huard T, et al: Error rate control for classification rules in multiclass mixture models. Journées de la société française de statistique. SFDS Proceedings. 2013, Toulouse
11. Humburg P, Bulger D, Stone G: Parameter estimation for robust HMM analysis of ChIP-chip data. BMC Bioinformatics. 2008, 9: 343. 10.1186/1471-2105-9-343.
12. Turck F, et al: Arabidopsis tfl2/lhp1 specifically associates with genes marked by trimethylation of histone h3 lysine 27. PLoS Genet. 2007, 3 (6): e86. 10.1371/journal.pgen.0030086.
13. Zhang X, et al: Whole-genome analysis of Histone H3 Lysine 27 Trimethylation in Arabidopsis. PLoS Biol. 2007, 5 (5): e129. 10.1371/journal.pbio.0050129.
14. R Development Core Team: R: A language and environment for statistical computing. 2013, Vienna, Austria: R Foundation for Statistical Computing, ISBN 3-900051-07-0
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.