Starr: Simple Tiling ARRay analysis of Affymetrix ChIP-chip data
© Zacher et al; licensee BioMed Central Ltd. 2010
Received: 6 October 2009
Accepted: 17 April 2010
Published: 17 April 2010
Chromatin immunoprecipitation combined with DNA microarrays (ChIP-chip) is an assay used for investigating DNA-protein-binding or post-translational chromatin/histone modifications. As with all high-throughput technologies, it requires thorough bioinformatic processing of the data for which there is no standard yet. The primary goal is to reliably identify and localize genomic regions that bind a specific protein. Further investigation compares binding profiles of functionally related proteins, or binding profiles of the same proteins in different genetic backgrounds or experimental conditions. Ultimately, the goal is to gain a mechanistic understanding of the effects of DNA binding events on gene expression.
We present a free, open-source R/Bioconductor package Starr that facilitates comparative analysis of ChIP-chip data across experiments and across different microarray platforms. The package provides functions for data import, quality assessment, data visualization and exploration. Starr includes high-level analysis tools such as the alignment of ChIP signals along annotated features, correlation analysis of ChIP signals with complementary genomic data, peak-finding and comparative display of multiple clusters of binding profiles. It uses standard Bioconductor classes for maximum compatibility with other software. Moreover, Starr automatically updates microarray probe annotation files by a highly efficient remapping of microarray probe sequences to an arbitrary genome.
Starr is an R package that covers the complete ChIP-chip workflow from data processing to binding pattern detection. It focuses on high-level data analysis; for example, it provides methods for the integration and combined statistical analysis of binding profiles and complementary functional genomics data. Starr enables systematic assessment of binding behaviour for groups of genes that are aligned along arbitrary genomic features.
Chromatin immunoprecipitation on chip (ChIP-chip) is a technique for identifying protein-DNA interactions. For this purpose, the chromatin is bound to the protein of interest, then trimmed to yield a protein-bound fraction of DNA. The protein-bound fraction of DNA is then immunoprecipitated with a protein-specific antibody and hybridized to tiling microarrays. The complex experimental procedure and the high dimensionality of the output data require thorough bioinformatic analyses that assess the quality of the experiments and ensure the reliability of the results [2, 3]. The practical need for a ChIP-chip analysis tool has led to the development of both GUI-based and command line-oriented software (see [4, 5], and [6, 7], respectively). We favor the command line solution, which has been realized in our software, because virtually every ChIP-chip experiment requires flexible adaptations to its individual design as well as customized methods to test the hypotheses under investigation.
We present the open-source software package Starr, which is available as part of the open-source Bioconductor project. It is an extension package for the programming language and statistical environment R. Starr facilitates analysis of ChIP-chip data, with particular but not exclusive support of the Affymetrix™ microarray platform. Its functionality comprises remapping of probe sequences to the genome, data import, quality assessment, and visual data exploration. Starr provides new high-level analysis tools, e.g., the alignment of ChIP signals along annotated gene features, and combined analysis of the ChIP signals and complementary gene expression measurements. It uses the standard microarray data structures of Bioconductor, thus building on and fully exploiting the package Ringo. The sequence mapping algorithm and some functions for peak finding are implemented in C to increase computation speed. The mapping of the probes to their positions on the genome is stored in an object of the Bioconductor class probeAnno. Intensity measurements from the ChIP experiments are stored in an ExpressionSet object, which makes the results of Starr accessible to all other R packages that operate on these common classes.
Time for remapping of Affymetrix reporter sequences to a genome
Array | Time | Number of reporter sequences | Genome size (bp)
S. cerevisiae Tiling 1.0R | | 2 697 594 | 12 495 682
Drosophila Tiling 2.0R | 1 min 16 s | 2 907 359 | 122 653 977
Human Promoter 1.0R | 14 min 22 s | 4 315 643 | 3.3 × 10^9
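To make the remapping task concrete, the following is a minimal sketch of exact-match remapping of reporter sequences to a reference sequence on both strands. It is not Starr's C implementation: it swaps in a plain hash lookup over sliding windows, and it assumes that all reporters have the same length. The function names and the toy sequences are hypothetical.

```python
from collections import defaultdict

# Translation table for complementing DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def remap_probes(probes, genome):
    """Map each probe to all exact matches in `genome`.

    Returns {probe: [(start, strand), ...]} with 0-based start positions.
    A "-" hit means the reverse complement of the probe occurs at `start`.
    Assumption (ours): all probes share one length.
    """
    if not probes:
        return {}
    length = len(probes[0])
    if any(len(p) != length for p in probes):
        raise ValueError("all probes must have the same length")

    # Index every probe under its own sequence and its reverse complement.
    lookup = defaultdict(list)
    for p in probes:
        lookup[p].append((p, "+"))
        lookup[reverse_complement(p)].append((p, "-"))

    hits = {p: [] for p in probes}
    # Slide a window of probe length over the genome and look it up.
    for start in range(len(genome) - length + 1):
        window = genome[start:start + length]
        for probe, strand in lookup.get(window, ()):
            hits[probe].append((start, strand))
    return hits


# Toy data, for illustration only.
genome = "GATTACAGGTAATC"
probes = ["GATTA", "ACAGG"]
hits = remap_probes(probes, genome)
```

Reporters with zero or multiple hits can then be identified directly from the lengths of the hit lists, which is the kind of information an updated probe annotation needs.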
We facilitated data import as much as possible, since in our experience, this is a major obstacle to the widespread use of R packages in the field of ChIP-chip analysis. Data import from the microarray manufacturers Nimblegen and Agilent has already been implemented in Ringo; the Affymetrix array platform is covered by Starr. There are two kinds of files that must be known to Starr: the .bpmap file, which contains the mapping of the reporter sequences to their physical position on the array, and the .cel files, which contain the actual measurement values. All data, regardless of platform, are stored in the common Bioconductor object ExpressionSet, which makes them accessible to a number of R packages operating on that data structure. The built-in import procedure of Starr furthermore automatically creates R objects containing additional annotation (probeAnno, phenoData, sequence information), which is indispensable for our purposes. There exist alternative import functions, e.g., in the packages AffyTiling, oligo or rMAT, but these do not extract all the information we need, and often they use a different format. Genomic annotation can either be read directly from a gff file or obtained via the biomaRt package.
It would be desirable to discuss the structure of cel and gff files and of the ExpressionSet/probeAnno classes at greater length, but this is beyond the scope of this paper. We refer to the vignette of the Starr package, which addresses these more technical aspects in detail.
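Conceptually, the import step combines two tables keyed by the physical position of a reporter on the array. The sketch below illustrates that join with plain dictionaries. It does not parse real .bpmap or .cel files and does not reproduce Starr's import functions; the function name, the dictionary layout and the toy values are assumptions made for illustration.

```python
def join_annotation_and_intensity(probe_positions, intensities):
    """Combine probe annotation with measured intensities.

    probe_positions: {(x, y): (chromosome, genomic_start)}, as one might
        extract from an annotation file (hypothetical layout).
    intensities: {(x, y): signal}, as one might extract from a
        measurement file (hypothetical layout).
    Returns a list of (chromosome, genomic_start, signal) tuples sorted by
    genomic position. Array features without annotation are dropped.
    """
    rows = []
    for xy, (chrom, start) in probe_positions.items():
        if xy in intensities:
            rows.append((chrom, start, intensities[xy]))
    rows.sort()
    return rows


# Toy data, for illustration only.
probe_positions = {(0, 0): ("chr1", 100), (0, 1): ("chr1", 50)}
intensities = {(0, 0): 7.5, (0, 1): 6.0, (1, 1): 9.9}
table = join_annotation_and_intensity(probe_positions, intensities)
```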
We demonstrate the utility of Starr by applying it to a yeast RNA-Polymerase II (PolII for short) ChIP experiment. One of the most prominent purposes of ChIP experiments is the identification and localization of peaked binding events on the genome. Although, by virtue of compatibility, we can draw on the facilities of other peak detection algorithms like Ringo, ACME or BAC, we implemented a novel algorithm, CMARRT, which was developed by P.F. Kuan and performs well in practice. For further postprocessing of ChIP-enriched regions, we suggest the R package ChIPpeakAnno.
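For readers unfamiliar with peak detection on tiling arrays, the sketch below shows a deliberately simplistic stand-in: smooth the signal with a moving average over neighbouring probes and report runs of consecutive probes whose smoothed value reaches a fixed threshold. This is not CMARRT and does not model the correlation between probes; the function names, the window defined in probe indices, the threshold and the toy data are all assumptions.

```python
def moving_average(values, half_window):
    """Average each value with up to `half_window` neighbours on each side."""
    n = len(values)
    smoothed = []
    for i in range(n):
        lo = max(0, i - half_window)
        hi = min(n, i + half_window + 1)
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed


def call_regions(positions, values, half_window=1, threshold=1.0):
    """Return (start, end) genomic intervals of enriched probe runs.

    positions: genomic positions of the probes, in ascending order.
    values: enrichment signal per probe (e.g. log ratio).
    """
    smoothed = moving_average(values, half_window)
    regions = []
    start = end = None
    for pos, signal in zip(positions, smoothed):
        if signal >= threshold:
            if start is None:
                start = pos      # open a new region
            end = pos            # extend the current region
        elif start is not None:
            regions.append((start, end))
            start = None         # close the region
    if start is not None:        # region that runs to the last probe
        regions.append((start, end))
    return regions


# Toy data, for illustration only.
positions = [0, 10, 20, 30, 40, 50, 60, 70]
values = [0.1, 0.2, 2.0, 2.4, 2.2, 0.3, 0.1, 0.0]
regions = call_regions(positions, values, half_window=1, threshold=1.0)
```

A fixed threshold is the weakest part of this sketch; dedicated methods derive significance from the distribution of the smoothed statistic instead.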
Starr provides functions for the visualization of a set of "profiles" (e.g. time series, or signal levels) along genomic positions. Our profileplot function relates to the conventional mean value plot like a box plot relates to an individual sample mean: Let the profiles be given as the rows of a samples × positions matrix that contains the respective signal of a sample at a given position. Instead of plotting a line for each profile (row of the matrix), the q-quantiles for each position (column of the matrix) are calculated, where q runs through a set of representative quantiles. Then for each q, the profile line of the q-quantiles is plotted. Color coding of the quantile profiles further aids the interpretation of the plot.
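The computation behind such a quantile profile can be sketched in a few lines. The code below takes a samples × positions matrix and returns one profile line per quantile, computed column by column; plotting and color coding are omitted. It is an illustration of the idea rather than the code of profileplot, and the function names, the interpolation rule and the default quantile set are assumptions.

```python
def quantile(sorted_values, q):
    """Quantile of an ascending list by linear interpolation, 0 <= q <= 1."""
    if not sorted_values:
        raise ValueError("need at least one value")
    pos = q * (len(sorted_values) - 1)
    lower = int(pos)
    upper = min(lower + 1, len(sorted_values) - 1)
    frac = pos - lower
    return sorted_values[lower] * (1 - frac) + sorted_values[upper] * frac


def quantile_profiles(matrix, quantiles=(0.05, 0.25, 0.5, 0.75, 0.95)):
    """Compute one profile line per quantile.

    matrix: list of rows, one row per profile (sample), one column per
        position.
    Returns {q: [q-quantile at position 0, q-quantile at position 1, ...]}.
    """
    # zip(*matrix) iterates over columns, i.e. over positions.
    columns = [sorted(column) for column in zip(*matrix)]
    return {q: [quantile(col, q) for col in columns] for q in quantiles}


# Toy data: five profiles measured at two positions.
matrix = [[1, 10], [2, 20], [3, 30], [4, 40], [5, 50]]
profiles = quantile_profiles(matrix, quantiles=(0.0, 0.5, 1.0))
```

Each entry of `profiles` corresponds to one line in the plot, so the median line plays the role of the conventional mean value plot while the outer quantiles show the spread across profiles.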
Apart from covering the standard processes of data acquisition and preprocessing, Starr is a Bioconductor package that offers a range of novel high-level tools that greatly enhance the exploration of ChIP-chip experiments. These include functions for peak finding, summary visualization of gene groups, and correlation analysis with expression data. On the side of the low-level analysis, we implemented a convenient probe remapping algorithm that helps to keep annotation standards high. By relying on standard Bioconductor object classes, Starr can easily interface with other Bioconductor packages. It therefore makes the full functionality of Ringo available for the analysis of Affymetrix tiling arrays. All functions and methods in the Starr package are documented in help pages and in a vignette, which also contains a sample workflow in R. Altogether, Starr constitutes a powerful and comprehensive tool for tiling array analysis across established one- and two-color technologies like Affymetrix, Agilent and Nimblegen.
Availability and requirements
The R package Starr is available from the Bioconductor web site at http://www.bioconductor.org and runs on Linux, Mac OS and MS-Windows. It requires an installed version of R (version >= 2.10.0), which is freely available from the Comprehensive R Archive Network (CRAN) at http://cran.r-project.org, and other Bioconductor packages, namely Ringo, affy, affxparser, and vsn, plus the CRAN packages pspline and MASS. The easiest way to obtain the most recent version of the software, with all its dependencies, is to follow the instructions at http://www.bioconductor.org/download. Support is provided by the Bioconductor mailing list and the package maintainer. Starr is distributed under the terms of the Artistic License 2.0. An R script reproducing all results of this paper, together with the data files, can be found in the supplements as Additional file 1, and on the website http://www.lmb.uni-muenchen.de/tresch/starr.html. ChIP-chip data of yeast PolII binding was published by Venters and Pugh in 2009 and is available on ArrayExpress under the accession number E-MEXP-1676. The gene expression data used here is available under accession number E-MEXP-2123. Transcription start and termination sites were obtained from David et al.
We thank Michael Lidschreiber, Andreas Mayer, Matthias Siebert, Johannes Soeding and Kemal Akman for useful comments on the package, Joern Toedling for help on Ringo, and Anna Ratcliffe for proofreading. This work is supported by the 'Sonderforschungsbereich' SFB646.
- Ren B, Robert F, Wyrick JJ, Aparicio O, Jennings EG, Simon I, Zeitlinger J, Schreiber J, Hannett N, Kanin E, Volkert TL, Wilson CJ, Bell SP, Young RA: Genome-wide location and function of DNA binding proteins. Science 2000, 290(5500):2306–2309. 10.1126/science.290.5500.2306
- Royce TE, Rozowsky JS, Gerstein MB: Assessing the need for sequence-based normalization in tiling microarray experiments. Bioinformatics 2007, 23(8):988–997. 10.1093/bioinformatics/btm052
- Zeller G, Henz S, Laubinger S, Weigel D, Raetsch G: Transcript Normalization and Segmentation of Tiling Array Data. Pacific Symposium on Biocomputing 2008, 13: 527–538.
- Ji H, Jiang H, Ma W, Johnson DS, Myers RM, Wong WH: An integrated software system for analyzing ChIP-chip and ChIP-seq data. Nat Biotechnol 2008, 26(11):1293–1300. 10.1038/nbt.1505
- Benoukraf T, Cauchy P, Fenouil R, Jeanniard A, Koch F, Jaeger S, Thieffry D, Imbert J, Andrau JC, Spicuglia S, Ferrier P: CoCAS: a ChIP-on-chip analysis suite. Bioinformatics 2009, 25(7):954–955. 10.1093/bioinformatics/btp075
- Toedling J, Skylar O, Krueger T, Fischer JJ, Sperling S, Huber W: Ringo-an R/Bioconductor package for analyzing ChIP-chip readouts. BMC Bioinformatics 2007, 8: 221. 10.1186/1471-2105-8-221
- He K, Li X, Zhou J, Deng XW, Zhao H, Luo J: NTAP: for NimbleGen tiling array ChIP-chip data analysis. Bioinformatics 2009, 25: 1838–1840. 10.1093/bioinformatics/btp320
- Gentleman RC, Carey VJ, Bates DM, Bolstad B, Dettling M, Dudoit S, Ellis B, Gautier L, Ge Y, Gentry J, Hornik K, Hothorn T, Huber W, Iacus S, Irizarry R, Leisch F, Li C, Maechler M, Rossini AJ, Sawitzki G, Smith C, Smyth G, Tierney L, Yang JY, Zhang J: Bioconductor: open software development for computational biology and bioinformatics. Genome Biol 2004, 5(10):R80. 10.1186/gb-2004-5-10-r80
- Ihaka R, Gentleman R: R: a language for data analysis and graphics. Journal of Computational and Graphical Statistics 1996, 5: 299–314. 10.2307/1390807
- Li W, Carroll JS, Brown M, Liu S: xMAN: extreme MApping of OligoNucleotides. BMC Genomics 2008, 9(Suppl 1):S20. 10.1186/1471-2164-9-S1-S20
- Kurtz S, Phillippy A, Delcher AL, Smoot M, Shumway M, Antonescu C, Salzberg SL: Versatile and open software for comparing large genomes. Genome Biol 2004, 5(2):R12. 10.1186/gb-2004-5-2-r12
- Aho AV, Corasick MJ: Efficient string matching: an aid to bibliographic search. Communications of the ACM 1975, 18(6):333–340. 10.1145/360825.360855
- Droit A, Cheung C, Gottardo R: rMAT-an R/Bioconductor package for analyzing ChIP-chip experiments. Bioinformatics 2010, 26(5):678–679. 10.1093/bioinformatics/btq023
- Durinck S, Moreau Y, Kasprzyk A, Davis S, De Moor B, Brazma A, Huber W: BioMart and Bioconductor: a powerful link between biological databases and microarray data analysis. Bioinformatics 2005, 21(16):3439–3440. 10.1093/bioinformatics/bti525
- Johnson WE, Li W, Meyer CA, Gottardo R, Carroll JS, Brown M, Liu XS: Model-based analysis of tiling-arrays for ChIP-chip. Proc Natl Acad Sci USA 2006, 103(33):12457–12462. 10.1073/pnas.0601180103
- Buck MJ, Lieb JD: ChIP-chip: considerations for the design, analysis, and application of genome-wide chromatin immunoprecipitation experiments. Genomics 2004, 83(3):349–360. 10.1016/j.ygeno.2003.11.004
- Chung HR, Vingron M: Comparison of sequence-dependent tiling array normalization approaches. BMC Bioinformatics 2009, 10: 204. 10.1186/1471-2105-10-204
- Siebert M, Lidschreiber M, Hartmann H, Soeding J: A Guideline for ChIP-Chip Data Quality Control and Normalization (PROT 47). Tech. rep., Gene Center Munich, Ludwig-Maximilians-Universitaet 2009. [http://www.epigenome-noe.net/researchtools/protocol.php?protid=47]
- Judy JT, Ji H: TileProbe: modeling tiling array probe effects using publicly available data. Bioinformatics 2009, 25: 2369–2375. 10.1093/bioinformatics/btp425
- Toedling J, Huber W: Analyzing ChIP-chip data using Bioconductor. PLoS Computational Biology 2008, 4(11). 10.1371/journal.pcbi.1000227
- Bourgon R: Chromatin-immunoprecipitation and high-density tiling microarrays: a generative model, methods for analysis, and methodology assessment in the absence of a gold standard. PhD thesis. University of California Berkeley, Berkeley, California, United States of America; 2006.
- Scacheri PC, Crawford GE, Davis S: Statistics for ChIP-chip and DNase hypersensitivity experiments on NimbleGen arrays. Methods Enzymol 2006, 411: 270–282. 10.1016/S0076-6879(06)11014-9
- Gottardo R, Li W, Johnson WE, Liu XS: A flexible and powerful bayesian hierarchical model for ChIP-Chip experiments. Biometrics 2008, 64(2):468–478. 10.1111/j.1541-0420.2007.00899.x
- Kuan PF, Chun H, Keles S: CMARRT: A tool for the analysis of ChIP-chip data from tiling arrays by incorporating the correlation structure. Proc. Pacific Symposium of Biocomputing 2008, (13):515–526.
- Dengl S, Mayer A, Sun M, Cramer P: Structure and in vivo requirement of the yeast Spt6 SH2 domain. J Mol Biol 2009, 389: 211–225. 10.1016/j.jmb.2009.04.016
- David L, Huber W, Granovskaia M, Toedling J, Palm CJ, Bofkin L, Jones T, Davis RW, Steinmetz LM: A high-resolution map of transcription in the yeast genome. Proc Natl Acad Sci USA 2006, 103(14):5320–5325. 10.1073/pnas.0601091103
- Venters BJ, Pugh BF: A canonical promoter organization of the transcription machinery and its regulators in the Saccharomyces genome. Genome Res 2009, 19(3):360–371. 10.1101/gr.084970.108
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.