- Open Access
PeakAnalyzer: Genome-wide annotation of chromatin binding and modification loci
© Salmon-Divon et al; licensee BioMed Central Ltd. 2010
- Received: 14 April 2010
- Accepted: 6 August 2010
- Published: 6 August 2010
Functional genomic studies involving high-throughput sequencing and tiling array applications, such as ChIP-seq and ChIP-chip, generate large numbers of experimentally-derived signal peaks across the genome under study. In analyzing these loci to determine their potential regulatory functions, areas of signal enrichment must be considered relative to proximal genes and regulatory elements annotated throughout the target genome Regions of chromatin association by transcriptional regulators should be distinguished as individual binding sites in order to enhance downstream analyses, such as the identification of known and novel consensus motifs.
PeakAnalyzer is a set of high-performance utilities for the automated processing of experimentally-derived peak regions and annotation of genomic loci. The programs can accurately subdivide multimodal regions of signal enrichment into distinct subpeaks corresponding to binding sites or chromatin modifications, retrieve genomic sequences encompassing the computed subpeak summits, and identify positional features of interest such as intersection with exon/intron gene components, proximity to up- or downstream transcriptional start sites and cis-regulatory elements. The software can be configured to run either as a pipeline component for high-throughput analyses, or as a cross-platform desktop application with an intuitive user interface.
PeakAnalyzer comprises a number of utilities essential for ChIP-seq and ChIP-chip data analysis. High-performance implementations are provided for Unix pipeline integration along with a GUI version for interactive use. Source code in C++ and Java is provided, as are native binaries for Linux, Mac OS X and Windows systems.
- Transcriptional Start Site
- Peak Region
- Motif Discovery
- Motif Analysis
- Peak Caller
Next-generation sequencing technologies and tiling microarrays are frequently employed for genome-wide identification of regulatory elements and chromatin modifications. These applications generate vast numbers of experimental data points, which are compiled into extensive sets of genomic loci representing the units of biological activity measured in the particular assay. Researchers must then discern functionally-relevant results from these large-scale datasets, a process that poses significant bioinformatic challenges for research groups with limited computational support. For example, a common aim of transcription factor location analysis is to determine the relationship between ChIP-enriched loci and annotated genes; identifying the cis-regulatory elements occupied by the factor can reveal the set of genes it is likely to regulate across the genome. Correlating global transcription factor binding-site occupancy with target genes quickly becomes intractable in the absence of software tools to automate aspects of large-scale data analysis.
Sequence patterns occurring repeatedly among enriched loci are indicative of regulatory elements such as transcription factor-binding sites, and can often be identified by DNA motif analysis. Successful motif discovery relies on a set of candidate loci that exclude extraneous sequences while still containing the binding site consensus; however, since many peak-finding utilities merge overlapping areas of enrichment, the resulting peaks tend to be much larger than the actual binding sites. Peak regions often comprise more than one functional element (e.g. co-located transcription factor-binding sites or chromatin modifications), and these must be distinguished into individual loci in order to accurately interpret experimental results. The ability to subdivide composite peak regions into a finer-resolution set of individual binding sites (subpeaks) can improve the accuracy of sequence motif analysis.
Here we describe PeakAnalyzer, a set of standalone tools for the automated post-processing of large-scale chromatin profiling data. The programs are able to identify discrete enrichment peaks from loci corresponding to transcription factor binding or chromatin modification, retrieve individual peak sequences and annotate experimental data against various classes of functional elements, such as genes, CpG islands, regulatory features or DNase I hypersensitive sites. Results can also be compared across multiple datasets to report overlapping features, as well as those unique to a given experimental sample. The software is freely available and flexible in implementation, providing both high-performance solutions for pipeline integration and a GUI version for desktop users.
The function NDG locates the nearest downstream genes on both strands and calculates their distances. If the peak region intersects a gene, the program determines if the overlap is within an exon, intron, 5' UTR or 3' UTR. Multiple transcripts or genes overlapping a given location are all reported, providing a means to identify putative bi-directional promoters where the peak is proximal to genes on both strands. TSS locates the nearest transcriptional start site relative to each locus, scanning both downstream or upstream of the experimental peak to account for transcription initiation on either the sense or antisense strand. The ODS function calculates the overlap in positions/peaks between datasets, where peak loci intersecting by at least one nucleotide on either strand are reported. To compute a P-value of overlap enrichment, a random dataset is generated having peak lengths and chromosomal distribution matching the experimental dataset; the overlap between experimental and artificial loci is then determined, and through successive iterations a P-value representing the statistical significance of experimental signal over random is calculated.
Software distribution and input requirements
PeakAnalyzer is implemented as a unified Java program encompassing the software components described above. Equivalent versions of PeakSplitter and PeakAnnotator are also implemented in C++ and Java so that users can choose a distribution suited to their particular requirements. Core facilities processing numerous datasets have the option to incorporate the faster C++ version into a Unix pipeline, whereas the Java implementations can either be run as separate command-line utilities or as a single cross-platform desktop application using an intuitive graphical interface.
PeakAnalyzer requires only a single peak file and a feature annotation file in BED or GTF format; complete annotation files for the current builds of the human (HG19) and mouse (MM9) genomes are provided with the software distribution. The input files required by PeakSplitter are those commonly generated by peak-finding programs: a .bed-formatted peak file containing chromosome start and end locations of signal enrichment loci, and a .wig signal file describing the size and shape of each peak.
We adopted the peak-splitting approach proposed by Fejes et al.  and implemented as the function subpeaks in recent versions of their FindPeaks tool. The method identifies multiple peaks within a given locus and accurately subdivides those containing more than one site of signal enrichment. In addition to incorporating the algorithm into PeakAnalyzer we provide a standalone version as the PeakSplitter utility, thereby enabling its application to signal loci called by any such program (e.g., [2–9]). Local maxima are identified in the peak region by scanning for relative peak heights, where those of adjacent maxima are compared and the lowest value is multiplied by a user-adjustable parameter to arrive at the read depth required for subpeak division. Binding sites are most likely to appear at or near subpeak summit regions, and these sequences can be retrieved directly from the Ensembl database .
The PeakAnnotator component scans the target genome to identify and report functional elements proximal to peak loci. Rather than comparing each peak with all possible features, PeakAnnotator uses a combination of binary search and a modified version of the nested containment list (NCList) algorithm (see below and ) to rapidly identify proximal features among the full set of annotated elements. Proof of correctness of the algorithms described below and a discussion of their runtime complexity can be found in Additional file 1.
Generating a containment list
Determining the set of intersecting genomic regions across multiple experiments and data sources is not straightforward, because for a given dataset the regions queried may not be contiguous and some regions may be embedded within others. Thus, when sorting the regions by start position, the corresponding end positions could be out of sequence. This is more likely to be the case in higher eukaryotes where some loci encode overlapping genes.
The NCList algorithm constitutes a solution to this problem . In this method the set of genomic regions is partitioned into a primary category of positionally-independent loci, and all remaining loci are segregated into a second category. We adopted this approach in our algorithm, where for each gene in the list PeakAnnotator creates a sublist of all genes containing it. A pseudocode description of the process is listed in Additional file 1, Figure S1.
Finding proximal downstream genes
The NDG utility determines the most proximal non-overlapping downstream genes on both strands. If a gene intersects a signal peak it will be stored in a separate list of overlapping genes. For simplicity, we define here a gene that is transcribed from the forward strand pos_gene, and a gene transcribed from the reverse strand neg_gene. The algorithm works as follows: the first non-overlapping gene located 3' to the peak, G 3' , is found using a binary search strategy such that G 3'-start > Peak end . If G 3' is a pos_gene, it is the closest downsteam gene on the forward strand; if not, genes located downstream to G 3' are visited until a pos_gene is found.
Next, the first gene located upstream to G 3' that does not overlap with the current experimental peak is found, termed G 5' . If G 5' is a neg_gene it has the potential to be the closest downstream gene on the reverse strand. However, if G 5' is contained within another gene transcribed from the reverse strand, this gene is potentially closer to or even intersecting the current peak. Hence, the next step is to determine the closest neg_gene and overlapping genes in the set of G 5' and the gene(s) containing G 5' . If G 5' is a pos_gene, genes located upstream are visited until a neg_gene is found. Finally, the closest downstream neg_gene is searched within the set of that gene and those containing it.
Finding proximal transcription start sites
The TSS function works as follows: the first gene located downstream to the peak's central position, G 3' , is found using a binary search strategy, and its distance to the current peak is calculated. Genes located downsteam to G 3' are visited until a gene that starts downstream of the G 3' locus is found. The gene having the lowest distance from the signal peak is then marked as the closest downstream gene. Next, the first gene upstream to G 3' , termed G 5' , whose end position < G 5'-start (i.e., G 5' = G3'-1 ) is found. Its distance, and the distance of all genes that contain it, is calculated in order to find the nearest upstream gene. The one representing the minimal absolute distance to the peak among the set of proximal downstream and upstream genes will be reported.
Finding overlapping data sets
The ODS function operates on two sets of peaks, denoted here S 1 and S 2 , and iterates over all loci in S 1 to find those intersecting by at least one nucleotide with loci in S 2 . For each locus Ln in S 1 , the first non-overlapping peak from S 2 located 3' to L1, termed L2 3' , is found using a binary search strategy such that L2 3'-start > L1 end . The algorithm then searches upstream of L2 3' to determine if any peak intersects L1, until the first locus in S 2 , termed L2 5' , is found having coordinates outside the boundaries of L1. Peaks containing L2 5' can potentially overlap L1, and are also considered.
To illustrate typical applications of PeakAnalyzer, we analyzed the genome-wide binding profiles of a series of transcriptional regulators (Ctcf, E2f1, Esrrb, Klf4, c-Myc, n-Myc, Nanog, Oct4, Stat3, Smad1, Sox2, Suz12, Tcfcp2l1 and Zfx) in mouse embryonic stem (ES) cells, determined using the ChIP-seq method . We obtained the primary data from the NCBI GEO database (series GSE11431), mapped the sequencing reads to the mouse genome assembly using the Bowtie alignment program , and detected significant peaks of signal enrichment with MACS . Subsequent analyses were performed on the set of chromatin-binding regions from each of these re-processed ChIP-seq datasets.
Identification and subdivision of signal peaks
Characteristics of transcription factor-bound peaks and subpeaks
Genome-wide annotation of transcription factor binding sites
Binding sites identified from ChIP-based experiments are usually categorized relative to genomic features, such as the frequency of binding to promoters, enhancers, gene structures or unannotated intergenic regions. Of primary interest in determining transcription factor targets is the location of binding sites relative to known transcriptional start sites. The relationship between promoter occupancy and differential gene expression can often identify genes directly regulated by a factor, but can also provide insight into the mechanisms by which it mediates transcriptional activation or repression. For example, factors that bind close to transcriptional start sites have been proposed to promote gene expression by stabilizing the association of general transcription factors at the core promoter elements; factors that bind to distal regions, either upstream or downstream of a gene locus, may regulate transcription by mediating, through a chromatin looping mechanism, the protein-protein contacts between distal complexes and the general transcriptional machinery bound at the promoter.
From this analysis, it appears evident that the binding profile of c-Myc is comparable with that of n-Myc, and that of Nanog is similar to both Sox2 and Smad1. Over 50% of c-Myc and n-Myc binding sites are located within or very close to target genes (up to 1 Kb), whereas only 25% correspond to distal binding sites (farther than 10 Kb). In contrast, distal binding sites constitute the predominant fraction (70%) of Sox2, Nanog and Smad1 loci; fewer than 10% of binding sites are found within or in close proximity to genes, suggesting that these factors bind preferentially to remote enhancer elements.
Unsurprisingly, c-Myc and n-Myc exhibit similar peak profiles, as Myc family members share gene and protein structural features  and function through common pathways [16–18]. Moreover, when expressed from the c-Myc locus, n-Myc is regulated in a similar fashion and functionally complementary to c-Myc in the context of various cellular growth and differentiation processes . Although these two regulatory proteins display similar binding profiles, its not yet clear whether they share the same binding loci and regulate common target genes. To address this question we used PeakAnnotator's ODS utility to determine if c-Myc and n-Myc occupy the same binding loci in the ChIP-seq profiles examined. We found 7,039 (82%) of c-Myc binding sites to overlap those of n-Myc, with P-values < 0.001 compared to random peak locations. This observation indicates that, in the context of self-renewing ES cells, c-Myc and n-Myc are likely to participate in tandem to regulate the transcription of a large number of common target genes.
Identification of regulated target genes
We next sought to correlate the number of peaks and subpeaks found either in the promoter regions of genes (up to 2 kb upstream) or within gene loci, relative to their corresponding expression levels in mouse ES cells. For this analysis, we obtained relevant microarray data from the GNF SymAtlas database , where expression levels from C57BL/6 mice were measured on the Affymetrix 430 2.0 array. Microarray probesets were mapped to 16,595 Ensembl-annotated genes, and these were subsequently partitioned into 7 gene sets based on log2 intensity values (from 2 to 16 in increments of 2).
In contrast, Ctcf occupies more binding sites in genes displaying lower expression levels, suggesting that in this context Ctcf acts as transcriptional repressor. Furthermore, the number of subpeaks resolved by PeakSplitter was much lower in this case, indicating the presence of single high-affinity Ctcf binding sites, possibly comprising several recognition sequences in close proximity. Interestingly, Etf1 occupies roughly the same number of loci per gene at all levels of expression, but these regions are split into significantly more subpeaks in highly expressed genes. This suggests that the frequency of binding to regulatory elements may enhance the expression of Etf1 target genes.
Identification of binding motifs
A common aim in transcription factor-binding site analysis is to identify known and novel sequence patterns occurring within peak regions. To determine whether consensus binding sites are present in a set of ChIP DNA fragments, statistically over-represented subsequences can be found using motif discovery software. The accuracy of motif analysis relies on the specificity of the input sequences, as the presence of excessive flanking regions will often inhibit the detection of common patterns. It is therefore advantageous to reduce non-specific sequence content in order to minimize the amount of uninformative background from which motifs must be distinguished .
The ability to refine the set of input sequences can improve both the accuracy and success rate of motif discovery. In addition to subdividing signal peaks into discrete loci, PeakSplitter can be used to extract genomic DNA sequences corresponding to subpeak summit regions, which can then be used as input candidates for motif analysis. This feature is particularly useful when applied in conjunction with peak-calling software that does not report locations of greatest read depth within peak regions.
We employed MEME  to assess the performance of motif discovery using the subpeak summit sequences output by PeakSplitter relative to entire peak regions. The detection of new sequence motifs has been shown to plateau with a high number of input sequences . Therefore we divided each ChIP-seq dataset into groups of 500 peaks, retrieved genomic DNA sequences corresponding to peak regions and used these as input to MEME. We then repeated this procedure using subpeak summit sequences as reported by PeakSplitter.
Influence of peak-calling methods on motif discovery
The motif analysis described above could potentially be biased toward subpeak division if a particular peak-calling algorithm consistently reports longer peak regions than others. To verify whether this is the case, we compared the performance of PeakSplitter and subsequent motif discovery on the output of several alternative peak-calling utilities, using the Oct4 ChIP-seq profile as a representative example.
Numbers of peaks and length distributions reported by various peak-calling programs
Numbers and length distributions of subpeaks
Finally, we used these results to determine whether peak subdivision enhances motif discovery. The merged peak and subpeak datasets were divided into groups of 300 sequences and used as input to MEME. Since individual peak summit information is lost when regions called by different programs are merged, we used the entire peak sequences for motif analysis rather than regions flanking the summit. Following this analysis the canonical Oct4 binding sequence was not identified in any of the datasets containing the original peaks. After PeakSplitter was applied the motif was found in all of the subpeak datasets, aside from one instance where an Oct4 half-site was reported. These results indicate that subdividing signal peaks is essential for accurate motif discovery, independent of the original peak-calling method used.
Regulatory elements identified through functional genomic assays are commonly determined based on signal peaks from tiling array fluorescence data or aligned reads from massively parallel sequencing. In order to interpret the results of such experiments, they must be considered in context with genes and regulatory elements in proximity to peak regions. Methods to automate the functional annotation of chromatin binding and modification loci can greatly ease characterization of their biological significance in genome-wide analyses.
A variety of tools are available for processing the primary data generated by ChIP-seq experiments, such as mapping sequence reads to a reference genome and identifying areas of significant enrichment. However, this is not the case for downstream analysis and data integration. Existing solutions that address these issues either rely on the transfer of large datasets via the Web for remote processing , require local installation of target genome databases , or operate within a specific computing environment .
PeakAnalyzer is a standalone solution amenable to a wide range of applications, including comparison of data generated on different experimental platforms. The software can accept any genomic loci as input and therefore can be used to process datasets spanning various methods, such as ChIP-seq, ChIP-chip, DamID, MeDIP and bisulfite sequencing. The PeakAnnotator component facilitates the automated annotation of numerous experimental results, and obviates the need to import large datasets into a genome browser for manual visualization and assessment.
Subdividing genomic loci with PeakSplitter is particularly useful for discerning individual binding sites that may be present in aggregate peak regions, and in extracting candidate sequences for motif analysis. We observe an increase in both accuracy and efficiency in motif search when ChIP data are processed by PeakSplitter. Partitioning broad signal peaks into discrete loci enriches the dataset for sequences containing transcription factor-binding sites and other regulatory elements, and can enhance the discovery of new consensus motifs by providing a more focused set of candidate sequences for alignment and/or model building.
Project name: PeakAnalyzer
Operating system(s): Platform independent
Programming language: Java, C++
Other requirements: Java 1.5 or higher, R for graphical output (optional)
License: MIT/X Consortium
Restrictions to use by non-academics: none
The authors thank Pär Engström and Tao Liu (DFCI, Harvard) for discussions and program testing. Support is acknowledged from EMBL and BBSRC grant BBG0156781.
- Fejes A, Robertson G, Bilenky M, Varhol R, Bainbridge M, Jones S: FindPeaks 3.1: a tool for identifying areas of enrichment from massively parallel short-read sequencing technology. Bioinformatics 2008, 24(15):1729–30. 10.1093/bioinformatics/btn305View ArticlePubMedPubMed CentralGoogle Scholar
- Zhang Y, Liu T, Meyer C, Eeckhoute J, Johnson D, Bernstein B, Nussbaum C, Myers R, Brown M, Li W, Liu X: Model-based analysis of ChIP-Seq (MACS). Genome Biol 2008, 9(9):R137. 10.1186/gb-2008-9-9-r137View ArticlePubMedPubMed CentralGoogle Scholar
- Spyrou C, Stark R, Lynch A, Tavaré S: BayesPeak: Bayesian analysis of ChIP-seq data. BMC Bioinformatics 2009, 10(0):299. 10.1186/1471-2105-10-299View ArticlePubMedPubMed CentralGoogle Scholar
- Rozowsky J, Euskirchen G, Auerbach R, Zhang Z, Gibson T, Bjornson R, Carriero N, Snyder M, Gerstein M: PeakSeq enables systematic scoring of ChIP-seq experiments relative to controls. Nat Biotechnol 2009, 27: 66–75. 10.1038/nbt.1518View ArticlePubMedPubMed CentralGoogle Scholar
- Nix D, Courdy S, Boucher K: Empirical methods for controlling false positives and estimating confidence in ChIP-Seq peaks. BMC Bioinformatics 2008, 9(0):523. 10.1186/1471-2105-9-523View ArticlePubMedPubMed CentralGoogle Scholar
- Mortazavi A, Williams B, McCue K, Schaeffer L, Wold B: Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat Methods 2008, 5(7):621–8. 10.1038/nmeth.1226View ArticlePubMedGoogle Scholar
- Valouev A, Johnson D, Sundquist A, Medina C, Anton E, Batzoglou S, Myers R, Sidow A: Genome-wide analysis of transcription factor binding sites based on ChIP-Seq data. Nat Methods 2008, 5(9):829–34. 10.1038/nmeth.1246View ArticlePubMedPubMed CentralGoogle Scholar
- Jothi R, Cuddapah S, Barski A, Cui K, Zhao K: Genome-wide identification of in vivo protein-DNA binding sites from ChIP-Seq data. Nucleic Acids Res 2008, 36(16):5221–31. 10.1093/nar/gkn488View ArticlePubMedPubMed CentralGoogle Scholar
- Zang C, Schones D, Zeng C, Cui K, Zhao K, Peng W: A clustering approach for identification of enriched domains from histone modification ChIP-Seq data. Bioinformatics 2009, 25(15):1952–8. 10.1093/bioinformatics/btp340View ArticlePubMedPubMed CentralGoogle Scholar
- Hubbard T, Aken B, Ayling S, Ballester B, Beal K, Bragin E, Brent S, Chen Y, Clapham P, Clarke L, Coates G, Fairley S, Fitzgerald S, Fernandez-Banet J, Gordon L, Graf S, Haider S, Hammond M, Holland R, Howe K, Jenkinson A, Johnson N, Kahari A, Keefe D, Keenan S, Kinsella R, Kokocinski F, Kulesha E, Lawson D, Longden I, Megy K, Meidl P, Overduin B, Parker A, Pritchard B, Rios D, Schuster M, Slater G, Smedley D, Spooner W, Spudich G, Trevanion S, Vilella A, Vogel J, White S, Wilder S, Zadissa A, Birney E, Cunningham F, Curwen V, Durbin R, Fernandez-Suarez X, Herrero J, Kasprzyk A, Proctor G, Smith J, Searle S, Flicek P: Ensembl 2009. Nucleic Acids Res 2009, 37(0):D690–7. 10.1093/nar/gkn828View ArticlePubMedPubMed CentralGoogle Scholar
- Alekseyenko A, Lee C: Nested Containment List (NCList): a new algorithm for accelerating interval query of genome alignment and interval databases. Bioinformatics 2007, 23(11):1386–93. 10.1093/bioinformatics/btl647View ArticlePubMedGoogle Scholar
- Chen X, Xu H, Yuan P, Fang F, Huss M, Vega V, Wong E, Orlov Y, Zhang W, Jiang J, Loh Y, Yeo H, Yeo Z, Narang V, Govindarajan K, Leong B, Shahab A, Ruan Y, Bourque G, Sung W, Clarke N, Wei C, Ng H: Integration of external signaling pathways with the core transcriptional network in embryonic stem cells. Cell 2008, 133(6):1106–17. 10.1016/j.cell.2008.04.043View ArticlePubMedGoogle Scholar
- Langmead B, Trapnell C, Pop M, Salzberg S: Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol 2009, 10(3):R25. 10.1186/gb-2009-10-3-r25View ArticlePubMedPubMed CentralGoogle Scholar
- Mikkelsen T, Ku M, Jaffe D, Issac B, Lieberman E, Giannoukos G, Alvarez P, Brockman W, Kim T, Koche R, Lee W, Mendenhall E, O'Donovan A, Presser A, Russ C, Xie X, Meissner A, Wernig M, Jaenisch R, Nusbaum C, Lander E, Bernstein B: Genome-wide maps of chromatin state in pluripotent and lineage-committed cells. Nature 2007, 448(7153):553–60. 10.1038/nature06008View ArticlePubMedPubMed CentralGoogle Scholar
- Henriksson M, Lüscher B: Proteins of the Myc network: essential regulators of cell growth and differentiation. Adv Cancer Res 1996, 68(0):109–82. full_textView ArticlePubMedGoogle Scholar
- Mukherjee B, Morgenbesser S, DePinho R: Myc family oncoproteins function through a common pathway to transform normal cells in culture: cross-interference by Max and trans-acting dominant mutants. Genes Dev 1992, 6(8):1480–92. 10.1101/gad.6.8.1480View ArticlePubMedGoogle Scholar
- Amati B, Brooks M, Levy N, Littlewood T, Evan G, Land H: Oncogenic activity of the c-Myc protein requires dimerization with Max. Cell 1993, 72(2):233–45. 10.1016/0092-8674(93)90663-BView ArticlePubMedGoogle Scholar
- O'Hagan R, Schreiber-Agus N, Chen K, David G, Engelman J, Schwab R, Alland L, Thomson C, Ronning D, Sacchettini J, Meltzer P, DePinho R: Gene-target recognition among members of the myc superfamily and implications for oncogenesis. Nat Genet 2000, 24(2):113–9. 10.1038/72761View ArticlePubMedGoogle Scholar
- Malynn B, de Alboran I, O'Hagan R, Bronson R, Davidson L, DePinho R, Alt F: N-myc can functionally replace c-myc in murine development, cellular growth, and differentiation. Genes Dev 2000, 14(11):1390–9.PubMedPubMed CentralGoogle Scholar
- Su A, Wiltshire T, Batalov S, Lapp H, Ching K, Block D, Zhang J, Soden R, Hayakawa M, Kreiman G, Cooke M, Walker J, Hogenesch J: A gene atlas of the mouse and human protein-encoding transcriptomes. Proc Natl Acad Sci USA 2004, 101(16):6062–7. 10.1073/pnas.0400782101View ArticlePubMedPubMed CentralGoogle Scholar
- MacIsaac K, Fraenkel E: Practical strategies for discovering regulatory DNA sequence motifs. PLoS Comput Biol 2006, 2(4):e36. 10.1371/journal.pcbi.0020036View ArticlePubMedPubMed CentralGoogle Scholar
- Bailey T, Boden M, Buske F, Frith M, Grant C, Clementi L, Ren J, Li W, Noble W: MEME SUITE: tools for motif discovery and searching. Nucleic Acids Res 2009, 37(0):W202–8. 10.1093/nar/gkp335View ArticlePubMedPubMed CentralGoogle Scholar
- Hu J, Li B, Kihara D: Limitations and potentials of current motif discovery algorithms. Nucleic Acids Res 2005, 33(15):4899–913. 10.1093/nar/gki791View ArticlePubMedPubMed CentralGoogle Scholar
- Johnson D, Mortazavi A, Myers R, Wold B: Genome-wide mapping of in vivo protein-DNA interactions. Science 2007, 316(5830):1497–502. 10.1126/science.1141319View ArticlePubMedGoogle Scholar
- Blankenberg D, Taylor J, Schenck I, He J, Zhang Y, Ghent M, Veeraraghavan N, Albert I, Miller W, Makova K, Hardison R, Nekrutenko A: A framework for collaborative analysis of ENCODE data: making large-scale analyses biologist-friendly. Genome Res 2007, 17(6):960–4. 10.1101/gr.5578007View ArticlePubMedPubMed CentralGoogle Scholar
- Ji H, Jiang H, Ma W, Johnson D, Myers R, Wong W: An integrated software system for analyzing ChIP-chip and ChIP-seq data. Nat Biotechnol 2008, 26(11):1293–300. 10.1038/nbt.1505View ArticlePubMedPubMed CentralGoogle Scholar
- Zhu L, Gazin C, Lawson N, Pages H, Lin S, Lapointe D, Green M: ChIPpeakAnno: a Bioconductor package to annotate ChIP-seq and ChIP-chip data. BMC Bioinformatics 2010, 11(0):237. 10.1186/1471-2105-11-237View ArticlePubMedPubMed CentralGoogle Scholar
- Wilson M, Barbosa-Morais N, Schmidt D, Conboy C, Vanes L, Tybulewicz V, Fisher E, Tavaré S, Odom D: Species-specific transcription in mice carrying human chromosome 21. Science 2008, 322(5900):434–8. 10.1126/science.1160930View ArticlePubMedPubMed CentralGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.