- Methodology article
- Open Access
A motif-independent metric for DNA sequence specificity
BMC Bioinformaticsvolume 12, Article number: 408 (2011)
Genome-wide mapping of protein-DNA interactions has been widely used to investigate biological functions of the genome. An important question is to what extent such interactions are regulated at the DNA sequence level. However, current investigation is hampered by the lack of computational methods for systematic evaluating sequence specificity.
We present a simple, unbiased quantitative measure for DNA sequence specificity called the Motif Independent Measure (MIM). By analyzing both simulated and real experimental data, we found that the MIM measure can be used to detect sequence specificity independent of presence of transcription factor (TF) binding motifs. We also found that the level of specificity associated with H3K4me1 target sequences is highly cell-type specific and highest in embryonic stem (ES) cells. We predicted H3K4me1 target sequences by using the N- score model and found that the prediction accuracy is indeed high in ES cells.The software to compute the MIM is freely available at: https://github.com/lucapinello/mim.
Our method provides a unified framework for quantifying DNA sequence specificity and serves as a guide for development of sequence-based prediction models.
Of the entire 3GB human genome, only about 2% codes for proteins. The identification of biological functions of the entire genome remains a major challenge [1, 2]. One powerful venue to gain functional insights is to identify the proteins that bind to each genomic region. Recent development of chromatin immunoprecipitation followed by microarray or sequencing (ChIP- chip or ChIPseq) technologies has made it feasible to map genome-wide protein-DNA interaction profiles [3–5]. The data generated by these experiments have not only greatly facilitated the genome-wide characterization of regulatory elements such as enhancers [6, 7] but also been integrated with other data sources to build gene regulatory networks [8–11].
An important question is to what extent a specific protein-DNA interaction is mediated at the level of genomic sequences. While it is well known that specific sequence motifs are crucial for transcription factors (TF) mediated cis-regulation, there are many other proteins, such as chromatin modifiers, whose target sequences cannot simply be characterized by a handful of distinct motifs . Such sequences are often regarded as nonspecific and not studied further. However, recent studies in nucleosome positioning have provided new insights by going beyond this motif-centric view . Here various sequence features have been associated with nucleosome positioning, including poly dA:dT track [14, 15], abundance of G/C content [16, 17], and certain periodic patterns [18, 19]. Such patterns cannot be captured by traditional motif analysis methods. Similar results have been obtained by analyzing histone modification [20, 21] and DNA methylation data [22, 23].
Despite the success of these recent sequence-based prediction models, it remains difficult to determine which sequences lack intrinsic specificity because a poor prediction outcome might imply than more sophisticated models. A guide is needed for developing sequence-based prediction models. To this end, here we present a simple approach to quantify sequence specificity based on the frequency distribution of k-mers. We will also systematically investigate the relative merit of various distance or similarity functions for capturing specific sequence information. While k-mers have been extensively to detect splice sites , to study functional genomic regions, to identify protein coding genes and used in motif analysis (reviewed by ), to our knowledge, they have not been used to quantify sequence specificity.
We evaluated the performance of our approach by analyzing one simulated datasets and two real experimental datasets, corresponding to a TF (STAT1) and a histone modification (H3K4me1) respectively. Our results have provided new insights into the role of DNA sequences in modulating protein-DNA interactions regardless of motif presence.
A simple measure of sequence specificity
While specific sequence information has been identified in the absence of distinct motifs, to our knowledge, it is always associated with enrichment of certain k-mers (where k is a small number, such as 4). Its main difference with motifs is that, when k is small, a single k-mer may occur many times in the genome and therefore would not be useful for any practical purpose. On the other hand, we reasoned that more specific information can be obtained by combinations of multiple k-mers. Therefore, it seems appropriate to quantify sequence specificity by aggregating enrichment information for all k-mers. For the rest of the paper, we fix k = 4, although the method presented below is equally applicable to any choice of k. Treating complementary sequences as identical, there are 136 non-redundant 4-mers. By counting the frequency of each 4-mer, each input sequence is then mapped to a 136 dimensional numerical vector containing the frequency of each k-mer. The distributions corresponding to sequences containing specific information should be distinct from those for random sequences, which are generated to match the number and length of the input sequences. We use the symmetric Kullback-Leibler (KL) divergence  for comparing frequency distributions and average over the entire set of input sequences. We term the resulting value as the M otif I ndependent M etric (MIM). To evaluate statistical significance, we estimate the null distribution by computing MIM values for sets of random sequences. The detailed procedure is described in the Methods section.
As an initial evaluation, we synthetically generated 8 sequence sets each containing 2000 sequences, mimicking TF ChIPseq experiments for which the corresponding TF recognizes a single motif: TTGACA. The difference between these sequence sets is the motif strength, which is parameterized by a real number ε (see Methods). In particular, a perfect motif corresponds to ε = 0, whereas a random sequence corresponds to ε = 0.25. In a typical ChIPseq experiment, only a subset of target sequences contains the motif. To simulate this fact, we randomly selected 1000 sequences from each set and inserted the motif at a randomly selected location. As control, we also synthesized 1000 sets of 2000 random sequences each.
We calculated the MIM values for each sequence set and evaluated the statistical significance of the resulting values. We found that the MIM values are statistically significant (p-value < 0.001) for ε up to 0.1 (Figure 1a and 1b). The information content for the corresponding motif is 5.35 bit, which is still lower than 98% of the motifs in the JASPAR core database . In the following we will show that our method indeed performs well for real data. We ranked each k-mer according to its relative contribution to the MIM. The most informative k-mers are shown in Table 1. The methodology used to select such motif is outlined in the methods section. We noticed that the top k-mers are substrings of the inserted motif (highlighted in bold in Table 1), suggesting that these k-mers may be used as a seed for motif detection, in a similar way as the dictionary approach . In additional to the KL divergence considered here, there are a number of other metrics to compare frequency distributions. We selected a few commonly used metrics and repeated the above analysis (Methods). We found that the results are quite similar (Table 2).
Real ChIPseq data
To validate our method using real experimental data, we analyzed a publicly available ChIPseq dataset for STAT1 , a member of the signal transducer and activator of transcription (STAT) family TFs, in the HeLa S3 cell line. The dataset contains 39,000 target sequences, 35% of which contains the consensus motif TTCCNGGAA (JASPAR database ). As control, we sampled random sequences from genomic background matching the number and length of the target sequences.
We evaluated the level of sequence specificity of the whole set of target sequences by using the MIM measure. The sequences are indeed highly specific (see Figure 2a and 2b). Again, among the top ranked k-mers, several are substrings of the "classic" STAT1 motif (highlighted in bold in Table 3), suggesting it may provide useful information for identifying discriminative sequence signatures without the knowledge of TF motifs. Furthermore, the results are not sensitive to the specific choice of distances as in the simulated data experiment (Table 4).
Detecting sequence specificity in absence of a dominant motif
As mentioned above, while the presence of STAT1 motif can explain the sequence specificity for 35% of the target sequences, it is unclear how TF is recruited to the other 65% of the targets. In order to evaluate the role of DNA sequence specificity for these motif-absent targets, we compared the MIM values between the motif-present and motif-absent subsets of targets. Surprisingly, we found that the MIM value for motif-absent targets is almost indistinguishable from motif-present targets (see Figure 2a and 2b). This high level of specificity cannot be simply explained by promoter-related biases, because only 11% of target sequences are located in promoters.
To gain mechanistic insights, we searched for enrichment of other TF motifs in the JASPAR database , using the FIMO software . We found two motifs that are significantly enriched (threshold p-value < 10-6): SP1 and ESR1, both have previously been shown to interact with STAT1 [33, 34]. Therefore, STAT1 might be recruited to the motif-absent targets through interaction with these other TFs. We further compared the associated gene ontology terms between the motif-present and motif-absent sets to see if there are any functional differences. We found that these two sets share many similar biological functions, such as hydrolase and ATPase activities (p < 10-17). On the other hand, while the motif-present targets are highly enriched for the voltage-gated calcium channel complex (p < 10-12), the motif-absent targets are highly enriched for cytoplasmic components instead (p < 10-12).
Unlike TFs, histone (de)modifying enzymes usually do not directly interact with DNA. The role of DNA sequences in the regulation of histone modification patterns remains poorly understood. As an example, the histone modification H3K4me1 plays an important role in gene regulation by demarcating cell-type specific enhancers ; yet how it is recruited to enhancer regions is poorly understood. We hypothesized that the role of DNA sequence may play a cell-type specific role and aimed to detect such differences by using our MIM measure. To this end, we assembled an H3K4me1 ChIPseq dataset in seven human cell-lines, including H1 (a human embryonic stem cell line), K562 (a myelogenous leukemia cell line), Huvec (human umbilical vein endothelial cells), Nhek (normal human epidermal keratinocytes), and three T cell-lines (CD4+, CD36+, and CD133+) from the public domain [1, 4, 35]. For each cell line, we identified the peak locations by using cisGenome  then calculated the MIM value for DNA sequences at the peaks (in Table 5 the top 20 k-mers ranking by different distances on H1 cell line). The MIM values are highly cell-type specific (see Figure 3a and 3b and Table 6). Interestingly, the value for H1 cells is much higher than any other cell line, suggesting that the DNA sequence plays a unique role in H3K4me1 recruitment in ES cells. To eliminate the possibility that this difference may be simply due to a GC content related bias, we repeated the analysis by using a different null model, obtained by random shuffling the original sequences within each dataset. While the MIM values slightly change, they are ordered in nearly the same way as before (Additional File 1). Importantly, the MIM values are distinctively higher in the H1 cell line compared to the other cell lines, suggesting that such differences are unlikely due to a GC- content related bias.
Since the H3K4me1 marks cell-type specific enhancers, one possible explanation for the high sequence specificity in ES cells is that the targets might be associated with a few ES-specific TFs. To test this possibility, we searched for enrichment of TF motifs in the JASPAR database using FIMO. Surprisingly, we were unable to find any significantly-enriched motif, suggesting that the specificity is contributed to a different mechanism.
We then investigated whether the H3K4me1 targets in ES cells are indeed highly predictable. In previous work, we developed a sequence-based model, called the N-score model, to predict epigenetic targets [19, 21]. This model integrates information from three classes of sequence features (sequence periodicity, word counts, and DNA structural parameters) by using stepwise logistic regression model (see methods for details). Here we applied the N-score model to predict H3K4me1 target sequences. As negative control, we selected the same number of sequences from the genome at random. We evaluated the model performance by using a 3-fold cross-validation. We found that prediction accuracy is indeed high for ES cells (AUC = 0.967) (Figure 4), whereas the accuracy for other cell types is much lower.
Recently it has been shown that a large number of proteins may weakly bind to DNA . It remains unclear to what extent such events are mediated by specific sequence information. This question cannot be answered by using traditional motif analysis, since the target sequences do not contain distinct motifs. As an alternative approach, we define a simple measure, called MIM, to quantify sequence specificity by aggregating information from all k-mers. Our approach does not make any assumptions regarding motif presence, providing a more versatile tool for sequence analysis. We validated this method by analyzing both simulated and experimental data and found that it is indeed effective for detecting sequence specificity in both cases.
We also showed that the MIM measure can provide new biological insights. Specifically, we found that the motif-absent targets of a TF may also contain specific sequence information due to interaction with other TFs. We also found that the sequence specificity for H3K4me1 targets is higher in ES cells than in differentiated cell-types, suggesting a unique role of DNA sequence in the recruitment of H3K4me1 in ES cells. Interestingly, this high specificity cannot be explained by enrichment of known TF motifs, suggesting a yet uncharacterized recruitment mechanism in ES cells. The MIM algorithm is implemented in Python and can be freely accessed at : https://github.com/lucapinello/mim.
The role of DNA sequence in gene regulation remains incompletely understood. Our MIM method has extended previous work by further accounting for sequence specificity due to accumulation of weak sequence features. The information can be used as a guide to systematically investigate the regulatory mechanisms for a wide variety of biological processes.
Synthetic data generation
We simulated ChIPseq data for a TF whose motif sequence is TTGACA. In order to simulate the variation of motif sites among different target sequences, we modeled the position weight matrix (PWM) as illustrated in Table 7, where ε measures the mutation rate of the motif and can change between 0 (perfect motif) and 0.25 (totally random). We sampled ε at 8 different values: 0, 0.001, 0.005, 0.01, 0.05, 0.1, 0.1667, and 0.25. For each choice of ε, we generated 2000 sequences of 500 bp each. The sequences were initially generated by randomly sampling from the background distribution with the probabilities of A,C,G,T equal to 0.15, 0.35, 0.35, 0.15, respectively. In addition, we randomly selected a subset of 1000 sequences and inserted the motif at a random location.
ChIPseq data source
Genome-wide STAT1 peak locations in HeLa S3 cell lines were obtained from the http://archive.gersteinlab.org/proj/PeakSeq/Scoring_ChIPSeq/Results/STAT1. ChIPseq data for H3K4me1 in seven human cell lines were obtained from literature: CD4+ T cell , CD36+ and CD133+ T cells , H1, Huvec, K562, and Nhek . The raw data were processed by cisGenome to identify peak locations . The DNA sequences at the peak locations were analyzed subsequently.
Motif analysis was done by using several tools in the MEME suite (http://meme.nbcr.net/meme/) as follows. Scanning DNA sequences for matches of a known motif was done by using the FIMO . Motif comparison was done by using TOMTOM.
Details of the MIM measure
Each DNA sequence is mapped to numerical values by enumerating the frequency of each k-mer treating complementary k-mers as the same. There are m = 136 non-redundant k-mers for k = 4. MIM is essentially a metric between two distributions of k-mer frequencies. Specifically, let P = ( P ij ) be the k-mer frequency distributions corresponding to a set of n target sequences S = ( S i ), where S i represents a sequence in the set S. We generate a set of n random sequences R = ( R i ) matching the sequence lengths (analogously R i represents a sequence in the set R). Let Q = (Q ij ) be the k-mer frequency distributions corresponding to R. Finally let and (P j in particular represents the probability of the j-th k-mer in S, analogously, Q j represents the probability of the j-th k-mer in R) then the difference between P and Q is quantified by the symmetrical Kullback-Leibler (KL) divergence , as follows:
The MIM value corresponding to S is defined as the expected value d kl (S, R), which is estimated by averaging over 1000 sets of random sequences. The MIM value, using the symmetrical KL divergence, can be interpreted as the number of the expected number of extra bits required to code samples from S when using a code based on the background distribution. Note that there exist several alternatives to measure the similarity of two probability distributions . To evaluate whether the results are sensitive to the specific choice of distances, we also computed MIM values based on two other well-known distances between probability distributions:
The Hellinger distance 
whose main differences from d kl are 1) d hl naturally satisfies the triangle inequality; and 2) the range of d hl is the interval [0,1].
The Bhattacharyya distance
which has been widely used for pattern recognition in computer science ;
where and are the mean and standard deviation, respectively, of P j ( and are defined similarly for Q j ).
In order to estimate the null distribution, we generated 1000 sets of random sequences and then calculated MIM values for each random sequence set. The probability density function (pdf) was estimated by using a kernel method . This pdf was used to infer not only the mean and standard deviation of the null distribution but also the statistical significance for any MIM value. Recognizing the limited resolution of the estimated pdf, we did not distinguish p-values that are smaller than 0.001.
The N-score model was described previously [19, 21]. In brief, the model integrates three types of sequence features, including sequence periodicities , word counts , and structural parameters , a total of 2920 candidate features. Model selection was done by stepwise logistic regression. The final model was used for target prediction.
Most informative k-mers selection
Giving P j and Q j associated to S and R respectively, it is possible to calculate their Kullback-Leibler (KL) divergence for each j, where j indicates the j-th k-mer component. This results in a list of 136 distance values, whose ranking can be used as a guide to identify the most informative k-mers.
Birney E, et al.: Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature 2007, 447(7146):799–816. 10.1038/nature05874
TCGA: Comprehensive genomic characterization defines human glioblastoma genes and core pathways. Nature 2008, 455(7216):1061–8. 10.1038/nature07385
Bussemaker HJ, Li H, Siggia ED: Regulatory element detection using a probabilistic segmentation model. Proc Int Conf Intell Syst Mol Biol 2000, 8: 67–74.
Barski A, et al.: High-resolution profiling of histone methylations in the human genome. Cell 2007, 129(4):823–37. 10.1016/j.cell.2007.05.009
Mikkelsen TS, et al.: Genome-wide maps of chromatin state in pluripotent and lineage-committed cells. Nature 2007, 448(7153):553–60. 10.1038/nature06008
Heintzman ND, et al.: Histone modifications at human enhancers reflect global cell-type-specific gene expression. Nature 2009, 459(7243):108–12. 10.1038/nature07829
Crawford GE, et al.: Genome-wide mapping of DNase hypersensitive sites using massively parallel signature sequencing (MPSS). Genome Res 2006, 16(1):123–31.
Yeang CH, Ideker T, Jaakkola T: Physical network models. J Comput Biol 2004, 11(2–3):243–62. 10.1089/1066527041410382
Harbison CT, et al.: Transcriptional regulatory code of a eukaryotic genome. Nature 2004, 431(7004):99–104. 10.1038/nature02800
Zhou Q, et al.: A gene regulatory network in mouse embryonic stem cells. Proc Natl Acad Sci USA 2007, 104(42):16438–43. 10.1073/pnas.0701014104
Chang LW, et al.: Computational identification of the normal and perturbed genetic networks involved in myeloid differentiation and acute promyelocytic leukemia. Genome Biol 2008, 9(2):R38. 10.1186/gb-2008-9-2-r38
Kouzarides T: Chromatin modifications and their function. Cell 2007, 128(4):693–705. 10.1016/j.cell.2007.02.005
Jiang C, Pugh BF: Nucleosome positioning and gene regulation: advances through genomics. Nat Rev Genet 2009, 10(3):161–72.
Sekinger EA, Moqtaderi Z, Struhl K: Intrinsic histone-DNA interactions and low nucleosome density are important for preferential accessibility of promoter regions in yeast. Mol Cell 2005, 18(6):735–48. 10.1016/j.molcel.2005.05.003
Yuan GC, et al.: Genome-scale identification of nucleosome positions in S. cerevisiae. Science 2005, 309(5734):626–30. 10.1126/science.1112178
Peckham HE, et al.: Nucleosome positioning signals in genomic DNA. Genome Res 2007, 17(8):1170–7. 10.1101/gr.6101007
Tillo D, Hughes TR: G+C content dominates intrinsic nucleosome occupancy. BMC Bioinformatics 2009, 10: 442. 10.1186/1471-2105-10-442
Field Y, et al.: Distinct modes of regulation by chromatin encoded through nucleosome positioning signals. PLoS Comput Biol 2008, 4(11):e1000216. 10.1371/journal.pcbi.1000216
Yuan GC, Liu JS: Genomic sequence is highly predictive of local nucleosome depletion. PLoS Comput Biol 2008, 4(1):e13. 10.1371/journal.pcbi.0040013
Ku M, et al.: Genomewide analysis of PRC1 and PRC2 occupancy identifies two classes of bivalent domains. PLoS Genet 2008, 4(10):e1000242. 10.1371/journal.pgen.1000242
Yuan GC: Targeted recruitment of histone modifications in humans predicted by genomic sequences. J Comput Biol 2009, 16(2):341–55. 10.1089/cmb.2008.18TT
Bock C, et al.: CpG island methylation in human lymphocytes is highly correlated with DNA sequence, repeats, and predicted DNA structure. PLoS Genet 2006, 2(3):e26. 10.1371/journal.pgen.0020026
Das R, et al.: Computational prediction of methylation status in human genomic sequences. Proc Natl Acad Sci USA 2006, 103(28):10713–6. 10.1073/pnas.0602949103
Salzberg SL: A method for identifying splice sites and translational start sites in eukaryotic mRNA. Comput Appl Biosci 1997, 13(4):365–76.
DeCaprio D, et al.: Conrad: gene prediction using conditional random fields. Genome Res 2007, 17(9):1389–98. 10.1101/gr.6558107
Narlikar L, et al.: Genome-wide discovery of human heart enhancers. Genome Res 2010, 20(3):381–92. 10.1101/gr.098657.109
Ji H, Wong WH: Computational biology: toward deciphering gene regulatory information in mammalian genomes. Biometrics 2006, 62(3):645–63. 10.1111/j.1541-0420.2006.00625.x
Kullback S, Leibler RA: On Information and Sufficiency. The Annals of Mathematical Statistics 1951, 22(1):79–86. 10.1214/aoms/1177729694
Sandelin A, et al.: JASPAR: an open-access database for eukaryotic transcription factor binding profiles. Nucleic Acids Res 2004, (32 Database):D91–4.
Bussemaker HJ, Li H, Siggia ED: Building a dictionary for genomes: identification of presumptive regulatory sites by statistical analysis. Proc Natl Acad Sci USA 2000, 97(18):10096–100.
Rozowsky J, et al.: PeakSeq enables systematic scoring of ChIP-seq experiments relative to controls. Nat Biotechnol 2009, 27(1):66–75. 10.1038/nbt.1518
Grant CE, Bailey TL, Noble WS: FIMO: scanning for occurrences of a given motif. Bioinformatics 2011, 27(7):1017–8. 10.1093/bioinformatics/btr064
Look DC, et al.: Stat1 depends on transcriptional synergy with Sp1. J Biol Chem 1995, 270(51):30264–7. 10.1074/jbc.270.51.30264
Panchanathan R, et al.: Mutually positive regulatory feedback loop between interferons and estrogen receptor-alpha in mice: implications for sex bias in autoimmunity. PLoS One 2010, 5(5):e10868. 10.1371/journal.pone.0010868
Cui K, et al.: Chromatin signatures in multipotent human hematopoietic stem cells indicate the fate of bivalent genes during differentiation. Cell Stem Cell 2009, 4(1):80–93. 10.1016/j.stem.2008.11.011
Ji H, et al.: An integrated software system for analyzing ChIP-chip and ChIP-seq data. Nat Biotechnol 2008, 26(11):1293–300. 10.1038/nbt.1505
Hu S, et al.: Profiling the human protein-DNA interactome reveals ERK2 as a transcriptional repressor of interferon signaling. Cell 2009, 139(3):610–22. 10.1016/j.cell.2009.08.037
Eden E, et al.: GOrilla: a tool for discovery and visualization of enriched GO terms in ranked gene lists. BMC Bioinformatics 2009, 10: 48. 10.1186/1471-2105-10-48
Deza E, Deza MM: Dictionary of distances. Elsevier; 2006.
Theodoridis S, Koutroumbas K: Pattern Recognition. Fourth edition. Academic Press; 2009.
Kailath T: The Divergence and Bhattacharyya Distance Measures in Signal Selection. Communications, IEEE Transactions on [legacy, pre - 1988] 1967, 15(1):52–60.
Bowman AW, Azzalini A: Applied Smoothing Techniques for Data Analysis. Oxford Univeristy Press; 1997.
Lee W, et al.: A high-resolution atlas of nucleosome occupancy in yeast. Nat Genet 2007, 39(10):1235–44. 10.1038/ng2117
We thank Zhen Shao for help with H3K4me1 data collection and initial processing. GY's research was supported by the NIH grant HG005085 and a Career Incubator Award from the Harvard School of Public Health.
LP and GY conceived and designed the study. LP and GL have implemented the MIM methodology. LP and BH analyzed the data. LP and GY interpreted the data. All authors wrote, read and approved the manuscript.