Effect of false positive and false negative rates on inference of binding target conservation across different conditions and species from ChIP-chip data
- Debayan Datta^{1} and
- Hongyu Zhao^{2, 3}
https://doi.org/10.1186/1471-2105-10-23
© Datta and Zhao; licensee BioMed Central Ltd. 2009
Received: 30 May 2008
Accepted: 19 January 2009
Published: 19 January 2009
Abstract
Background
ChIP-chip data are routinely used to identify transcription factor binding targets. However, the presence of false positives and false negatives in ChIP-chip data complicates and hinders analyses, especially when the binding targets for a specific transcription factor are compared across conditions or species.
Results
We propose an Expectation Maximization based approach to infer the underlying true counts of "positives" and "negatives" from the observed counts. Based on this approach, we study the effect of false positives and false negatives on inferences related to transcription regulation.
Conclusion
Our results indicate that if there is a significant degree of association among the binding targets across conditions/species (log odds ratio > 4), moderate values of false positive and false negative rates (0.005 and 0.4 respectively) would not change our inference qualitatively (i.e. the presence or absence of conservation) based on the observed experimental data despite a significant change in the observed counts. However, if the underlying association is marginal, with odds ratios close to 1, moderate to large values of false positive and false negative rates (0.01 and 0.2 respectively) could mask the underlying association.
Background
Transcription factors play an important role in gene regulation by binding to specific DNA sequences in the regulatory regions of their targets. Accurate identification of the binding targets of the transcription factors is paramount to the understanding of the regulatory mechanism. Chromatin immunoprecipitation (ChIP) experiments are commonly used to identify the regulatory targets in prokaryotes and eukaryotes. ChIP-chip experiments provide us with information about the binding targets of a particular regulator at the genome level [1–4].
The output of ChIP-chip experiments is often summarized in binary form. Using replicate data, the statistical evidence for a gene being a binding target of a transcription factor is typically summarized as a p-value. A threshold for the p-value, e.g. 0.001, is then chosen, and genes with p-values below the threshold are considered binding targets of the transcription factor. Thus, for a transcription factor, we can enumerate a list of genes that are "positives", i.e. binding targets, and a list of genes that are "negatives", i.e. non-targets. A very stringent threshold controls the number of false positives at the expense of more false negatives, whereas a more relaxed threshold reduces the number of false negatives but yields more false positives.
Over the past few years, ChIP-chip data have formed the basis of many studies of transcription regulatory mechanisms. Several groups have compared the binding of a regulator across multiple experimental conditions to determine the condition dependence of binding [5–7]. Similarly, binding data for specific transcription factors across species have been used to investigate the presence of conserved binding targets [8]. Unfortunately, noise in the form of the false positives and false negatives discussed above may lead to inaccurate inference of the binding targets, and thus to biased results and potentially incorrect conclusions on key aspects of transcription regulation, e.g. preservation of regulatory targets across conditions and species. In this article, we develop a statistical approach to analyzing ChIP-chip data that explicitly incorporates false positives and false negatives. Based on this approach, we investigate the effect of false positives and false negatives on the inference of binding target conservation from ChIP-chip data.
Methods
Summarizing Contingency Tables
As discussed above, the output of ChIP-chip experiments is typically summarized in binary form, and results from different experiments for the same transcription factor can be cross-tabulated into a contingency table. A common question is whether a transcription factor has similar binding targets across conditions, which is reflected in the dependency of the outcomes among the conditions. In the following, we briefly discuss two statistical measures that we will use to summarize the degree of dependency in a contingency table.
For the sake of clarity, we will focus on ChIP-chip experiments involving two different conditions or two species. The number of target genes in the two conditions/species can be cross-tabulated into a 2 by 2 contingency table. We use two metrics to summarize such contingency tables – Odds Ratio and Positive specific agreement [9, 10].
Simple 2 by 2 contingency table.
 | Condition 2 = 0 | Condition 2 = 1
---|---|---
Condition 1 = 0 | a | b
Condition 1 = 1 | c | d
For the table above, the odds ratio is ad/(bc), and a value greater than 1 indicates positive association between the two conditions. Other measures of dependency are also often used in psychological and medical research. For example, the problem can be formulated as follows: suppose two raters classify each subject in a sample from some target population according to the presence or absence of some characteristic of interest. The resulting data can then be summarized in a 2 by 2 table. The agreement between raters can be quantified by simple agreement, which is defined as the proportion of cases for which both raters agree, or (a + d)/(a + b + c + d). However, if a is large, this approaches 1 regardless of the performance on positive cases. Positive specific agreement provides insight when the positive cases are rare. It estimates the conditional probability that one rater will rate a case positive given that the other rated it positive, where the role of the two raters is selected randomly. Positive specific agreement, p_{pos}, is defined as: p_{pos} = 2d/(2d + b + c). Both the (log) odds ratio and positive specific agreement are considered in the following discussion.
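The three summaries discussed here can be computed directly from the four cell counts. The following is a minimal sketch, not the authors' code: the function names are illustrative, the odds ratio is taken to be the standard ad/(bc), and the example counts are those of the Hydrolase activity cross-tabulation reported later in this article.

```python
import math

def simple_agreement(a, b, c, d):
    """Proportion of genes on which the two conditions agree."""
    return (a + d) / (a + b + c + d)

def positive_specific_agreement(a, b, c, d):
    """p_pos = 2d / (2d + b + c); ignores the (0, 0) cell entirely."""
    return 2 * d / (2 * d + b + c)

def log_odds_ratio(a, b, c, d):
    """Natural log of ad / (bc); undefined when b or c is zero."""
    return math.log((a * d) / (b * c))

# Counts from the Hydrolase activity table (a, b, c, d) = (551, 4, 5, 5).
a, b, c, d = 551, 4, 5, 5
print(f"simple agreement:            {simple_agreement(a, b, c, d):.3f}")
print(f"positive specific agreement: {positive_specific_agreement(a, b, c, d):.3f}")
print(f"log odds ratio:              {log_odds_ratio(a, b, c, d):.3f}")
```

With these counts, simple agreement is dominated by the large (0, 0) cell, while positive specific agreement reflects only the genes called positive in at least one condition.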
Model Setup
However, due to chance variations, p obtained through this approach based on the observed $\widehat{p}$ may have negative components, leading to uninterpretable results. Instead, we propose to estimate the true proportions using an Expectation Maximization (EM) based approach explained in detail in the following subsection.
where M_{1} and M_{2} are the transformation matrices for the first and second conditions respectively.
EM Algorithm
Given a vector of observed proportions which we obtain from experimental output, for different values of false positive rates and false negative rates, we aim to infer the true proportions. This would give us an idea about how the observed and true proportions differ for different levels of noise in the form of false positives and false negatives. We infer the true proportions from the observed proportions using an EM based approach which we now discuss in detail.
Let us consider the binding patterns for a transcription factor in experimental conditions c_{1} and c_{2}. We define the vector for the true binary binding pattern of a particular gene G as b = (b_{1}, b_{2}), where b_{1} and b_{2} take binary value 1 or 0 depending on whether the gene is a true binding target for the transcription factor in c_{1} and c_{2} respectively. Thus, for the experimental conditions c_{1} and c_{2}, this binary binding pattern vector can take four possible values, {(0, 0), (0, 1), (1, 0), (1, 1)}. For example, a binary binding pattern vector equal to (1, 1) indicates that the gene is a binding target for the transcription factor in both c_{1} and c_{2}. We aim to infer this true binary binding pattern for all the genes and thus obtain the true binary counts. Due to experimental errors, we have the observed counts as the experimental output.
In this article, we propose to estimate P(b) by maximizing P(g_{1}, g_{2}, ..., g_{N}) using the EM algorithm, treating b as the missing data, as follows.
where ${b}_{1}^{(m)}$ and ${b}_{2}^{(m)}$ are the first and second components of the estimate b^{(m)}and take binary value 0 or 1.
Thus, for each gene, we start with a set of estimates P(b^{(m)}) and obtain estimates of the posterior probabilities P(b^{(m)}|g) for each gene at the E-step.
M-Step: In the Maximization step, the parameters P(b) are re-estimated to maximize the likelihood of the complete data. After obtaining P(b^{(m)}|g) for each gene, we cross-tabulate a two-way contingency table, with the "count" for each of the four values {(0, 0), (0, 1), (1, 0), (1, 1)} being the sum of the posterior probabilities for that value across all the genes. These counts are then divided by the number of genes N to obtain updated estimates of P(b). For example, $P(b=(0,0))=\frac{1}{N}\sum_{i=1}^{N}P(b=(0,0)\mid {g}_{i})$.
We iterate between the E-Step and the M-Step until convergence. The convergence criterion was set as: |P(b^{(m)}) - P(b^{(m-1)})| < 10^{-12}.
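The iteration described above can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the authors' implementation: it assumes the same false positive rate and false negative rate in both conditions, assumes errors are independent across conditions so that P(g|b) is a product of per-condition error probabilities, and works with the four observed pattern proportions instead of looping over individual genes (genes with the same observed pattern have identical posteriors). The function name and its arguments are hypothetical.

```python
PATTERNS = [(0, 0), (0, 1), (1, 0), (1, 1)]

def em_true_proportions(observed, fpr, fnr, tol=1e-12, max_iter=100000):
    """Estimate P(b) for the four true patterns from observed pattern weights."""
    # m[g][b]: probability of observing g in one condition when the truth is b.
    m = [[1 - fpr, fnr], [fpr, 1 - fnr]]
    # like[g][b] = P(observed pattern g | true pattern b), assuming independence.
    like = [[m[g1][b1] * m[g2][b2] for (b1, b2) in PATTERNS]
            for (g1, g2) in PATTERNS]
    total = sum(observed)
    prior = [0.25] * 4  # initial estimate of P(b)
    for _ in range(max_iter):
        updated = [0.0] * 4
        for g, weight in enumerate(observed):
            # E-step: posterior P(b | g) by Bayes' rule.
            joint = [prior[b] * like[g][b] for b in range(4)]
            norm = sum(joint)
            for b in range(4):
                updated[b] += weight * joint[b] / norm
        # M-step: average the posteriors over all genes.
        updated = [x / total for x in updated]
        change = max(abs(x - y) for x, y in zip(updated, prior))
        prior = updated
        if change < tol:
            break
    return prior

# Observed proportions for Ste12 (Rich Medium vs Mating inducing) from this article.
observed = [0.9761, 0.0144, 0.0040, 0.0056]
print(em_true_proportions(observed, fpr=0.004, fnr=0.4))
```

Because each update multiplies the current estimate by a nonnegative factor and renormalizes, the estimates stay within [0, 1] and sum to 1, which is the property that motivates using EM here.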
Results and discussion
In this section, we study the effect of false positives and false negatives on inferring regulatory target conservation across conditions/species through both simulations and real data analyses.
We show that for a true odds ratio greater than 1, s < 1/2 and s + t < 1, the partial derivative of the observed odds ratio (OOR) with respect to s, ∂(OOR)/∂s, is negative. These are reasonable assumptions for real data, where the false positive rate is low and the false negative rate is not very high. The denominator of ∂(OOR)/∂s is always positive, as it is a squared quantity. The numerator can be written as:
F = F_{1} * F_{2} * F_{3}
where
F_{1} = (p_{01}p_{10} - p_{00}p_{11})(-1 + s + t),
F_{2} = (1 - s)(p_{01} + p_{10} + 2p_{00}s) + (-p_{01} - p_{10} + 2p_{11} + 2(p_{01} + p_{10})s)t - 2p_{11}t^{2},
F_{3} = (1 - p_{00})(-1 + t)t + p_{00}(-1 + t - s(-2 + s + 2t)).
Let us consider each term separately.
If the true odds ratio is greater than 1 and s + t < 1, then p_{01}p_{10} < p_{00}p_{11} and (-1 + s + t) < 0. Thus, we have F_{1} > 0. F_{2} can be simplified as:
F_{2} = (1 - s)(p_{01} + p_{10} + 2p_{00}s) + (p_{01} + p_{10})(-1 + 2s)t + 2p_{11}t(1 - t).
For s < 1/2 the middle term is negative, but it is dominated by the first: (1 - s)(p_{01} + p_{10}) + (p_{01} + p_{10})(-1 + 2s)t = (p_{01} + p_{10})[(1 - s) - (1 - 2s)t] ≥ (p_{01} + p_{10})s ≥ 0, since t ≤ 1. The remaining terms, 2p_{00}s(1 - s) and 2p_{11}t(1 - t), are nonnegative. Thus, for positive proportions and 0 < s < 1/2, 0 < t < 1, we have F_{2} > 0. F_{3} can be simplified as:
F_{3} = (-1 + t)(t(1 - p_{00}) + p_{00}) - p_{00}s(-2 + s + 2t) = -p_{00}s^{2} + 2p_{00}s(1 - t) + (-1 + t)(t(1 - p_{00}) + p_{00}).
Viewed as a quadratic in s, F_{3} has a negative leading coefficient, -p_{00}, and discriminant D = [2p_{00}(1 - t)]^{2} + 4p_{00}(-1 + t)(t(1 - p_{00}) + p_{00}). Simplifying, we get $D=-4{p}_{00}^{2}(1-t)t-4{p}_{00}(1-t)t(1-{p}_{00})$, which is clearly negative for 0 < t < 1. Hence F_{3} has no real root in s and is negative throughout. Since F_{1} > 0, F_{2} > 0 and F_{3} < 0, F is negative for a true odds ratio greater than 1, s < 1/2 and s + t < 1.
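The monotonic decrease of the observed odds ratio in s can also be checked numerically. The sketch below is only an illustration under stated assumptions: it takes s to be the false positive rate and t the false negative rate, applies the same rates independently in both conditions, and uses hypothetical true proportions (0.95, 0.02, 0.01, 0.02) that are not taken from any dataset in this article.

```python
PATTERNS = [(0, 0), (0, 1), (1, 0), (1, 1)]

def observed_proportions(true_p, s, t):
    """Push true pattern proportions through the per-condition error model."""
    # m[g][b]: probability of observing g in one condition when the truth is b.
    m = [[1 - s, t], [s, 1 - t]]
    return [
        sum(m[g1][b1] * m[g2][b2] * p for (b1, b2), p in zip(PATTERNS, true_p))
        for (g1, g2) in PATTERNS
    ]

def odds_ratio(p):
    p00, p01, p10, p11 = p
    return (p00 * p11) / (p01 * p10)

true_p = [0.95, 0.02, 0.01, 0.02]  # hypothetical; true odds ratio is 95
t = 0.2
rates = [0.0, 0.005, 0.01, 0.02, 0.05]
oors = [odds_ratio(observed_proportions(true_p, s, t)) for s in rates]
for s, value in zip(rates, oors):
    print(f"s = {s:.3f}  observed odds ratio = {value:.2f}")
```

For these inputs the printed observed odds ratios fall steadily as s grows, in line with the sign of the derivative derived above.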
Simulation Results
Real Datasets
We considered two ChIP-chip datasets. Harbison et al. [5] described the binding profiles of 204 transcription factors for S. cerevisiae in Rich medium, and 84 of these transcription factors were also profiled in at least one other experimental condition. In their study, transcription factors were selected for profiling in a particular environment if they were essential for growth in that environment, or if there was other evidence suggesting their role in gene regulation in that environment. Borneman et al. [8] studied the divergence of binding sites of the regulators Ste12 and Tec1 in the yeasts S. cerevisiae, S. mikatae and S. bayanus under pseudohyphal conditions. They listed genes which showed differing degrees of conservation across the three species, i.e. genes which were targets in only one species, targets in two species, and targets in all three species.
For ChIP-chip data from Harbison et al. [5], we focussed on the binding data for the transcription factors Ste12 and Tec1 in three different experimental conditions – Rich medium, Filamentation inducing and Mating inducing. We used a p-value threshold of 0.001 to obtain the binding targets for these two regulators. For a pair of experimental conditions, we cross-tabulated the binding targets and created a 2 by 2 contingency table. The odds ratio was quite high, hence we used the log odds ratio and positive specific agreement as metrics to summarize the contingency tables. Thus, given the observed log odds ratio and observed positive specific agreement, for different values of the false positive rate and false negative rate, we inferred the underlying true log odds ratio and the true positive specific agreement using our EM based approach.
In the section describing the model setup, we stated that multiplication of the vector of the observed proportions with the inverse of the transformation matrix could lead to inferred true proportions with negative components. Here we illustrate the scenario. For the regulator Ste12, in Rich Medium and Mating inducing condition the vector of the observed proportions is p = (0.9761, 0.0144, 0.0040, 0.0056)^{ t }. For a false positive rate of 0.001 and a false negative rate of 0.2, multiplying p by the inverse of the transformation matrix results in the vector of the inferred true proportions $\widehat{p}$ = (0.9743, 0.0150, 0.0020, 0.0087)^{ t }. However, for a false positive rate of 0.002 and a false negative rate of 0.3, the vector of the inferred true proportions is $\widehat{p}$ = (0.9748, 0.0144, -0.0005, 0.0113)^{ t }. Similarly, for a false positive rate of 0.004 and a false negative rate of 0.4, the vector of the inferred true proportions is $\widehat{p}$ = (0.9793, 0.0113, -0.0061, 0.0154)^{ t }. Thus, the inferred true proportions obtained by simply multiplying the observed proportions with the inverse of the transformation matrix could contain negative components.
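The direct inversion illustrated in this paragraph can be reproduced with a short calculation. This is a hedged sketch, not the authors' code: it assumes the per-condition transformation matrix is [[1 - FPR, FNR], [FPR, 1 - FNR]] with the same rates in both conditions, and that the two-condition matrix is the Kronecker product of the per-condition matrices, so its inverse is the Kronecker product of the 2 by 2 inverses. The function names are illustrative.

```python
PATTERNS = [(0, 0), (0, 1), (1, 0), (1, 1)]

def inverse_error_matrix(fpr, fnr):
    """Closed-form inverse of [[1 - fpr, fnr], [fpr, 1 - fnr]]."""
    det = 1 - fpr - fnr
    return [[(1 - fnr) / det, -fnr / det],
            [-fpr / det, (1 - fpr) / det]]

def invert_observed(observed, fpr, fnr):
    """Multiply the observed proportions by the inverse transformation matrix."""
    inv = inverse_error_matrix(fpr, fnr)
    return [
        sum(inv[b1][g1] * inv[b2][g2] * o
            for (g1, g2), o in zip(PATTERNS, observed))
        for (b1, b2) in PATTERNS
    ]

# Observed proportions for Ste12 (Rich Medium vs Mating inducing) from the text.
observed = [0.9761, 0.0144, 0.0040, 0.0056]
for fpr, fnr in [(0.001, 0.2), (0.002, 0.3), (0.004, 0.4)]:
    inferred = invert_observed(observed, fpr, fnr)
    print(fpr, fnr, [round(x, 4) for x in inferred])
```

Under these assumptions the third component turns negative for the second and third rate pairs, which is the behaviour the EM-based estimate is designed to avoid.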
Inferred true proportions of target genes of Ste12 in the Rich Medium and Mating Inducing conditions.
FPR | Proportion | FNR = 0.20 | FNR = 0.25 | FNR = 0.30 | FNR = 0.35 | FNR = 0.40
---|---|---|---|---|---|---
0.001 | p_{00} | 0.974 | 0.973 | 0.972 | 0.971 | 0.969
 | p_{01} | 0.015 | 0.015 | 0.016 | 0.016 | 0.017
 | p_{10} | 0.002 | 0.002 | 0.001 | 0.001 | 0.001
 | p_{11} | 0.009 | 0.010 | 0.011 | 0.013 | 0.014
0.002 | p_{00} | 0.977 | 0.976 | 0.974 | 0.973 | 0.971
 | p_{01} | 0.014 | 0.014 | 0.014 | 0.015 | 0.015
 | p_{10} | 0.001 | 0.001 | 0.001 | 0 | 0
 | p_{11} | 0.009 | 0.010 | 0.011 | 0.012 | 0.013
0.003 | p_{00} | 0.979 | 0.978 | 0.976 | 0.975 | 0.973
 | p_{01} | 0.013 | 0.013 | 0.013 | 0.014 | 0.014
 | p_{10} | 0 | 0 | 0 | 0 | 0
 | p_{11} | 0.009 | 0.009 | 0.010 | 0.011 | 0.013
0.004 | p_{00} | 0.980 | 0.979 | 0.978 | 0.977 | 0.975
 | p_{01} | 0.011 | 0.012 | 0.012 | 0.012 | 0.013
 | p_{10} | 0 | 0 | 0 | 0 | 0
 | p_{11} | 0.008 | 0.009 | 0.010 | 0.011 | 0.012
0.005 | p_{00} | 0.982 | 0.981 | 0.980 | 0.979 | 0.977
 | p_{01} | 0.010 | 0.011 | 0.011 | 0.011 | 0.011
 | p_{10} | 0 | 0 | 0 | 0 | 0
 | p_{11} | 0.008 | 0.009 | 0.010 | 0.010 | 0.012
Inferred true log odds ratios for target genes of Ste12 in the Rich Medium and Mating Inducing conditions.
FPR | FNR = 0.20 | FNR = 0.25 | FNR = 0.30 | FNR = 0.35 | FNR = 0.40
---|---|---|---|---|---
0.001 | 5.60 | 5.98 | 6.50 | 7.28 | 8.44
0.002 | 7.24 | 7.55 | 8.54 | 10.11 | 12.18
0.003 | 12.93 | 14.63 | 16.98 | 19.78 | 22.88
0.004 | 25.68 | 29.10 | 32.83 | 36.74 | 40.73
0.005 | 45.21 | 50.03 | 54.92 | 59.77 | 64.51
Inferred positive specific agreement for target genes of Ste12 in the Rich Medium and Mating Inducing conditions.
FPR | FNR = 0.20 | FNR = 0.25 | FNR = 0.30 | FNR = 0.35 | FNR = 0.40
---|---|---|---|---|---
0.001 | 0.50 | 0.54 | 0.57 | 0.60 | 0.63
0.002 | 0.55 | 0.58 | 0.60 | 0.62 | 0.64
0.003 | 0.57 | 0.59 | 0.61 | 0.63 | 0.64
0.004 | 0.59 | 0.60 | 0.62 | 0.64 | 0.65
0.005 | 0.61 | 0.62 | 0.64 | 0.66 | 0.67
Inferred true proportions of target genes of Ste12 under the pseudohyphal condition in S. cerevisiae and S. mikatae.
FPR | Proportion | FNR = 0.20 | FNR = 0.25 | FNR = 0.30 | FNR = 0.35 | FNR = 0.40
---|---|---|---|---|---|---
0.001 | p_{00} | 0.956 | 0.954 | 0.953 | 0.952 | 0.950
 | p_{01} | 0.006 | 0.005 | 0.003 | 0.002 | 0.001
 | p_{10} | 0.015 | 0.014 | 0.013 | 0.012 | 0.011
 | p_{11} | 0.023 | 0.026 | 0.030 | 0.034 | 0.038
0.002 | p_{00} | 0.958 | 0.957 | 0.956 | 0.954 | 0.953
 | p_{01} | 0.005 | 0.004 | 0.002 | 0.001 | 0
 | p_{10} | 0.014 | 0.013 | 0.012 | 0.011 | 0.010
 | p_{11} | 0.023 | 0.026 | 0.030 | 0.034 | 0.038
0.003 | p_{00} | 0.961 | 0.960 | 0.958 | 0.957 | 0.955
 | p_{01} | 0.003 | 0.002 | 0.001 | 0 | 0
 | p_{10} | 0.012 | 0.012 | 0.011 | 0.010 | 0.008
 | p_{11} | 0.024 | 0.026 | 0.030 | 0.033 | 0.037
0.004 | p_{00} | 0.964 | 0.962 | 0.961 | 0.959 | 0.957
 | p_{01} | 0 | 0 | 0 | 0 | 0
 | p_{10} | 0.011 | 0.010 | 0.009 | 0.008 | 0.007
 | p_{11} | 0.024 | 0.026 | 0.029 | 0.032 | 0.036
0.005 | p_{00} | 0.966 | 0.965 | 0.963 | 0.961 | 0.959
 | p_{01} | 0 | 0 | 0 | 0 | 0
 | p_{10} | 0.010 | 0.009 | 0.008 | 0.007 | 0.006
 | p_{11} | 0.024 | 0.026 | 0.029 | 0.032 | 0.035
Inferred true log odds ratios for target genes of Ste12 under the pseudohyphal condition in S. cerevisiae and S. mikatae.
FPR | FNR = 0.20 | FNR = 0.25 | FNR = 0.30 | FNR = 0.35 | FNR = 0.40
---|---|---|---|---|---
0.001 | 5.48 | 5.88 | 6.43 | 7.22 | 8.37
0.002 | 5.88 | 6.26 | 6.93 | 7.92 | 9.35
0.003 | 6.56 | 6.79 | 7.60 | 8.87 | 10.65
0.004 | 8.16 | 8.00 | 8.90 | 10.51 | 12.69
0.005 | 11.07 | 11.36 | 12.67 | 14.75 | 17.36
Inferred positive specific agreement values for target genes of Ste12 under the pseudohyphal condition in S. cerevisiae and S. mikatae.
FPR | FNR = 0.20 | FNR = 0.25 | FNR = 0.30 | FNR = 0.35 | FNR = 0.40
---|---|---|---|---|---
0.001 | 0.69 | 0.73 | 0.78 | 0.83 | 0.87
0.002 | 0.72 | 0.76 | 0.81 | 0.85 | 0.88
0.003 | 0.76 | 0.79 | 0.83 | 0.87 | 0.90
0.004 | 0.81 | 0.83 | 0.86 | 0.88 | 0.91
0.005 | 0.83 | 0.85 | 0.87 | 0.90 | 0.92
Cross-tabulation of orthologous genes in functional category Hydrolase activity.
 | S. mikatae = 0 | S. mikatae = 1
---|---|---
S. cerevisiae = 0 | 551 | 4
S. cerevisiae = 1 | 5 | 5
Cross-tabulation of orthologous genes in functional category Transferase activity.
 | S. mikatae = 0 | S. mikatae = 1
---|---|---
S. cerevisiae = 0 | 491 | 7
S. cerevisiae = 1 | 5 | 5
Cross-tabulation of orthologous genes in functional category Protein binding.
 | S. mikatae = 0 | S. mikatae = 1
---|---|---
S. cerevisiae = 0 | 338 | 3
S. cerevisiae = 1 | 5 | 3
Cross-tabulation of orthologous genes in functional category Transporter activity.
 | S. mikatae = 0 | S. mikatae = 1
---|---|---
S. cerevisiae = 0 | 225 | 3
S. cerevisiae = 1 | 6 | 4
Conclusion
In this article, we have studied the effect of false positives and false negatives in the analysis and interpretation of ChIP-chip data. We have derived a relationship between the observed and the underlying true binary outcomes. Given the observed binary outcome of an experiment, we have developed an EM based approach to infer the underlying true binary outcome for given values of false positive and false negative rates.
A common limitation with finding binding targets from ChIP-chip data is that typically an arbitrary threshold, e.g. 0.001, is applied to the data, and all genes with p-values less than this threshold are considered binding targets. The false positive rate and false negative rate for the binding data change with the threshold applied [5]. Datta and Zhao [11] proposed a statistical procedure to determine the binding targets without imposing a simple threshold to the ChIP-chip data. However, their approach relies on accurate inference of false discovery rate [12], which is a non-trivial task.
To summarize data in contingency tables we utilized two commonly used metrics – Odds Ratio and Positive specific agreement. Both metrics are widely used to study dependency among categorical variables. Since we are interested in quantifying association between two categorical variables, i.e. whether there is association across two different conditions/species, the Odds Ratio and Positive specific agreement are appropriate metrics. In our simulation, we used the chi-squared test of independence to test the null hypothesis that the binding targets are independent. Instead of the chi-squared test, we could also use Fisher's exact test to test the independence assumption on the two-way contingency table. The resulting p-values from both tests indicate the statistical evidence against the independence assumption. However, they do not provide a meaningful summary of the degree of dependence, as they also depend on the sample size.
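The dependence of the test statistic on sample size can be seen in a small calculation. The sketch below is an illustration only: it uses the standard Pearson chi-squared statistic for a 2 by 2 table without continuity correction, takes the counts from the Transporter activity cross-tabulation in this article, and compares them with a hypothetical table in which every count is multiplied by 10.

```python
def chi_squared_statistic(a, b, c, d):
    """Pearson chi-squared statistic for a 2 by 2 table (no continuity correction)."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

def odds_ratio(a, b, c, d):
    return (a * d) / (b * c)

small = (225, 3, 6, 4)               # Transporter activity table
large = tuple(10 * x for x in small)  # hypothetical: same proportions, 10x the genes

for label, table in [("original", small), ("10x larger", large)]:
    print(f"{label:>10}: chi-squared = {chi_squared_statistic(*table):.2f}, "
          f"odds ratio = {odds_ratio(*table):.2f}")
```

The odds ratio is identical for both tables, whereas the chi-squared statistic grows in proportion to the total count, which is why the p-value alone does not summarize the degree of dependence.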
In general, for independently performed real world experiments, such as two separate ChIP-chip experiments, the independence assumption of equation (9) should hold. This is because we can assume that data points in a particular experiment are independent identically distributed random variables. However, for experiments with closely associated results, it is possible that the false positive and false negative data points for the experiments are not entirely independent. This could result in an under-estimation of the underlying association after the EM procedure.
Due to the limited degrees of freedom of the data, our EM algorithm cannot be used to estimate the false positive rate and false negative rate in experimental data. At each step in the EM algorithm, we estimate three parameters, and we have three equations to solve for them. If we also wish to estimate the false positive and false negative rates, we would have two additional parameters, but the number of equations would still be three. This would lead to an identifiability problem.
We initialized the EM algorithm with different initial estimates of the parameters. For each initial estimate, the algorithm converged. The convergence criteria for the EM algorithm require that the log likelihood of the parameters l(b|g) be continuous and differentiable in the parameter space. Unfortunately, the M-step of our algorithm does not have a closed form. Hence, it is difficult to evaluate the gradient of the log likelihood function.
Harbison et al. performed their ChIP-chip experiments using microarrays consisting of spotted polymerase chain reaction (PCR) products representing all the intergenic regions of Saccharomyces cerevisiae. To obtain the binding targets, a p-value threshold was applied to the binding intensities associated with the probes. One of the drawbacks of PCR-based arrays is the low resolution of the DNA elements on the microarray chip. For PCR arrays designed for yeast, the typical resolution achieved is less than 1 kb. In recent years, high density oligonucleotide arrays, comprising large numbers (40,000 to more than 6,000,000) of short oligonucleotides, have been utilized for ChIP-chip studies [13–16]. A number of statistical algorithms have also been developed to determine the binding targets from such large scale tiling arrays [17–20]. Borneman et al. used high density oligonucleotide arrays to perform their experiment. The binding targets were obtained using Tilescope [21]. Since they report the target genes in each organism, we simply used their results to obtain the counts of target genes in each of the three organisms.
Our analysis can be applied to any experimental setting with binary outcomes. However, for the sake of simplicity, we have illustrated its application for ChIP-chip experiments. By applying our algorithm to ChIP-chip data from Harbison et al. and Borneman et al., we observe that for different values of the false positive and false negative rate, the observed and true metrics for the binary data can differ quite dramatically. However, we notice that when the true log odds ratio is greater than 4, i.e. there is a significant degree of association among the binding targets across conditions/species, such differences in the observed and true metrics would not change our inference. On the other hand, our simulation results indicate that when the true odds ratio is close to 1, i.e. for cases when the underlying association is marginal, moderate values of false positive and false negative rates (0.01 and 0.2 respectively) may not be able to provide conclusive evidence of any underlying association or independence.
Declarations
Acknowledgements
This work was supported in part by NSF grant DMS-0714817 and NIH grant GM59507.
References
- Buck MJ, Lieb JD: ChIP-chip: considerations for the design, analysis and application of genome-wide chromatin immunoprecipitation experiments. Genomics 2004, 83: 349–360. 10.1016/j.ygeno.2003.11.004
- Iyer VR, Horak CE, Scafe CS, Botstein D, Snyder M, Brown PO: Genomic binding sites of the yeast cell-cycle transcription factors SBF and MBF. Nature 2001, 409: 533–538. 10.1038/35054095
- Lieb JD, Liu X, Botstein D, Brown PO: Promoter-specific binding of Rap1 revealed by genome-wide maps of protein-DNA association. Nature Genetics 2001, 28: 327–334. 10.1038/ng569
- Ren B, Robert F, Wyrick JJ, Aparicio O, Jennings EG, Simon I, Zeitlinger J, Schreiber J, Hannett N, Kanin E, Volkert TL, Wilson CJ, Bell SP, Young RA: Genome-wide location and function of DNA-binding proteins. Science 2000, 290: 2306–2309. 10.1126/science.290.5500.2306
- Harbison CT, Gordon DB, Lee TI, Rinaldi NJ, Macisaac KD, Danford TW, Hannett NM, Tagne JB, Reynolds DB, Yoo J, Jennings EG, Zeitlinger J, Pokholok DK, Kellis M, Rolfe PA, Takusagawa KT, Lander ES, Gifford DK, Fraenkel E, Young RA: Transcriptional regulatory code of an eukaryotic genome. Nature 2004, 431: 99–104. 10.1038/nature02800
- Beyer A, Workman C, Hollunder J, Radke D, Moller U, Wilhelm T, Ideker T: Integrated assessment and prediction of transcription factor binding. PLOS Computational Biology 2006, 2(6):e70. 10.1371/journal.pcbi.0020070
- Zeitlinger J, Simon I, Harbison CT, Hannett NM, Volkert TL, Fink GR, Young RA: Program-specific distribution of a transcription factor dependent on partner transcription factor and MAPK signaling. Cell 2003, 113: 395–404. 10.1016/S0092-8674(03)00301-5
- Borneman AR, Gianoulis TA, Zhang ZD, Yu H, Rozowsky J, Seringhaus MR, Wang LY, Gerstein M, Snyder M: Divergence of transcription factor binding sites across related yeast species. Science 2007, 317: 815–819. 10.1126/science.1140748
- Agresti A: An introduction to categorical data analysis. New York: Wiley; 1996.
- Fleiss JL: Statistical methods for rates and proportions. 2nd edition. New York: John Wiley; 1981.
- Datta D, Zhao H: Statistical methods to infer cooperative binding among transcription factors in Saccharomyces cerevisiae. Bioinformatics 2008, 24(4):545–552. 10.1093/bioinformatics/btm523
- Efron B: Large-scale simultaneous hypothesis testing: The choice of a null hypothesis. Journal of the American Statistical Association 2004, 99: 96–104. 10.1198/016214504000000089
- Boyer LA, Lee TI, Cole MF, Johnstone SE, Levine SS, Zucker JP, Guenther MG, Kumar RM, Murray HL, Jenner RG, Gifford DK, Melton DA, Jaenisch R, Young RA: Core transcriptional regulatory circuitry in human embryonic stem cells. Cell 2005, 122: 947–956. 10.1016/j.cell.2005.08.020
- Pokholok DK, Harbison CT, Levine S, Cole M, Hannett NM, Lee TI, Bell GW, Walker K, Rolfe PA, Hannett NM, Herbolsheimer E, Zeitlinger J, Lewitter F, Gifford DK, Young RA: Genome-wide map of nucleosome acetylation and methylation in yeast. Cell 2005, 122: 517–527. 10.1016/j.cell.2005.06.026
- Zeitlinger J, Zinzen RP, Stark A, Kellis M, Zhang H, Young RA, Levine M: Whole-genome ChIP-chip analysis of Dorsal, Twist and Snail suggests integration of diverse patterning processes in the Drosophila embryo. Genes & Development 2007, 21: 385–390. 10.1101/gad.1509607
- Zheng Y, Josefowicz SZ, Kas A, Chu TT, Gavin MA, Rudensky AY: Genome-wide analysis of Foxp3 target genes in developing and mature regulatory T cells. Nature 2007, 445: 936–940. 10.1038/nature05563
- Bertone P, Stolc V, Royce TE, Rozowsky JS, Urban AE, Zhu X, Rinn JL, Tongprasit W, Samanta M, Weissman S, Gerstein M, Snyder M: Global identification of human transcribed sequences with genome tiling arrays. Science 2004, 306: 2242–2246. 10.1126/science.1103388
- Ji H, Wong WH: TileMap: create chromosomal map of tiling array hybridizations. Bioinformatics 2005, 21: 3629–3636. 10.1093/bioinformatics/bti593
- Li W, Meyer CA, Liu XS: A hidden Markov model for analyzing ChIP-chip experiments on genome tiling arrays and its application to p53 binding sequences. Bioinformatics 2005, 21: i274-i282. 10.1093/bioinformatics/bti1046
- Du J, Rozowsky JS, Korbel JO, Zhang ZD, Royce TE, Schultz MH, Snyder M, Gerstein M: A supervised hidden markov model framework for efficiently segmenting tiling array data in transcriptional and ChIP-chip experiments: systematically incorporating validated biological knowledge. Bioinformatics 2006, 22(24):3016–3024. 10.1093/bioinformatics/btl515
- Zhang ZD, Rozowsky J, Lam HY, Du J, Snyder M, Gerstein M: Tilescope: online analysis pipeline for high-density tiling microarray data. Genome Biology 2007, 8: R81. 10.1186/gb-2007-8-5-r81
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.