Effect of false positive and false negative rates on inference of binding target conservation across different conditions and species from ChIP-chip data

Datta, Debayan; Zhao, Hongyu

doi:10.1186/1471-2105-10-23

Methodology article
Open access
Published: 19 January 2009


BMC Bioinformatics volume 10, Article number: 23 (2009)


Abstract

Background

ChIP-chip data are routinely used to identify transcription factor binding targets. However, the presence of false positives and false negatives in ChIP-chip data complicates and hinders analyses, especially when the binding targets for a specific transcription factor are compared across conditions or species.

Results

We propose an Expectation Maximization based approach to infer the underlying true counts of "positives" and "negatives" from the observed counts. Based on this approach, we study the effect of false positives and false negatives on inferences related to transcription regulation.

Conclusion

Our results indicate that if there is a significant degree of association among the binding targets across conditions/species (log odds ratio > 4), moderate values of false positive and false negative rates (0.005 and 0.4 respectively) would not change our inference qualitatively (i.e. the presence or absence of conservation) based on the observed experimental data despite a significant change in the observed counts. However, if the underlying association is marginal, with odds ratios close to 1, moderate to large values of false positive and false negative rates (0.01 and 0.2 respectively) could mask the underlying association.

Background

Transcription factors play an important role in gene regulation by binding to specific DNA sequences in the regulatory regions of their targets. Accurate identification of the binding targets of the transcription factors is paramount to the understanding of the regulatory mechanism. Chromatin immunoprecipitation (ChIP) experiments are commonly used to identify the regulatory targets in prokaryotes and eukaryotes. ChIP-chip experiments provide us with information about the binding targets of a particular regulator at the genome level [1–4].

The output of ChIP-chip experiments is often summarized in binary form. Using replicate data, the statistical evidence for a gene being the binding target of a transcription factor is typically summarized as a p-value. A threshold for the p-value, e.g. 0.001, is then chosen, and genes with p-values less than the threshold are considered the binding targets for the transcription factor. Thus, for a transcription factor, we can enumerate a list of genes which are "positives", i.e. binding targets, and a list of genes which are "negatives", i.e. non-binding targets. If the threshold is set at a very stringent level to control the number of false positives, this will be achieved at the expense of a high number of false negatives. A more relaxed threshold will reduce the number of false negatives, but will yield more false positives. Over the past few years, ChIP-chip data have formed the basis of many studies of transcription regulatory mechanisms. Several groups have compared the binding of a regulator across multiple experimental conditions to determine condition dependence of binding [5–7]. Similarly, binding data of specific transcription factors across species have been used to investigate the presence of conserved binding targets [8]. Unfortunately, the presence of noise, in the form of false positives and false negatives as discussed above, may lead to inaccurate inference of the binding targets, and thus biased results and potentially incorrect conclusions on key aspects of transcription regulation, e.g. preservation of regulation targets across conditions and species. In this article, we develop a statistical approach to analyzing ChIP-chip data that appropriately incorporates false positives and false negatives. Based on our approach, we investigate the effect of false positives and false negatives on the inference of conservation of binding targets based on ChIP-chip data.

Methods

Summarizing Contingency Tables

As discussed above, the output of ChIP-chip experiments is typically summarized into binary forms and results across different experiments for the same transcription factor can be crosstabulated into a contingency table. A common question asked is whether a transcription factor has similar binding targets across conditions, and this is reflected as the dependency of outcome among the conditions. In the following, we give a brief discussion on two statistical measures that we will use to summarize the degree of dependency in a contingency table.

For the sake of clarity, we will focus on ChIP-chip experiments involving two different conditions or two species. The number of target genes in the two conditions/species can be cross tabulated into a 2 by 2 contingency table. We use two metrics to summarize such contingency tables – Odds Ratio and Positive specific agreement [9, 10].

Table 1 gives an illustration of a 2 by 2 contingency table. The goal is to identify whether a relationship, or association, exists between the two categorical variables. In our scenario, it would correspond to whether the transcription factor exhibits condition-dependent binding or condition-independent binding. For such a contingency table, the odds ratio is a commonly used measure to quantify association among the categorical variables. An odds is defined as the ratio of the frequency of being in one category to the frequency of not being in that category. For example, from Table 1, the odds that a particular gene is a binding target in experimental condition 1 is equal to (c + d)/(a + b). This odds is called the marginal odds, obtained from the total frequencies in one margin of the table, disregarding the effects of the other variable. Conditional odds are the chances of the transcription factor binding relative to not binding in one experimental condition, given a particular level (binding state) in the other experimental condition. The variables are deemed to be unassociated if the conditional odds are equal or close to each other, and hence equal to the marginal odds. The two conditional odds can be compared directly through a single summary statistic, the odds ratio, obtained by dividing the first conditional odds by the second. Thus, for the data in Table 1, the odds ratio is defined as: Odds Ratio = (a/c)/(b/d) = ad/bc. The odds ratio takes only positive values and has no upper limit. An odds ratio of 1 indicates no relationship among the variables. In addition to the odds ratio, its logarithm is also commonly used. Logarithmic transformation of data has a number of advantages: the variation of log-transformed data tends to be less dependent on the magnitude of values, and taking logs also reduces the skewness of the distributions. After log transformation, data tend to be spread out more evenly, which also makes them easier to examine visually.

Table 1 Simple 2 by 2 contingency table.


Other measures of dependency are also often used in psychological and medical research. For example, the problem can be formulated as follows: Suppose two raters classify each subject in a sample from some target population according to the presence or absence of some characteristic of interest. The resulting data can then be summarized into a 2 by 2 table. The agreement between raters can be quantified by the metric simple agreement, which is defined as the proportion of cases for which both raters agree, or (a + d)/(a + b + c + d). However, if a is large, this would approach 1 regardless of the performance on positive cases. Positive specific agreement provides insight when the positive cases are rare. It estimates the conditional probability that one rater will agree that a case is positive given the other one rated it positive, where the role of the two raters is selected randomly. Positive specific agreement, p_pos, is defined as: p_pos = 2d/(2d + b + c). Both (log) odds ratio and positive specific agreement will be considered in our following discussion.
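The summary measures defined above can be sketched in a few lines of Python. This is only an illustration: the function name `summarize_2x2` is hypothetical, the cell labels follow the formulas in the text (a counts genes that are targets in neither condition, d counts genes that are targets in both, b and c are the discordant cells), and the example counts are the true counts used later in the simulation section.

```python
import math


def summarize_2x2(a, b, c, d):
    """Summary measures for a 2 by 2 table.

    a: negative in both conditions, d: positive in both,
    b and c: the two discordant cells (assumed non-zero here).
    """
    odds_ratio = (a * d) / (b * c)
    return {
        "odds_ratio": odds_ratio,
        "log_odds_ratio": math.log(odds_ratio),  # natural logarithm
        "simple_agreement": (a + d) / (a + b + c + d),
        "positive_specific_agreement": 2 * d / (2 * d + b + c),
    }


# Example counts ordered as (00, 01, 10, 11), taken from the simulation section.
stats = summarize_2x2(800, 100, 80, 20)
for name, value in stats.items():
    print(f"{name}: {value:.4f}")
```

With these counts, simple agreement is dominated by the large a cell, while positive specific agreement depends only on b, c and d, which is the contrast the paragraph above describes.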

Model Setup

Consider an experiment with a binary outcome. Let p₀ denote the proportion of true negatives and p₁ the proportion of true positives. We denote p = (p₀, p₁)^t as the vector of true proportions. Due to false positives and false negatives, the observed proportions likely differ from the true proportions. Let $\hat{p} = {({\hat{p}}_{0}, {\hat{p}}_{1})}^{t}$ denote the vector of the observed proportions. The relationship between p and E( $\hat{p}$ ) can be written as:

\begin{pmatrix} E ({\hat{p}}_{0}) \\ E ({\hat{p}}_{1}) \end{pmatrix} = \begin{pmatrix} 1 - s & t \\ s & 1 - t \end{pmatrix} \begin{pmatrix} p_{0} \\ p_{1} \end{pmatrix},

(1)

where s is the false positive rate and t is the false negative rate. Denoting the transformation matrix as M, Equation (1) can be written as:

E (\hat{p}) = M p .

(2)

Thus, for different values of false positive and false negative rates, different observed proportions will be obtained based on Equation (2). If the false positive and false negative rates are known, the true proportions may be inferred based on the observed experimental proportions. Multiplying both sides of Equation (2) by M^-1 gives us:

M^{- 1} E (\hat{p}) = p .

(3)

However, due to chance variations, p obtained through this approach based on the observed $\hat{p}$ may have negative components, leading to uninterpretable results. Instead, we propose to estimate the true proportions using an Expectation Maximization (EM) based approach explained in detail in the following subsection.

Often, we are interested in the analysis of the binding of a particular transcription factor in multiple experimental conditions or across different species. In either case, we are interested in counts of similarity of binding across conditions or organisms. This would correspond to an extension of Equation (2) into a higher dimension. For simplicity, we present our analysis for a 2-dimensional case. For example, if we consider the binding targets of a transcription factor across two experimental conditions, the vector of true proportions can be represented as p = (p₀₀, p₀₁, p₁₀, p₁₁)^t. Here p₀₀ denotes the proportion of genes which are not targets of the regulator in either condition, p₀₁ denotes the proportion of genes which are targets of the regulator in the second condition but not in the first, p₁₀ denotes the proportion of genes which are targets of the regulator in the first condition but not in the second, while p₁₁ denotes the proportion of genes which are targets of the regulator in both conditions. Similarly, the vector for the observed proportions can be denoted as $\hat{p} = {({\hat{p}}_{00}, {\hat{p}}_{01}, {\hat{p}}_{10}, {\hat{p}}_{11})}^{t}$ . The relationship between the observed and true proportions can be then written as:

E (\hat{p}) = (M \otimes M) p .

(4)

If we consider Equation (2) to correspond to a 1-dimensional case, then for the n-dimensional case the new transformation matrix is obtained by taking the n-fold tensor product of M with itself. Here we assume that the false positive and false negative rates are the same across the two conditions. In general, that may not be the case. In such a scenario, for a 2-dimensional case, Equation (4) takes the general form:

E (\hat{p}) = (M_{1} \otimes M_{2}) p,

(5)

where M₁ and M₂ are the transformation matrices for the first and second conditions respectively.
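As a minimal sketch of Equations (1), (4) and (5), the following Python code builds the transformation matrix for each condition, forms the tensor (Kronecker) product, and applies it to a vector of true proportions ordered as (p₀₀, p₀₁, p₁₀, p₁₁). The helper names (`error_matrix`, `kron`, `matvec`, `expected_observed`) are hypothetical, and the example proportions and rates are borrowed from the simulation section.

```python
def error_matrix(s, t):
    """Transformation matrix M of Equation (1).

    s is the false positive rate, t the false negative rate.
    Column b holds P(observed state | true state = b).
    """
    return [[1 - s, t],
            [s, 1 - t]]


def kron(A, B):
    """Kronecker (tensor) product of two matrices stored as nested lists."""
    return [[a * b for a in row_a for b in row_b]
            for row_a in A for row_b in B]


def matvec(M, v):
    """Matrix-vector product."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def expected_observed(p_true, s1, t1, s2=None, t2=None):
    """E(p_hat) = (M1 (x) M2) p, with p ordered as (p00, p01, p10, p11).

    If s2 and t2 are omitted, both conditions share the same rates,
    which reduces Equation (5) to Equation (4).
    """
    s2 = s1 if s2 is None else s2
    t2 = t1 if t2 is None else t2
    return matvec(kron(error_matrix(s1, t1), error_matrix(s2, t2)), p_true)


# True proportions with odds ratio 2; rates 0.01 and 0.2 as in the simulations.
p_true = [0.8, 0.1, 0.08, 0.02]
print(expected_observed(p_true, 0.01, 0.2))
```

Passing different `s2` and `t2` gives the general form with condition-specific error rates.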

EM Algorithm

Given a vector of observed proportions which we obtain from experimental output, for different values of false positive rates and false negative rates, we aim to infer the true proportions. This would give us an idea about how the observed and true proportions differ for different levels of noise in the form of false positives and false negatives. We infer the true proportions from the observed proportions using an EM based approach which we now discuss in detail.

Let us consider the binding patterns for a transcription factor in experimental conditions c₁ and c₂. We define the vector for the true binary binding pattern of a particular gene G as b = (b₁, b₂), where b₁ and b₂ take binary value 1 or 0 depending on whether the gene is a true binding target for the transcription factor in c₁ and c₂ respectively. Thus, for the experimental conditions c₁ and c₂, this binary binding pattern vector can take four possible values, {(0, 0), (0, 1), (1, 0), (1, 1)}. For example, a binary binding pattern vector equal to (1, 1) indicates that the gene is a binding target for the transcription factor in both c₁ and c₂. We aim to infer this true binary binding pattern for all the genes and thus obtain the true binary counts. Due to experimental errors, we have the observed counts as the experimental output.

We denote the observed binding pattern for a particular gene as g = (g₁, g₂), where each component is either 0 or 1 denoting whether the gene is observed to be the binding target of the particular transcription factor in c₁ and c₂ respectively, based on the experimental output. Thus, the vector g represents the observed data. The probability of the observed binding data is then given by

\begin{aligned} P (g) = {} & P (b = (0, 0)) P (g | b = (0, 0)) + P (b = (0, 1)) P (g | b = (0, 1)) \\ & + P (b = (1, 0)) P (g | b = (1, 0)) + P (b = (1, 1)) P (g | b = (1, 1)) . \end{aligned}

(6)

Thus, for N genes, the probability of the observed data is

P (g_{1}, g_{2}, ..., g_{N}) = \prod_{i = 1}^{N} P (g_{i}) .

(7)

In this article, we propose to estimate P(b) by maximizing P(g₁, g₂, ..., g_N) using the EM algorithm, treating b as the missing data, as follows.

E-Step: In the Expectation step, the conditional distribution of the missing data given the observed data is evaluated. We evaluate the posterior probabilities of each true binding state given the observed binding pattern. Thus, for every gene G with observed binding pattern g, we estimate:

P (b^{(m)} | g) = \frac{P (g | b^{(m)}) P (b^{(m)})}{\sum_{b'} P (g | b') P (b')}

(8)

where P(b^(m)) is the estimate of the probability of the true binding state b at the m-th step, and the sum in the denominator runs over the four possible binding states b'. Since b^(m) can take four possible values, we estimate four probabilities at each step. The probability P(g|b^(m)) can be expanded as:

\begin{aligned} P (g | b^{(m)}) & = P ((g_{1}, g_{2}) | (b_{1}^{(m)}, b_{2}^{(m)})) \\ & = P (g_{1} | b_{1}^{(m)}) P (g_{2} | b_{2}^{(m)}), \end{aligned}

(9)

where $b_{1}^{(m)}$ and $b_{2}^{(m)}$ are the first and second components of the estimate b^(m) and take binary value 0 or 1.

The second equality in (9) results from the independence assumption for the data from two separate ChIP-chip experiments. Thus, the probability of observing g₁ would be independent of the estimate of binding state $b_{2}^{(m)}$ , while the probability of observing g₂ would be independent of the estimate of binding state $b_{1}^{(m)}$ . There are four possible cases for the expression $P (g_{i} | b_{i}^{(m)})$ in Equation (9). From Equation (1), they can be enumerated as:

\begin{array}{ll} P (g_{i} = 0 | b_{i}^{(m)} = 0) = 1 - s, & P (g_{i} = 0 | b_{i}^{(m)} = 1) = t, \\ P (g_{i} = 1 | b_{i}^{(m)} = 0) = s, & P (g_{i} = 1 | b_{i}^{(m)} = 1) = 1 - t . \end{array}

(10)

Thus, for each gene, we start with a set of estimates P(b^(m)) and obtain estimates of the posterior probabilities P(b^(m)|g) for each gene at the E-step.

M-Step: In the Maximization step, the parameters P(b) are re-estimated to maximize the likelihood of the complete data. After obtaining P(b^(m)|g) for each gene, we cross-tabulate a two-way contingency table, with the "count" for each of the four values {(0, 0), (0, 1), (1, 0), (1, 1)} being the sum of the posterior probabilities for that particular value across all the genes. These counts are then used to obtain updated estimates of P(b). For example, $P (b = (0, 0)) = \frac{1}{N} \sum_{i = 1}^{N} P (b = (0, 0) | g_{i})$ .

We iterate between the E-Step and the M-Step until convergence. The convergence criterion was set as: |P(b^(m)) - P(b^(m-1))| < 10^-12.
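The E-step and M-step above can be sketched as follows. This is an illustrative Python version, not the authors' code: the function name `em_true_proportions` and the `max_iter` safety cap are assumptions, and because all genes with the same observed pattern share the same posterior, the sketch works on the four observed counts instead of looping over individual genes. The false positive rate s and false negative rate t are treated as known and are assumed to lie strictly between 0 and 1.

```python
def em_true_proportions(observed_counts, s, t, tol=1e-12, max_iter=100000):
    """Estimate P(b) for b in (00, 01, 10, 11) from observed counts in the same order."""
    M = [[1 - s, t],
         [s, 1 - t]]                       # M[g][b] = P(g | b), Equation (10)
    states = [(0, 0), (0, 1), (1, 0), (1, 1)]
    # lik[k][j] = P(g = states[k] | b = states[j]), Equation (9)
    lik = [[M[g1][b1] * M[g2][b2] for (b1, b2) in states]
           for (g1, g2) in states]
    n = sum(observed_counts)
    prior = [0.25] * 4                     # equal initial weights for the four states

    for _ in range(max_iter):
        updated = [0.0] * 4
        for k, count in enumerate(observed_counts):
            joint = [lik[k][j] * prior[j] for j in range(4)]
            total = sum(joint)
            for j in range(4):
                # E-step: posterior P(b | g), Equation (8), weighted by the
                # number of genes showing observed pattern k.
                updated[j] += count * joint[j] / total
        updated = [x / n for x in updated]  # M-step: average posterior over genes
        converged = max(abs(x - y) for x, y in zip(updated, prior)) < tol
        prior = updated
        if converged:
            break
    return prior


# Noise-free expected observed counts for true proportions (0.8, 0.1, 0.08, 0.02)
# with s = 0.01 and t = 0.2; EM should recover the true proportions.
estimate = em_true_proportions([820.52, 90.48, 74.68, 14.32], s=0.01, t=0.2)
print([round(x, 4) for x in estimate])
```

Because every update is an average of posterior probabilities, the estimates always stay between 0 and 1, which is the property that motivates EM over direct matrix inversion.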

Results and discussion

In this section, we study the effect of false positives and false negatives on inferring regulatory target conservation across conditions/species through both simulations and real data analyses.

We consider an experiment involving the binding of a transcription factor in two different conditions with a total of 1000 genes. We consider the odds ratio as our metric of interest. For fixed true odds ratios, and different values of false positive and false negative rates, we plot the surface of the observed odds ratio in Figure 1. The observed odds ratio is obtained from Equation 4. It can be seen that the observed odds ratio is the largest for low values of false positive and false negative rates and its value decreases with increasing false positive and false negative rates. To visualize this phenomenon in two-dimensions, we fix the false negative rate, and plot the observed odds ratio as the false positive rate varies (Figure 2). We observe that with increasing false positive rates, the observed odds ratio decreases. This is expected, as with an increasing false positive rate, a larger number of true negatives are detected as positives. This reduces the count of genes which are observed negatives in both conditions, i.e. the cell in the contingency table corresponding to "00". Thus, there is a reduction in the observed odds ratio value. Similarly, we observe that for a fixed false positive rate with an increasing false negative rate, the observed odds ratio also decreases. This is also expected, as with increasing false negative rate, a larger number of true positives are detected as negatives. This reduces the number of genes which are observed positives in both conditions, i.e. the cell in the contingency table corresponding to "11", thereby causing a reduction in the observed odds ratio value. To study the effect of asymmetry between p₀₁ and p₁₀, we repeated this simulation for differing values of p₀₁ and p₁₀. We observed similar trends of decreasing observed odds ratios for increasing false positive rate for a fixed false negative rate and a fixed true odds ratio.

In the following, we give an analytical proof for the reduction in the observed odds ratio for increasing false positive rates, with the false negative rate being fixed. Equation 4 can be expanded as:

\begin{pmatrix} E ({\hat{p}}_{00}) \\ E ({\hat{p}}_{01}) \\ E ({\hat{p}}_{10}) \\ E ({\hat{p}}_{11}) \end{pmatrix} = \begin{pmatrix} {(1 - s)}^{2} & (1 - s) t & t (1 - s) & t^{2} \\ (1 - s) s & (1 - s) (1 - t) & t s & t (1 - t) \\ s (1 - s) & s t & (1 - t) (1 - s) & (1 - t) t \\ s^{2} & s (1 - t) & (1 - t) s & {(1 - t)}^{2} \end{pmatrix} \begin{pmatrix} p_{00} \\ p_{01} \\ p_{10} \\ p_{11} \end{pmatrix} .

(11)

The observed odds ratio is:

\mathrm{OOR} = \frac{E ({\hat{p}}_{00}) \, E ({\hat{p}}_{11})}{E ({\hat{p}}_{01}) \, E ({\hat{p}}_{10})}

(12)

where from Equation 11 we get,

\begin{array}{l} E ({\hat{p}}_{00}) = p_{00} {(1 - s)}^{2} + (p_{01} + p_{10}) (1 - s) t + p_{11} t^{2}, \\ E ({\hat{p}}_{01}) = p_{00} (1 - s) s + p_{01} (1 - s) (1 - t) + p_{10} s t + p_{11} (1 - t) t, \\ E ({\hat{p}}_{10}) = p_{00} (1 - s) s + p_{01} s t + p_{10} (1 - s) (1 - t) + p_{11} (1 - t) t, \\ E ({\hat{p}}_{11}) = p_{00} s^{2} + (p_{01} + p_{10}) s (1 - t) + p_{11} {(1 - t)}^{2} . \end{array}

We show that for a true odds ratio greater than 1 and for s < 1/2 and s + t < 1, ∂ (OOR)/∂s is negative. These are reasonable assumptions for real data, where the false positives are low and false negatives are not very high. The denominator of ∂ (OOR)/∂s is always positive as it is a squared number. The numerator can be written as:

F = F₁ · F₂ · F₃

where,

F₁ = (p₀₁p₁₀ - p₀₀p₁₁)(-1 + s + t),

F₂ = (1 - s)(p₀₁ + p₁₀ + 2p₀₀s) + (-p₀₁ - p₁₀ + 2p₁₁ + 2(p₀₁ + p₁₀)s)t - 2p₁₁t²,

F₃ = (1 - p₀₀)(-1 + t)t + p₀₀(-1 + t - s(-2 + s + 2t)).

Let us consider each term separately.

If the true odds ratio is greater than 1 and s + t < 1, then p₀₁p₁₀ < p₀₀p₁₁ and (-1 + s + t) < 0. Thus, we have F₁ > 0. F₂ can be simplified as:

F₂ = (1 - s)(p₀₁ + p₁₀ + 2p₀₀s) + (p₀₁ + p₁₀)(-1 + 2s)t + 2p₁₁t(1 - t).

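The middle term is negative when s < 1/2, but collecting the terms in (p₀₁ + p₁₀) gives F₂ = (p₀₁ + p₁₀)[(1 - s)(1 - t) + st] + 2p₀₀s(1 - s) + 2p₁₁t(1 - t), in which no term is negative for s and t between 0 and 1. Thus, for s < 1/2, F₂ > 0. F₃ can be simplified as: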

F₃ = (-1 + t)(t(1 - p₀₀) + p₀₀) - p₀₀s(-2 + s + 2t) = -p₀₀s² + 2p₀₀s(1 - t) + (-1 + t)(t(1 - p₀₀) + p₀₀).

Thus, F₃ is a quadratic function of s. Since the coefficient of s² is negative, if the discriminant of the quadratic is negative, F₃ is always negative. The discriminant D is given by:

D = 4 p_{00}^{2} {(1 - t)}^{2} + 4 p_{00} (- 1 + t) (t (1 - p_{00}) + p_{00})

Simplifying, we get $D = - 4 p_{00}^{2} (1 - t) t - 4 p_{00} (1 - t) t (1 - p_{00})$ , which is clearly negative, so F₃ < 0 for all s. Since F₁ > 0 and F₂ > 0, F is negative, and hence ∂ (OOR)/∂s < 0, for true odds ratio greater than 1, s < 1/2 and s + t < 1.
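The analytical result can be checked numerically. The sketch below is an illustration only: it evaluates the observed odds ratio from the expanded form of Equation (11) on a grid of false positive rates with the false negative rate held fixed. The function name `observed_odds_ratio`, the grid, and the example proportions (true odds ratio 2, as in the simulation section) are assumptions chosen for the example.

```python
def observed_odds_ratio(p, s, t):
    """Observed odds ratio of Equation (12) for true proportions p = (p00, p01, p10, p11)."""
    p00, p01, p10, p11 = p
    e00 = p00 * (1 - s) ** 2 + (p01 + p10) * (1 - s) * t + p11 * t ** 2
    e01 = p00 * (1 - s) * s + p01 * (1 - s) * (1 - t) + p10 * s * t + p11 * (1 - t) * t
    e10 = p00 * (1 - s) * s + p01 * s * t + p10 * (1 - s) * (1 - t) + p11 * (1 - t) * t
    e11 = p00 * s ** 2 + (p01 + p10) * s * (1 - t) + p11 * (1 - t) ** 2
    return e00 * e11 / (e01 * e10)


p_true = (0.8, 0.1, 0.08, 0.02)           # true odds ratio = 2
t = 0.2                                    # fixed false negative rate
s_grid = [i * 0.05 for i in range(10)]     # s = 0.00, 0.05, ..., 0.45, so s < 1/2 and s + t < 1
oors = [observed_odds_ratio(p_true, s, t) for s in s_grid]

for s, oor in zip(s_grid, oors):
    print(f"s = {s:.2f}  observed odds ratio = {oor:.4f}")
```

On this grid the observed odds ratio falls steadily toward 1 as s grows, in line with the sign of the derivative derived above.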

Simulation Results

To study the effect of false positives and false negatives on statistical inference of dependence between two conditions, we consider a similar setting in which the binding of a regulator in two conditions is studied for 1000 genes. We simulated data for a fixed true odds ratio, and fixed the false positive and false negative rates. We randomly added false positives and false negatives to the data based on the false positive rate and false negative rate. This manifests itself as the observed data, and we repeated this 1000 times. We performed a chi-squared test for independence between the two conditions and counted the number of times the null hypothesis was rejected at a significance level of 0.001. Further, for each observed dataset, we inferred back the true data using our EM algorithm. The inferred true counts are almost equal to the true counts before false positives and false negatives were randomly added. This is because the false positives and false negatives were randomly added based on the fixed false positive rate and false negative rate. These fixed rates are used in our EM algorithm to obtain the inferred true counts. For example, for a true odds ratio of 2, the vector of true counts was (800, 100, 80, 20)^t. We fixed the false positive rate to 0.01 and the false negative rate to 0.2. The vector of inferred true counts was determined to be (799.87, 101.92, 78.55, 20.66)^t. The EM algorithm was initialized by giving equal weights to each possible true binding pattern for each gene. We used the chi-squared test and counted the number of times the null hypothesis was rejected for the inferred true data at the same level of significance. We repeated this analysis for different values of the true odds ratio and different values of false positive and false negative rates. Figure 3 shows the plot of the number of null rejections versus the odds ratio for both the observed and inferred true data. 
Our results indicate that the number of null rejections for the inferred true data is consistently larger than that for the observed data. We also note that as the odds ratio increases, the difference between the number of null rejections for the inferred true data and observed data also increases. Instead of using the chi-squared test, we also used thresholds for the odds ratio and positive specific agreement to ascertain the number of null rejections for observed and inferred true data. Thus, for each simulation, we rejected the null hypothesis if the odds ratio was greater than some threshold. We repeated this by applying a threshold to the positive specific agreement. The results are shown in Figures 4 and 5. Thus, in addition to the chi-squared test, thresholds for the odds ratio and positive specific agreement also provide evidence that the number of null rejections for the inferred true data is consistently larger than that for the observed data.
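The observed-data half of this simulation can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' code: the function names are hypothetical, the chi-squared statistic is computed without continuity correction (the text does not say which variant was used), the critical value 10.828 is the standard 1-degree-of-freedom cutoff for a significance level of 0.001, and the re-inference of true counts by EM is left out.

```python
import random

CHI2_CRIT_1DF_0001 = 10.828  # chi-squared critical value, 1 df, significance level 0.001


def chi2_2x2(n00, n01, n10, n11):
    """Pearson chi-squared statistic for a 2 by 2 table (no continuity correction)."""
    n = n00 + n01 + n10 + n11
    den = (n00 + n01) * (n10 + n11) * (n00 + n10) * (n01 + n11)
    if den == 0:
        return 0.0  # degenerate table with an empty margin
    return n * (n00 * n11 - n01 * n10) ** 2 / den


def add_noise(true_counts, s, t, rng):
    """Flip each gene's true state independently per condition.

    A true 0 becomes an observed 1 with probability s (false positive);
    a true 1 becomes an observed 0 with probability t (false negative).
    Counts are ordered as (00, 01, 10, 11).
    """
    states = [(0, 0), (0, 1), (1, 0), (1, 1)]
    observed = {state: 0 for state in states}
    for state, count in zip(states, true_counts):
        for _ in range(count):
            g = tuple(
                (1 if rng.random() < s else 0) if b == 0
                else (0 if rng.random() < t else 1)
                for b in state
            )
            observed[g] += 1
    return [observed[state] for state in states]


def count_rejections(true_counts, s, t, n_sim=1000, seed=0):
    """Number of simulated observed tables for which independence is rejected."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_sim):
        obs = add_noise(true_counts, s, t, rng)
        if chi2_2x2(*obs) > CHI2_CRIT_1DF_0001:
            rejections += 1
    return rejections


if __name__ == "__main__":
    # True counts for a true odds ratio of 2, with the rates used in the text.
    print(count_rejections([800, 100, 80, 20], s=0.01, t=0.2))
```

To reproduce the comparison with inferred true data, each simulated table would additionally be passed through the EM procedure before testing.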

Real Datasets

We considered two ChIP-chip datasets. Harbison et al. [5] described the binding profiles of 204 transcription factors for S. cerevisiae in Rich medium, and 84 of these transcription factors were also profiled in at least one other experimental condition. In their study, transcription factors were selected for profiling in a particular environment if they were essential for growth in that environment, or if there was other evidence suggesting their role in gene regulation in that environment. Borneman et al. [8] studied the divergence of binding sites of regulators Ste12 and Tec1 in the yeasts S. cerevisiae, S. mikatae and S. bayanus under pseudohyphal conditions. They listed genes which showed differing degrees of conservation across the three species, i.e. genes which were targets in only one species, targets in two species, and targets in all three species.

For ChIP-chip data from Harbison et al. [5], we focussed on the binding data for the transcription factors Ste12 and Tec1 in three different experimental conditions – Rich medium, Filamentation inducing and Mating inducing. We used a p-value threshold of 0.001 to obtain the binding targets for these two regulators. For a pair of experimental conditions, we cross-tabulated the binding targets and created a 2 by 2 contingency table. The odds ratio was quite high, hence we used the log odds ratio and positive specific agreement as metrics to summarize the contingency tables. Thus, given the observed log odds ratio and observed positive specific agreement, for different values of the false positive rate and false negative rate, we inferred the underlying true log odds ratio and the true positive specific agreement using our EM based approach.

In the section describing the model setup, we stated that multiplication of the vector of the observed proportions with the inverse of the transformation matrix could lead to inferred true proportions with negative components. Here we illustrate the scenario. For the regulator Ste12, in the Rich Medium and Mating inducing conditions, the vector of the observed proportions is $\hat{p}$ = (0.9761, 0.0144, 0.0040, 0.0056)^t. For a false positive rate of 0.001 and a false negative rate of 0.2, multiplying $\hat{p}$ by the inverse of the transformation matrix results in the vector of the inferred true proportions p = (0.9743, 0.0150, 0.0020, 0.0087)^t. However, for a false positive rate of 0.002 and a false negative rate of 0.3, the vector of the inferred true proportions is p = (0.9748, 0.0144, -0.0005, 0.0113)^t. Similarly, for a false positive rate of 0.004 and a false negative rate of 0.4, the vector of the inferred true proportions is p = (0.9793, 0.0113, -0.0061, 0.0154)^t. Thus, the inferred true proportions obtained by simply multiplying the observed proportions with the inverse of the transformation matrix could contain negative components.
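The inversion in this illustration can be reproduced with a short Python sketch. The helper names (`inverse_2x2`, `kron`, `invert_observed`) are hypothetical; the observed proportions and the three pairs of rates are those quoted above, and the sketch relies on the identity that the inverse of M ⊗ M equals M⁻¹ ⊗ M⁻¹.

```python
def inverse_2x2(M):
    """Inverse of a 2 by 2 matrix stored as nested lists."""
    (a, b), (c, d) = M
    det = a * d - b * c
    return [[d / det, -b / det],
            [-c / det, a / det]]


def kron(A, B):
    """Kronecker (tensor) product of two matrices stored as nested lists."""
    return [[a * b for a in row_a for b in row_b]
            for row_a in A for row_b in B]


def invert_observed(p_hat, s, t):
    """Apply the inverse of (M (x) M) to observed proportions (p00, p01, p10, p11)."""
    M_inv = inverse_2x2([[1 - s, t],
                         [s, 1 - t]])
    K = kron(M_inv, M_inv)  # inverse of the Kronecker product
    return [sum(k * x for k, x in zip(row, p_hat)) for row in K]


# Observed proportions for Ste12 in the Rich Medium and Mating inducing conditions.
p_hat = [0.9761, 0.0144, 0.0040, 0.0056]
for s, t in [(0.001, 0.2), (0.002, 0.3), (0.004, 0.4)]:
    inferred = invert_observed(p_hat, s, t)
    print(s, t, [round(x, 4) for x in inferred], "negative component:", min(inferred) < 0)
```

The third component turns negative for the two larger pairs of rates, which is the failure mode that the EM-based estimate avoids.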

Table 2 shows how the inferred true proportions change for different values of the false positive rate and false negative rate. The vector of the observed proportions is (0.976, 0.014, 0.004, 0.006)^t. Table 3 gives the calculation of the inferred true odds ratios from the inferred true proportions. Figure 6 shows how the surface of the inferred true log odds ratio varies with different values of false positive rate and false negative rate. We notice that as the false positive rate and false negative rate increase, the inferred true log odds ratio differs quite significantly from the observed log odds ratio. We observe similar trends when we use positive specific agreement as the metric of interest (Table 4 and Figure 7). Harbison et al. reported that the false discovery rate in their data was likely to be approximately 4%, while the false negative rate was around 24% for a p-value threshold of 0.001. For binding data from Harbison et al., typically the number of "negatives" was close to 6000, while the number of "positives" was about 100 to 200 at a p-value threshold of 0.001. Thus, the false positive rate was close to 0.001. For our analysis, we studied the variation of observed and true outcomes by varying the false positive rate from 0.001 to 0.005, and the false negative rate from 0.2 to 0.4. Thus, our range of false positive rates would correspond to about 6 to 30 false positives, and about 20 (100 * 0.2) to 80 (200 * 0.4) false negatives, which appears to be quite reasonable. From Table 3, we see that for a false positive rate of 0.001 and false negative rate of 0.20, the inferred true log odds ratio is 5.60, while the observed log odds ratio is 4.56. Since both the log odds ratios are quite high, our inference of association among the two experimental conditions would not be affected by these values of the false positive rate and false negative rate.

Table 2 Inferred true proportions of target genes of Ste12 in the Rich Medium and Mating Inducing conditions.

Table 3 Inferred true log odds ratios for target genes of Ste12 in the Rich Medium and Mating Inducing conditions.

Table 4 Inferred positive specific agreement for target genes of Ste12 in the Rich Medium and Mating Inducing conditions.

We also analyzed the results of ChIP-chip experiments performed by Borneman et al. [8]. We obtained the counts of genes which were the binding targets of the regulators Ste12 and Tec1 in one, two and all three species. We repeated our analysis as described in the previous paragraph for a pair of species (Tables 5, 6 and 7; Figures 8 and 9). Here too, we notice a considerable difference between the observed and inferred outcomes as the false positive rate and false negative rate increase. For example, for a false positive rate of 0.001 and a false negative rate of 0.2, compared to the observed log odds ratio of 4.50, the inferred true log odds ratio is 5.48; however, for a false positive rate of 0.005 and a false negative rate of 0.4, the inferred true log odds ratio is as high as 17.36. Further, we attempted to test the notion that genes falling under similar functional categories tend to be the conserved binding targets across the three species. We listed all the orthologous genes in yeast. For all these genes, we used the SGD GO Slim finder (http://db.yeastgenome.org/cgi-bin/GO/goSlimMapper.pl) to categorize the genes into broad functional categories. For the top categories which contained the largest number of genes, we cross-tabulated the genes and created a 2 by 2 contingency table based on counts of the genes which are binding targets (Tables 8, 9, 10, 11). For the genes falling in major categories, we notice that the log odds ratios are high, indicating a considerable degree of binding conservation. For example, for the 565 genes found to be enriched for Hydrolase activity, 551 were not the binding targets of Ste12 in either S. cerevisiae or S. mikatae. Of the 14 genes which were the binding targets in at least one of the two species, 5 (YNL053W, YDR452W, YGL163C, YIL118W, YMR305C) were the binding targets in both species. Of the remaining 9 genes, 5 (YIR027C, YER133W, YNL180C, YOR049C, YDL047C) were targets in only S. cerevisiae, while 4 (YNL141W, YHR005C, YOR126C, YOL011W) were targets in only S. mikatae.

Table 5 Inferred true proportions of target genes of Ste12 under the pseudohyphal condition in S. cerevisiae and S. mikatae.

Full size table

Table 6 Inferred true log odds ratios for target genes of Ste12 under the pseudohyphal condition in S. cerevisiae and S. mikatae.

Full size table

Table 7 Inferred positive specific agreement values for target genes of Ste12 under the pseudohyphal condition in S. cerevisiae and S. mikatae.

Full size table

Table 8 Cross-tabulation of orthologous genes in functional category Hydrolase activity.

Full size table

Table 9 Cross-tabulation of orthologous genes in functional category Tranferase activity.


Table 10 Cross-tabulation of orthologous genes in functional category Protein binding.


Table 11 Cross-tabulation of orthologous genes in functional category Transporter activity.


Conclusion

In this article, we have studied the effect of false positives and false negatives in the analysis and interpretation of ChIP-chip data. We have derived a relationship between the observed and the underlying true binary outcomes. Given the observed binary outcome of an experiment, we have developed an EM based approach to infer the underlying true binary outcome for given values of false positive and false negative rates.
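
One way to picture this inference step is as a latent-class EM over the four true cells of the 2 by 2 table. The sketch below is not the authors' algorithm (theirs has an M-step without a closed form, as noted later in this section). It is a simplified stand-in that assumes the same known false positive rate `fp` and false negative rate `fn` in both experiments and errors that are independent across experiments. All names are hypothetical.

```python
from itertools import product

STATES = list(product((0, 1), repeat=2))  # (call in experiment 1, call in experiment 2)

def error_prob(observed, true, fp, fn):
    """P(observed call | true state) for a single experiment."""
    if true == 1:
        return 1 - fn if observed == 1 else fn
    return fp if observed == 1 else 1 - fp

def em_true_proportions(counts, fp, fn, n_iter=5000, tol=1e-12):
    """Estimate true cell proportions from observed counts.

    counts maps each observed cell (o1, o2) to its count; fp and fn must lie
    strictly between 0 and 1. Returns a dict mapping true cells (t1, t2) to
    estimated proportions.
    """
    total = sum(counts.values())
    pi = {t: 0.25 for t in STATES}  # uniform starting point
    for _ in range(n_iter):
        new_pi = {t: 0.0 for t in STATES}
        for o in STATES:
            # E-step: posterior over true cells given the observed cell o.
            joint = {
                t: pi[t] * error_prob(o[0], t[0], fp, fn) * error_prob(o[1], t[1], fp, fn)
                for t in STATES
            }
            norm = sum(joint.values())
            for t in STATES:
                new_pi[t] += counts[o] * joint[t] / norm
        # M-step: expected true counts divided by the table total.
        new_pi = {t: v / total for t, v in new_pi.items()}
        change = max(abs(new_pi[t] - pi[t]) for t in STATES)
        pi = new_pi
        if change < tol:
            break
    return pi

# Usage with the Hydrolase activity counts quoted in the text
# (first index: S. cerevisiae, second index: S. mikatae).
observed_counts = {(1, 1): 5, (1, 0): 5, (0, 1): 4, (0, 0): 551}
estimate = em_true_proportions(observed_counts, fp=0.001, fn=0.2)
for cell in STATES:
    print(cell, round(estimate[cell], 5))
```

Because the rates are held fixed, only the true cell proportions are updated; the rates themselves are inputs, not estimates.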

A common limitation in finding binding targets from ChIP-chip data is that an arbitrary threshold, e.g. 0.001, is typically applied to the data, and all genes with p-values less than this threshold are considered binding targets. The false positive rate and false negative rate for the binding data change with the threshold applied [5]. Datta and Zhao [11] proposed a statistical procedure to determine the binding targets without imposing a simple threshold on the ChIP-chip data. However, their approach relies on accurate inference of the false discovery rate [12], which is a non-trivial task.

To summarize data in contingency tables we used two common metrics: the odds ratio and positive specific agreement. Both are widely used to study dependency among categorical variables. Since we are interested in quantifying the association between two categorical variables, i.e. whether there is association across two different conditions/species, the odds ratio and positive specific agreement are appropriate metrics. In our simulation, we used the chi-squared test of independence to test the null hypothesis that the binding targets are independent. Instead of the chi-squared test, we could also use Fisher's exact test to test the independence assumption on the two-way contingency table. The resulting p-values from both tests indicate the statistical evidence against the independence assumption. However, they do not provide a meaningful summary of the degree of dependence, because they also depend on the sample size.
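
The dependence of the test statistic on sample size can be seen in a small worked example. This is a sketch with hypothetical counts (not data from the article), using the Pearson chi-squared statistic without continuity correction: multiplying every cell by 10 leaves the odds ratio unchanged but multiplies the statistic by 10.

```python
def odds_ratio(a, b, c, d):
    """Odds ratio of the 2 by 2 table [[a, b], [c, d]]."""
    return (a * d) / (b * c)

def chi_squared_statistic(a, b, c, d):
    """Pearson chi-squared statistic (no continuity correction) for [[a, b], [c, d]]."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

table = (30, 20, 20, 30)               # hypothetical counts
scaled = tuple(10 * x for x in table)  # same proportions, ten times the sample size

print(odds_ratio(*table), chi_squared_statistic(*table))    # 2.25 4.0
print(odds_ratio(*scaled), chi_squared_statistic(*scaled))  # 2.25 40.0
```

The p-value shrinks as the statistic grows, even though the strength of association, as measured by the odds ratio, is identical in the two tables.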

In general, for independently performed real world experiments, such as two separate ChIP-chip experiments, the independence assumption of equation (9) should hold. This is because we can assume that the data points in a particular experiment are independent and identically distributed random variables. However, for experiments with closely associated results, it is possible that the false positive and false negative data points of the two experiments are not entirely independent. This could result in an under-estimation of the underlying association after the EM procedure.

Due to the limited degrees of freedom of the data, our EM algorithm cannot be used to estimate the false positive rate and false negative rate in experimental data. At each step in the EM algorithm, we estimate three parameters, and we have three equations to solve for them. If we also wish to estimate the false positive and false negative rates, we would have two additional parameters, but the number of equations would still be three. This would lead to an identifiability problem.
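
The identifiability problem can be illustrated numerically. The sketch below assumes the same false positive rate `fp` and false negative rate `fn` in both experiments and errors that are independent across experiments; the proportions and rates are hypothetical. It shows two different combinations of true proportions and error rates that produce exactly the same observed 2 by 2 distribution, so the observed table alone cannot distinguish them.

```python
from itertools import product

STATES = list(product((0, 1), repeat=2))  # (call in experiment 1, call in experiment 2)

def call_prob(observed, true, fp, fn):
    """P(observed call | true state) for a single experiment."""
    if true == 1:
        return 1 - fn if observed == 1 else fn
    return fp if observed == 1 else 1 - fp

def inverse_weight(true, observed, fp, fn):
    """Entry of the inverse of the single-experiment error matrix."""
    d = 1 - fp - fn
    if true == 1:
        return (1 - fp) / d if observed == 1 else -fp / d
    return -fn / d if observed == 1 else (1 - fn) / d

def observed_from_true(pi, fp, fn):
    """Observed cell probabilities implied by true proportions pi."""
    return {
        o: sum(pi[t] * call_prob(o[0], t[0], fp, fn) * call_prob(o[1], t[1], fp, fn)
               for t in STATES)
        for o in STATES
    }

def true_from_observed(p, fp, fn):
    """True proportions obtained by inverting the error model (may be invalid if negative)."""
    return {
        t: sum(p[o] * inverse_weight(t[0], o[0], fp, fn) * inverse_weight(t[1], o[1], fp, fn)
               for o in STATES)
        for t in STATES
    }

# Scenario A: hypothetical true proportions with fp = 0.05 and fn = 0.2.
pi_a = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
p_observed = observed_from_true(pi_a, fp=0.05, fn=0.2)

# Scenario B: different rates (fp = 0.02, fn = 0.1) explain the same observed table.
pi_b = true_from_observed(p_observed, fp=0.02, fn=0.1)
p_check = observed_from_true(pi_b, fp=0.02, fn=0.1)

for cell in STATES:
    print(cell, round(pi_a[cell], 4), round(pi_b[cell], 4), round(p_check[cell] - p_observed[cell], 12))
```

Both scenarios fit the observed distribution equally well while implying different true proportions, which is why the rates have to be supplied rather than estimated from a single table.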

We initialized the EM algorithm with different initial estimates of the parameters, and the algorithm converged for each of them. The convergence criteria for the EM algorithm require that the log likelihood of the parameters l(b|g) be continuous and differentiable in the parameter space. Unfortunately, the M-step of our algorithm does not have a closed form. Hence, it is difficult to evaluate the gradient of the log likelihood function.

Harbison et al. performed their ChIP-chip experiments using microarrays consisting of spotted polymerase chain reaction (PCR) products representing all the intergenic regions of Saccharomyces cerevisiae. To obtain the binding targets, a p-value threshold was applied to the binding intensities associated with the probes. One of the drawbacks of PCR based arrays is the low resolution of the DNA elements in the microarray chip. For PCR arrays designed for yeast, the typical resolution achieved is less than 1 kb. In recent years, high density oligonucleotide arrays, comprising large numbers (40,000 to more than 6,000,000) of short oligonucleotides, have been used for ChIP-chip studies [13–16]. A number of statistical algorithms have also been developed to determine the binding targets from such large scale tiling arrays [17–20]. Borneman et al. used high density oligonucleotide arrays to perform their experiment. The binding targets were obtained using Tilescope [21]. Since they report the target genes in each organism, we simply used their results to obtain the counts of target genes in each of the three organisms.

Our analysis can be applied to any experimental setting with binary outcomes. However, for the sake of simplicity, we have illustrated its application to ChIP-chip experiments. By applying our algorithm to ChIP-chip data from Harbison et al. and Borneman et al., we observe that for different values of the false positive and false negative rates, the observed and true metrics for the binary data can differ quite dramatically. However, we notice that when the true log odds ratio is greater than 4, i.e. there is a significant degree of association among the binding targets across conditions/species, such differences between the observed and true metrics would not change our inference. On the other hand, our simulation results indicate that when the true odds ratio is close to 1, i.e. when the underlying association is marginal, data subject to moderate false positive and false negative rates (0.01 and 0.2 respectively) may not provide conclusive evidence of either association or independence.

References

Buck MJ, Lieb JD: ChIP-chip: considerations for the design, analysis and application of genome-wide chromatin immunoprecipitation experiments. Genomics 2004, 83: 349–360. 10.1016/j.ygeno.2003.11.004
Iyer VR, Horak CE, Scafe CS, Botstein D, Snyder M, Brown PO: Genomic binding sites of the yeast cell-cycle transcription factors SBF and MBF. Nature 2001, 409: 533–538. 10.1038/35054095
Lieb JD, Liu X, Botstein D, Brown PO: Promoter-specific binding of Rap1 revealed by genome-wide maps of protein-DNA association. Nature Genetics 2001, 28: 327–334. 10.1038/ng569
Ren B, Robert F, Wyrick JJ, Aparicio O, Jennings EG, Simon I, Zeitlinger J, Schreiber J, Hannett N, Kanin E, Volkert TL, Wilson CJ, Bell SP, Young RA: Genome-wide location and function of DNA-binding proteins. Science 2000, 290: 2306–2309. 10.1126/science.290.5500.2306
Harbison CT, Gordon DB, Lee TI, Rinaldi NJ, Macisaac KD, Danford TW, Hannett NM, Tagne JB, Reynolds DB, Yoo J, Jennings EG, Zeitlinger J, Pokholok DK, Kellis M, Rolfe PA, Takusagawa KT, Lander ES, Gifford DK, Fraenkel E, Young RA: Transcriptional regulatory code of an eukaryotic genome. Nature 2004, 431: 99–104. 10.1038/nature02800
Beyer A, Workman C, Hollunder J, Radke D, Moller U, Wilhelm T, Ideker T: Integrated assessment and prediction of transcription factor binding. PLOS Computational Biology 2006, 2(6):e70. 10.1371/journal.pcbi.0020070
Zeitlinger J, Simon I, Harbison CT, Hannett NM, Volkert TL, Fink GR, Young RA: Program-specific distribution of a transcription factor dependent on partner transcription factor and MAPK signaling. Cell 2003, 113: 395–404. 10.1016/S0092-8674(03)00301-5
Borneman AR, Gianoulis TA, Zhang ZD, Yu H, Rozowsky J, Seringhaus MR, Wang LY, Gerstein M, Snyder M: Divergence of transcription factor binding sites across related yeast species. Science 2007, 317: 815–819. 10.1126/science.1140748
Agresti A: An introduction to categorical data analysis. New York: Wiley; 1996.
Fleiss JL: Statistical methods for rates and proportions. 2nd edition. New York: John Wiley; 1981.
Datta D, Zhao H: Statistical methods to infer cooperative binding among transcription factors in Saccharomyces cerevisiae. Bioinformatics 2008, 24(4):545–552. 10.1093/bioinformatics/btm523
Efron B: Large-scale simultaneous hypothesis testing: The choice of a null hypothesis. Journal of the American Statistical Association 2004, 99: 96–104. 10.1198/016214504000000089
Boyer LA, Lee TI, Cole MF, Johnstone SE, Levine SS, Zucker JP, Guenther MG, Kumar RM, Murray HL, Jenner RG, Gifford DK, Melton DA, Jaenisch R, Young RA: Core transcriptional regulatory circuitry in human embryonic stem cells. Cell 2005, 122: 947–956. 10.1016/j.cell.2005.08.020
Pokholok DK, Harbison CT, Levine S, Cole M, Hannett NM, Lee TI, Bell GW, Walker K, Rolfe PA, Hannett NM, Herbolsheimer E, Zeitlinger J, Lewitter F, Gifford DK, Young RA: Genome-wide map of nucleosome acetylation and methylation in yeast. Cell 2005, 122: 517–527. 10.1016/j.cell.2005.06.026
Zeitlinger J, Zinzen RP, Stark A, Kellis M, Zhang H, Young RA, Levine M: Whole-genome ChIP-chip analysis of Dorsal, Twist and Snail suggests integration of diverse patterning processes in the Drosophila embryo. Genes & Development 2007, 21: 385–390. 10.1101/gad.1509607
Zheng Y, Yosefowicz SZ, Kas A, Chu TT, Gavin MA, Rudensky AY: Genome-wide analysis of Foxp3 target genes in developing and mature regulatory T cells. Nature 2007, 445: 936–940. 10.1038/nature05563
Bertone P, Stolc V, Royce TE, Rozowsky JS, Urban AE, Zhu X, Rinn JL, Tongprasit W, Samanta M, Weissman S, Gerstein M, Snyder M: Global identification of human transcribed sequences with genome tiling arrays. Science 2004, 306: 2242–2246. 10.1126/science.1103388
Ji H, Wong WH: TileMap: create chromosomal map of tiling array hybridizations. Bioinformatics 2005, 21: 3629–3636. 10.1093/bioinformatics/bti593
Li W, Meyer CA, Liu XS: A hidden Markov model for analyzing ChIP-chip experiments on genome tiling arrays and its application to p53 binding sequences. Bioinformatics 2005, 21: i274–i282. 10.1093/bioinformatics/bti1046
Du J, Rozowsky JS, Korbel JO, Zhang ZD, Royce TE, Schultz MH, Snyder M, Gerstein M: A supervised hidden markov model framework for efficiently segmenting tiling array data in transcriptional and ChIP-chip experiments: systematically incorporating validated biological knowledge. Bioinformatics 2006, 22(24):3016–3024. 10.1093/bioinformatics/btl515
Zhang ZD, Rozowsky J, Lam HY, Du J, Snyder M, Gerstein M: Tilescope: online analysis pipeline for high-density tiling microarray data. Genome Biology 2007, 8: R81. 10.1186/gb-2007-8-5-r81

Acknowledgements

This work was supported in part by NSF grant DMS-0714817 and NIH grant GM59507.

Author information

Authors and Affiliations

Department of Biomedical Engineering, Yale University, New Haven, CT, 06520, USA
Debayan Datta
Department of Epidemiology and Public Health, Yale University, New Haven, CT, 06520, USA
Hongyu Zhao
Department of Genetics, Yale University, New Haven, CT, 06520, USA
Hongyu Zhao

Authors

Debayan Datta
Hongyu Zhao

Corresponding author

Correspondence to Hongyu Zhao.

Additional information

Authors' contributions

DD performed data analysis and drafted the manuscript. HZ conceived and guided the study. Both authors read and approved the final manuscript.

Rights and permissions

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


About this article

Cite this article

Datta, D., Zhao, H. Effect of false positive and false negative rates on inference of binding target conservation across different conditions and species from ChIP-chip data. BMC Bioinformatics 10, 23 (2009). https://doi.org/10.1186/1471-2105-10-23


Received: 30 May 2008
Accepted: 19 January 2009
Published: 19 January 2009
DOI: https://doi.org/10.1186/1471-2105-10-23
