Sources of variation in false discovery rate estimation include sample size, correlation, and inherent differences between groups
© Zhang and Coombes; licensee BioMed Central Ltd. 2012
Published: 24 August 2012
Skip to main content
© Zhang and Coombes; licensee BioMed Central Ltd. 2012
Published: 24 August 2012
High-throughtput technologies enable the testing of tens of thousands of measurements simultaneously. Identification of genes that are differentially expressed or associated with clinical outcomes invokes the multiple testing problem. False Discovery Rate (FDR) control is a statistical method used to correct for multiple comparisons for independent or weakly dependent test statistics. Although FDR control is frequently applied to microarray data analysis, gene expression is usually correlated, which might lead to inaccurate estimates. In this paper, we evaluate the accuracy of FDR estimation.
Using two real data sets, we resampled subgroups of patients and recalculated statistics of interest to illustrate the imprecision of FDR estimation. Next, we generated many simulated data sets with block correlation structures and realistic noise parameters, using the Ultimate Microarray Prediction, Inference, and Reality Engine (UMPIRE) R package. We estimated FDR using a beta-uniform mixture (BUM) model, and examined the variation in FDR estimation.
The three major sources of variation in FDR estimation are the sample size, correlations among genes, and the true proportion of differentially expressed genes (DEGs). The sample size and proportion of DEGs affect both magnitude and precision of FDR estimation, while the correlation structure mainly affects the variation of the estimated parameters.
We have decomposed various factors that affect FDR estimation, and illustrated the direction and extent of the impact. We found that the proportion of DEGs has a significant impact on FDR; this factor might have been overlooked in previous studies and deserves more thought when controlling FDR.
With the advent of high throughput technologies, research has focused on the systematic genome-wide study of biological systems. Microarray technology has been used to measure the mRNA expression of thousands of genes simultaneously. Concurrently, new statistical methods have been developed to analyze the data generated by these experiments. These methods involve both data preprocessing (background correction, data transformation, normalization, etc.) and specific tools for different types of studies (e.g., class discovery, class prediction, or class comparison).
The canonical class comparison problem involves the identification of lists of DEGs. The evolving consensus  on the analysis of microarray data recognizes the centrality of methods that estimate the FDR associated with gene lists. Although the concept of FDR was introduced by Benjamini and Hochberg in 1995 , a variety of methods have been introduced since then to estimate the FDR in microrray data sets [3–8]. These methods share certain characteristics: they perform a separate statistical test for each gene or protein; they compute a p-value associated with each test; and they estimate the FDR using the distribution of p-values. The methods are usually based on the assumption of independent or weakly dependent test statistics. However, when dealing with microarray data, we know that genes are usually correlated either for biological or technical reasons. Recent studies have demonstrated the non-negligible effects of correlation in microarray data on large-scale simultaneous hypothesis testing and pointed out how variable the FDR could be in the presence of strong correlation [9–13].
Our study sheds more light on possible reasons for the (lack of) precision in the estimated FDR. Our results provide two concrete examples of this imprecision. First, we look at an example where univariate Cox proportional hazards (CPH) models are used to determine which genes appear to be related to survival. By resampling ~ 100 patients at a time (out of a set of ~ 200 patients), we find that the estimates of the percentage of genes that appear to be related to survival range from 0% to 20%. In a simpler example of univariate t-tests between two groups of samples, we find that the estimate of the percentage of DEGs ranges between 13% and 43%. This range of estimates is much wider than one would anticipate from fitting a distribution based on thousands of p-values. However, those p-values are not independent. Since gene expression is often correlated, the effective number of independent measurements used to estimate the distribution may be quite small. Strategies that have been proposed to improve the estimation of FDR include resampling [11, 14, 15] and latent FDR with a random term to capture the correlation . However, the causes of highly variable FDR estimation are multifold and deserve further investigation.
Throughout this paper, we estimate FDR using a BUM model for the distribution of p-values . We select this method primarily because it can be computed quickly. While the method may not be the most accurate way to estimate FDR, the fact that it gives a relatively good fit to the p-value distributions that we encounter suggests that our results are driven by intrinsic variability in the p-value distributions from one sample set to another, and thus are likely to affect all known methods to estimate FDR.
In many cases, it is difficult to evaluate analytical methods for microarray data because of the complex— and unknown—nature of the underlying biological phenomena. Thus, simulated data sets with known “ground truth” are needed in order to assess the performance of computational algorithms for the analysis of high throughput data. To address this problem, many groups have developed microarray simulation software [10, 16–20]. However, many existing simulations rely on simplified ideas of the underlying biology. For instance, the manuscript  that introduced the SPLOSH method to estimate FDR includes a simulation that assumes that (i) genes are independent and (ii) the genes that are differentially expressed all have the same fold change. Neither assumption is likely to hold in the real world, and these simplified assumptions do not give a realistic view of the variability in the FDR estimates.
We have already introduced a package of microarray simulation software called UMPIRE . The current version of UMPIRE allows researchers to simulate heterogeneous microarray data with correlated block structure, which is linked to binary or time-to-event outcomes. Through a comprehensive set of simulations, we show that sample size, correlation structure and portion of DEGs account for the majority of observed variability in the p-value distributions and FDR estimates found in real data.
The Affymetrix microrray data were collected as part of a study to predict survival in follicular lymphoma patients . The data set contains 191 patient samples measured with Affymetrix U133A and U133B arrays. Dave and colleagues quantified the microarray data using Affymetrix MAS 5.0 software and then transformed the results by computing the base-two logarithm. They also separated the samples into training (N = 95) and testing (N = 96) sets. We downloaded the processed microarray data and associated clinical annotations from the supporting web site (http://llmpp.nih.gov/FL/).
The two-color fluorescent cDNA microarray data were collected as part of a study to identify clinically relevant subtypes of prostate cancer . The data set consists of data from 41 healthy prostate specimens, 62 primary prostate tumors, and 9 unmatched lymph node metastases. The microarrays contain 42,129 spots for 38,804 different cDNA clones representing 21,287 distinct UniGene clusters. The prostate cancer samples were labeled with Cy5. A common reference material, pooled from 11 established human cell lines, was labeled with Cy3. The raw microarray data were downloaded from the Stanford Microarray Database (http://cmgm.stanford.edu/pbrown/). We used intensity-dependent loess normalization to normalize the background-corrected channels from each microarray, after which the log ratios between experimental and reference channels were transformed by computing the base-two logarithm.
Genes could be correlated when they are involved in active biological pathways, or are regulated by the same set of factors. We consider the correlation in gene expression to be “clumpy”, meaning that there are gene groups with high correlation within groups but no or little correlation between groups. In order to mimic this correlation feature, we applied block structure. Both the block sizes and the correlations within a block vary in order to mimic different sized pathways/networks, and loosely or strongly correlated genes within a particular pathway/network. Distinct blocks are assumed to be independent. Please refer to our previous publication for detailed description of the block structure .
Using the UMPIRE package, additive and multiplicative noise were incorporated, and correlated blocks were implemented. We simulated normal samples as a homogeneous population with G genes and N samples. We allowed N to vary in order to study the effect of the number of independent observations on various test statistics. The same number of cancer samples were generated with a portion of DEGs, where differential expressions were simulated as changed mean expression. Rather than focusing on individual genes, we altered the means of blocks of genes in order to mimic the effect of cancer pathology on pathways or networks. We let ѱ denote the percentage of differentially expressed blocks, which were randomly selected from the transcriptionally active blocks. Although it is possible for an inactive block of genes in normal samples to be turned on in cancer samples, or vice versa, we kept transcriptionally inactive blocks inactive in both normal and cancer samples in this simulation. The absolute changes of the mean expression values on log scale for a block of genes were given by Δ g ~ Gamma(α, β). Both parameters for this gamma distribution were set to 10 so that the absolute fold change on the log2 scale was 1, and the long tail on the right hand side of the distribution allowed a few genes to have large fold changes. A gene in the changed block was randomly assigned to be up-regulated or down-regulated in cancer samples.
The two practical data sets were used to demonstrate the variability in the distribution of p-values and FDR estimation observed in real data. However, due to the limited sample size and unknown ground truth, the practical data sets lack the flexibility needed for testing analytical methods. Thus, realistic simulations were used to disentangle different factors contributing to the variation in FDR estimation.
We simulated 128 sets of normal and cancer data, using four different sample sizes (N = 10, 25, 50 and 100), eight different mean block sizes (ξ = 1, 5, 10, 50, 100, 250, 500, 1000), and four different θ scenarios, where θ denote the portion of negatively correlated genes within a block. In addition, we tested five levels of differential expression (ѱ = 0, 5%, 10%, 20%, and 40%) with three different mean block sizes (ξ = 1, 10, 100) and 25 samples in each biological condition. Our first goal was to study how sample size, block size, ѱ, and θ affect FDR. Based on the results, different θs do not have a pronounced impact on various parameters of interest (data not shown). Thus in the following sections, we will only show the results obtained from θ that is uniformly distributed between 0 and 0.5 (θ = 0.5 – |x – 0.5| where x ~ Beta(1, 1)). We performed gene-by-gene two-sample t-tests in order to identify DEGs between normal and cancer samples. Then we modeled the p-values using a BUM model. Several parameters of interest were recorded for each data set.
for 0 < x ≤ 1, 0 < α < 1, and 0 < λ < 1. The parameter λ determines the size of the uniform component of the model. The shape of the distribution is determined by α; smaller values of α yield sharper peaks near zero. We estimated and for each simulation.
The remaining portion of the set of p-values arises from the alternative hypothesis; that is, represents the proportion of DEGs. In our example, the true value of π = 1 – ѱ = 0.9 since ѱ = 10%. The distribution of the calculated upper bound of when ѱ = 10% has a pattern similar to the one shown in Figure 4. Even though the upper bound of is approaching its true value with large sample size, the model appears to consistently underestimate the proportion of non-DEGs and overestimate the proportion of DEGs. This finding is consistent with previous studies .
We also studied the correlation between and , which are negatively correlated (Additional file 1). However, the extent of negtive correlation decreases with increasing sample size. The negative correlation is due to the fact that the total area under the BUM model sums to 1. The smaller is, the smaller the alternative component (beta density) and the larger the null component (uniform density) which corresponds to larger . With larger sample size, variabilities of and are reduced due to more independent observations, which explains the decrease in absolute correlation.
We simulated cancer samples with a proportion of DEGs having altered mean expression values. For given (nominal) p-value thresholds, we counted the number of significant DEGs that were called “positive”. Because we know which genes were truly differentially expressed in the simulation, we calculated the number of true positives (TP) and false positives (FP) for each p-value threshold. Next, we calculated the observed false discovery proportion: FDP = FP/(FP + TP). (FDR is the marginal average of the FDP.)
To illustrate the influence of sample size on FDP estimation, we performed another set of simulations with different sample sizes and block sizes, but fixed ѱ (10%). Since we know that the mean of FDP is not significantly affected by block size, we calculated mean FDP by averaging over all block sizes for each sample size. Additional file 2 shows that the mean FDP decreases with larger sample size. Consistent from what we observed in Figure 5, larger block size corresponds to more variable FDP estimation, presumably because of the decreased number of independent measurements.
Efron  showed that correlations among genes could considerably widen or narrow the distribution of the test statistic under the null hypothesis. Moreover, the main effect of the pair-wise correlation could be summarized by a single dispersion variate A. The central peak of the null distribution widens when A > 0, and narrows when A < 0. We recorded A for each simulation; Additional file 3 shows the boxplots of A grouped by sample size and block size when ѱ = 10%. The mean value of A is essentially unaffected by either block size or sample size, but the variance of A increases with larger block size. This observation is consistent with Efron’s finding that more correlation leads to larger variance.
Also following Efron, we calculated the standard deviation of the empirical correlation densities (corr.std). We found that corr.std increases with larger block size (Additional file 4). This result is not surprising, since increasing the block size increases the total amount of correlation present in the data but decreases the effective number of independent measurements that contribute to the estimate. However, the sample size has a much larger effect, with corr.std decreasing significantly with larger sample size.
From the two concrete examples, we observed a lack of precision in the estimation of FDR. In order to study the sources of variation during FDR estimation, we simulated microarray data with more realistic parameters. In our simulation, block-wise structure with different block sizes and intra-block correlation are used to mimic the molecular networks or biological functional groups where large scale correlation of gene expression arises. A particular block of genes could be transcriptionally active or inactive depending on specific biological conditions. When the block of genes are transcriptionally active, their expression levels follow a multi-variate normal distribution with parameters estimated from real microarray data. Certain portion of genes will be differentially expressed between normal and cancer samples, where the magnitude of changes follows a gamma distribution which allows some large magnitude changes while the majority have two fold change on average. With this setting, we simulated microarray data sets with different sample sizes, correlation structure, portion of negtively correlated genes within a block, and portion of DEGs between two biological conditions.
For each simulated data set, we have recorded the parameters related to FDR estimation. Different portions of negtively correlated genes within a block do not affect the parameter estimations. Thus, the three major sources of variation in FDR estimation are the sample size, correlation structures and the portion of DEGs. With large sample size, the variances of all parameters decrease due to increased estimation power. However, the percentage of non-DEGs is always under-estimated, even though it approaches the true portion with larger sample size. Large block size results in less precise estimation of all the parameters due to less independent measurements. However, the block size does not affect the mean estimation of the parameters. Thus the FDR estimation are less precise with more correlation, but the average FDR estimation is not affected.
Our study suggests that an important factor affecting FDR estimation is the portion of DEGs. With larger portion of DEGs, the distribution of test statistic is widened by the larger portion of true positives, thus resulting in smaller FDR and more precise FDR estimation.
In summary, the correlation structure is not the only factor affecting FDR estimations. The portion of DEGs, which varies under different biological conditions contributes to both the precision and the magnitude of FDR estimation.
False Discovery Rate
False Discovery Proportion
Ultimate Microarray Prediction, Inference, and Reality Engine
differentially expressed genes
Cox proportional hazards
This research was supported by NIH/NCI grants P30 CA016672, P50 CA140388, and R01 CA123252.
This article has been published as part of BMC Bioinformatics Volume 13 Supplement 13, 2012: Selected articles from The 8th Annual Biotechnology and Bioinformatics Symposium (BIOT-2011). The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/13/S13/S1