BonEV: an improved multiple testing procedure for controlling false discovery rates
 Dongmei Li^{1}Email authorView ORCID ID profile,
 Zidian Xie^{2},
 Martin Zand^{1},
 Thomas Fogg^{1} and
 Timothy Dye^{1, 3}
https://doi.org/10.1186/s128590161414x
© The Author(s) 2017
Received: 20 July 2016
Accepted: 7 December 2016
Published: 3 January 2017
Abstract
Background
Stability of multiple testing procedures, defined as the standard deviation of total number of discoveries, can be used as an indicator of variability of multiple testing procedures. Improving stability of multiple testing procedures can help to increase the consistency of findings from replicated experiments. BenjaminiHochberg’s and Storey’s qvalue procedures are two commonly used multiple testing procedures for controlling false discoveries in genomic studies. Storey’s qvalue procedure has higher power and lower stability than BenjaminiHochberg’s procedure. To improve upon the stability of Storey’s qvalue procedure and maintain its high power in genomic data analysis, we propose a new multiple testing procedure, named BonEV, to control false discovery rate (FDR) based on Bonferroni’s approach.
Results
Simulation studies show that our proposed BonEV procedure can maintain the high power of the Storey’s qvalue procedure and also result in better FDR control and higher stability than Storey’s qvalue procedure for samples of large size(30 in each group) and medium size (15 in each group) for either independent, somewhat correlated, or highly correlated test statistics. When sample size is small (5 in each group), our proposed BonEV procedure has performance between the BenjaminiHochberg procedure and the Storey’s qvalue procedure. Examples using RNASeq data show that the BonEV procedure has higher stability than the Storey’s qvalue procedure while maintaining equivalent power, and higher power than the BenjaminiHochberg’s procedure.
Conclusions
For medium or large sample sizes, the BonEV procedure has improved FDR control and stability compared with the Storey’s qvalue procedure and improved power compared with the BenjaminiHochberg procedure. The BonEV multiple testing procedure is available as the BonEV package in R for download at https://CRAN.Rproject.org/package=BonEV.
Keywords
Multiple testing procedure FDR Power Stability RNASeqBackground
Microarray and nextgeneration sequencing (NGS) technologies have been widely used in biological and biomedical research to identify novel biomarkers and genomic modifications related to biological processes and diseases. Multiple testing procedures are widely used in microarray and NGS studies to control the multiple testing error rate to minimize false discoveries from the enormous number of simultaneous hypothesis tests [1]. Many multiple testing error rates have been proposed such as familywise error rate (FWER) [2], k familywise error rate (kFWER) [3], and false discovery rate (FDR) [4]. For discovery purposes, the false discovery rate (FDR), defined as the expected proportion of false discoveries among total number of discoveries, is often controlled in multiple testing procedures to select significant features in microarray and NGS studies [4–6]. Benjamini and Hochberg’s FDR controlling procedure [4] and Storey’s qvalue procedure [7, 8] are the most commonly used procedures [9]. The Bonferroni procedure, although perceived as a conservative procedure for multiple testing error rate control, has stability superior to BenjaminiHochberg’s procedure in terms of variability of total number of discoveries, and equivalent power to BenjaminiHochberg’s procedure, when used to control the expected number of false discoveries (EV, where V stands for the number of false discoveries) at a userspecified level [10].
In this study, we examine the stability (defined as standard deviation of the total number of rejected hypotheses) of both BenjaminiHochberg’s FDR controlling procedure and Storey’s qvalue procedure for generating adjusted pvalues to select significant genes or biomarkers in microarray and NGS data analysis. In addition, we propose our own multiple testing procedure (named BonEV) based on Bonferroni’s EV controlling procedure, that has equivalent power, higher stability, and better FDR control than the Storey’s qvalue procedure with at least mediumsized samples in microarray and NGS data analysis. Multiple testing procedures with high power, good FDR control, and high stability are desirable in genomic data analysis due to the high cost of sequencing in genomic studies. The BonEV multiple testing procedure will be attractive to genomic data analysts as it not only maintains the high power of Storey’s qvalue procedure, but also offers better FDR control and higher stability, especially for small to medium sample size studies that need high stability, high power and good FDR control to maximize the odds of true discoveries.
Methods
BenjaminiHochberg procedure

Order original pvalues p _{ i },i=1,…,m, from the smallest to the largest such that p _{(1)}≤p _{(2)}·≤p _{(m)};

Find k as the largest i for which \(P_{(i)} \le \frac {i}{m}\alpha \);

Reject all null hypotheses H _{ i },i=1,2,…,k.
The BenjaminiHochberg procedure provides \(FDR = \frac {m_{0}}{m}\alpha \le \alpha \), a strong control for FDR at level α for independent and positively correlated test statistics. Meanwhile, the BenjaminiHochberg procedure is also conservative by a factor of \(\frac {m_{0}}{m}\) for controlling FDR at level α.
Storey’s qvalue procedure

Order original pvalues p _{ i },i=1,…,m, from the smallest to the largest such that p _{(1)}≤p _{(2)}·≤p _{(m)};

Find k as the largest i for which \(P_{(i)} \le \frac {i}{m \hat {\pi _{0}}}\alpha \) where \(\pi _{0} = \frac {m_{0}}{m}\);

Reject all null hypothesis H _{ i },i=1,2,…,k.
Storey proposes to estimate π _{0} conservatively by \(\hat {\pi _{0}} = \frac {\sharp \{p_{i} > \lambda \}}{(1\lambda)m}\), where λ is chosen to minimize the meansquared error of the pFDR estimates.
Bonferroni procedure
If we let γ=α·E(R), then we have E(V)≤α·E(R) and \(\frac {E(V)}{E(R)} \le \alpha \). Notice that the Bonferroni procedure used to control E(V) is conservative by a factor of \(\frac {m_{0}}{m}\). Thus, we further improve power of the Bonferroni procedure by estimating m _{0} and replacing m with a conservative estimator of m _{0} in the cutoff value.
BonEV procedure

Compare each p _{ i } with \(\frac {\alpha \cdot Pr(\widehat {P^{*}} \le \alpha)}{\hat {\pi _{0}}}\), i=1,2,…,m;

Reject H _{ i } if \(p_{i} \leq \frac {\alpha \cdot Pr(\widehat {P^{*}} \le \alpha)}{\hat {\pi _{0}}}\), i=1,2,…,m.
In our BonEV procedure, P ^{∗} are the adjusted pvalues from the BenjaminiHochberg procedure, and P r(P ^{∗}≤α) is estimated by the proportion of null hypotheses that have the BenjaminiHochberg adjusted pvalues ≤α. We use the same estimate of π _{0} as used in Storey’s qvalue procedure.
Thus, our procedure controls E(V) at α·(m P r(P ^{∗}≤α)) and controls pFDR at level α. We took advantage of existing R functions to estimate P r(P ^{∗}≤α) and π _{0}. We estimate P r(P ^{∗}≤α) from the p.adjust function in R with Benjamini and Hochberg’s method and estimate π _{0} using the qvalue and pi0est function from the qvalue package.
Our proposed BonEV procedure integrates the approaches of the Bonferroni procedure, BenjaminiHochberg procedure, and Storey’s qvalue procedure. The estimated π _{0} from Storey’s qvalue procedure used in the BonEV procedure increases the cutoff value for pvalues in each comparison, thus improving its power compared to BenjaminiHochberg’s procedure. As the sample size increases, the \(\hat {\pi _{0}}\) is closer to the value of π _{0}, and the power will be further improved. By adapting the estimate of P r(P ^{∗}≤α) from the BenjaminiHochberg procedure for the BonEV procedure, the proportion of false discoveries is reduced compared to Storey’s qvalue procedure. Thus, we expect the BonEV procedure to have better FDR control than Storey’s qvalue procedure. Regarding the stability, singlestep approaches such as the Bonferroni and BonEV procedures are superior to stepwise approaches such as the BenjaminiHochberg procedure and Storey’s qvalue procedure, but, the inclusion of the π _{0} estimate in the BonEV and Storey’s qvalue procedures reduces the stability. Taken together, we expect the BonEV procedure to have better stability than Storey’s qvalue procedure.
Results
Simulation studies
Our simulations compare gene expression levels between two groups with equal sample sizes of 5, 15, and 30 in each group. For each sample, we test 10,000 genes with expression levels following multivariate normal distributions with means at vector of 0 for the control group and means at vector of (μ,0) for the treatment group. We set standard deviations at 1 with correlations between genes equal to ρ (ρ = 0, 0.4, or 0.8, depending on the simulation study). All genes were equally correlated with correlation ρ. Meanwhile, we also conduct simulations with pairwise gene correlations randomly selected from uniform(0, 0.8) distribution for sample sizes of 5, 15, and 30 in each of the two groups. We set the proportions of differentially expressed genes (π _{1}=1−π _{0}) at 0.005, 0.01, 0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.40, 0.50, 0.60, 0.75, and 0.80 for each simulation study. The vector μ is set as a sequence from 1 to 3 with length equal to 10,000×π _{1}. We use tstatistics for two independent samples for testing differential gene expression between groups. Each simulation include 1,000 iterations. Each procedure is set to control the FDR at the 0.05 level.
Simulation results
Examples using real data
As a complement to our simulation studies, we also compare apparent test power (total number of discoveries) and stability (SD of total number of discoveries) resulting from these three different procedures using human RNASeq data [12]. We compare gene expression levels between 17 females and 24 males using count data from RNASeq downloaded from the ReCount web site [13]. The RNASeq data from human Bcells that we analyze include 52,580 genes and 41 samples. The summarized count data from the RNASeq experiment is first filtered by only retaining genes that express at a countpermillion (CPM) above 0.5 in at least two samples. The retained 9745 genes are further normalized to eliminate RNA composition biases between libraries by finding a set of scaling factors for the library sizes that minimize the logfold changes between the samples for most genes using the default methoda trimmed mean of M values (TMM) [14] between each pair of samples. The raw pvalue is obtained by fitting negative binomial generalized linear models with CoxReid dispersion estimates using the glmFit and flmLRT function in the edgeR package in Bioconductor [15].
Discussion
In this study, we propose a new multiple testing procedure (BonEV), based on the Bonferroni procedure, intended to improve FDR control and stability as well as maintain power. We compare the BonEV procedure with the BenjaminiHochberg procedure and Storey’s qvalue procedure using both simulation studies and real RNASeq data. Our studies show that the proposed BonEV multiple testing procedure has better control of FDR and higher stability than Storey’s qvalue procedure, and also maintains high levels of power at small and medium sample sizes.
Next generation sequencing and third generation sequencing technology has become more popular in biological and biomedical studies. The sample size in sequencing studies remain small to medium although the price of sequencing per sample has significantly decreased in recent years. Multiple testing procedures with larger power could help increase the probability of novel discoveries. The BonEV procedure, similar to the Storey’s qvalue procedure, offers higher power than the BenjaminiHochberg procedure and increases the costeffectiveness of sequencing studies.
Compared to the Storey’s qvalue procedure, the BonEV procedure offers better FDR control. In recent years, irreproducibility of results in biomedical research has drawn increasing attention in the popular press and academia [17, 18]. Investigations have found many landmarks in preclinical oncology to be nonreproducible, and confirming research conducted by scientists in the haematology and oncology department at the biotechnology firm Amgen in Thousand Oaks, California, finds only 11% of scientific findings they examined to be reproducible [17]. The reasons for irreproducibility are at least, in part, due to false discoveries in those studies [18–20]. With better FDR control, the BonEV procedure will enable more accurate control of false discoveries in genomic studies. Meanwhile, understanding the source of variation in the data generation and analysis can help improve reproducibility of scientific studies [21]. Multiple testing procedures are widely used to select significant features such as genes, SNPs, methylation loci, and others in microarray and NGS studies. Assessment of the stability of statistical findings from multiple testing procedures and improving the stability of these procedures could reduce replication failures. The BonEV procedure will help reduce replication failures compared with the Storey’s qvalue procedure, and provide higher stability.
Conclusions
Our study investigates the stability of BenjaminiHochberg and Storey’s qvalue FDR controlling procedures commonly used in genomic and genetic data analysis and proposes a new multiple testing procedure with higher stability, better FDR control and power equivalent to Storey’s qvalue procedure as well as higher power than the BenjaminiHochberg procedure. The BonEV multiple testing procedure we propose is attractive in microarray and sequencing data analysis in that it has higher power than the BenjaminiHochberg procedure, and better FDR control and higher stability than Storey’s qvalue procedure.
Declarations
Acknowledgements
We thank Stanley Pounds from biostatistics department at St. Jude Children’s Research Hospital for his insightful suggestions and comments on our manuscript and the Center for Integrated Research Computing at the University of Rochester for providing high performance computing resources. We also would like to thank the anonymous reviewers for their insightful comments and suggestions that helped to further improve our manuscript.
Funding
Dr. Li’s, Dr. Zand’s, Mr. Fogg’s, and Dr. Dye’s time is supported by the University of Rochester’s Clinical and Translational Science Award (CTSA) number UL1 TR000042 and UL1 TR002001 from the National Center for Advancing Translational Sciences of the National Institutes of Health. Dr. Zand is also supported by the National Institute of Allergy and Infectious Diseases and the National Institute of Immunology, grant numbers AI098112 and AI069351. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.
Availability of data and materials
The BonEV multiple testing procedure is available as the BonEV package in R for download at https://CRAN.Rproject.org/package=BonEV. The RNASeq data used in the real data examples are downloaded from http://bowtiebio.sourceforge.net/recount/website.
Authors’ contributions
DL, ZX, MZ, TF, and TD contributed to the conception and design of the study. DL, ZX designed the simulation studies. DL performed the simulation studies and real data analysis. DL, ZX, MZ, TF, and TD interpreted the analysis results and wrote the manuscript. All authors edited and approved the final manuscript. All authors agreed to be accountable for all aspects of the study in ensuring that questions related to the accuracy or integrity of any part of the study are appropriately investigated and resolved.
Competing interests
The authors declare that they have no competing interests.
Consent for publication
Not applicable
Ethics approval and consent to participate
Not applicable
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
Authors’ Affiliations
References
 Pounds SB. Estimation and control of multiple testing error rates for microarray studies. Brief Bioinforma. 2006; 7(1):25–36.View ArticleGoogle Scholar
 Hochberg Y, Tamhane AC. Multiple Comparison Procedures. New York, NY: WileyInterscience; 1987.View ArticleGoogle Scholar
 Dohler S. A sufficient criterion for control of some generalized error rates in multiple testing. Stat Probab Lett. 2014; 92:114–20.View ArticleGoogle Scholar
 Benjamini Y, Hochberg Y. Controlling the false discovery rate: A practical and powerful approach to multiple testing. J R Stat Soc Series B (Methodological). 1995; 57(1):289–300.Google Scholar
 Benjamini Y, Hochberg Y. On the adaptive control of the false discovery rate in multiple testing with independent statistics. J Educ Behav Stat. 2000; 25(1):60–83.View ArticleGoogle Scholar
 Keselman HJ, Cribbie R, Holland B. Controlling the rate of type i error over a large set of statistical tests. Br J Math Stat Psychol. 2002; 55:27–39.View ArticlePubMedGoogle Scholar
 Storey JD. A direct approach to false discovery rates. J R Stat Soc Series B (Methodological). 2002; 64(3):479–98.View ArticleGoogle Scholar
 Storey JD. The positive false discovery rate: a Bayesian interpretation and the qvalue. Ann Stat. 2003; 31(6):2013–35.View ArticleGoogle Scholar
 Khatree R, Naik D. Computational Methods in Biomedical Research. New York, NY: Chapman & Hall/CRC; 2008.Google Scholar
 Gordon A, Glazko G, Qiu X, Yakovlev A. Control of the mean number of false discoveries, bonferroni and stability of multiple testing. Ann Appl Stat. 2007; 1(1):179–90.View ArticleGoogle Scholar
 Li D, Xie Z, Le Pape M, Dye T. An evaluation of statistical methods for dna methylation microarray data analysis. BMC Bioinforma. 2015; 16(1):1–20. doi:http://dx.doi.org/10.1186/s128590150641x.
 Cheung VG, Nayak RR, Wang IX, Elwyn S, Cousins SM, Morley M, Spielman RS. Polymorphic Cis and Transregulation of human gene expression. PLoS Biol. 2010; 8(9):1000480. doi:http://dx.doi.org/10.1371/journal.pbio.1000480.
 Frazee AC, Langmead B, Leek JT. Recount: A multiexperiment resource of analysisready rnaseq gene count datasets. BMC Bioinformatics. 2011; 12(1):1–5. doi:http://dx.doi.org/10.1186/1471210512449.
 Robinson MD, Oshlack A. A scaling normalization method for differential expression analysis of rnaseq data. Genome Biol. 2010; 11(3):25.View ArticleGoogle Scholar
 Robinson MD, McCatyhy DJ, Smyth GK. edger: a bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics. 2010; 26(1):139–40.View ArticlePubMedGoogle Scholar
 Bottomly D, Walter NAR, Hunter JE, Darakjian P, Kawane S, Buck KJ, Searles RP, Mooney M, McWeeney SK, Hitzemann R. Evaluating gene expression in c57bl/6j and dba/2j mouse striatum using rnaseq and microarrays. PLoS ONE. 2011; 6(3):1–8. doi:http://dx.doi.org/10.1371/journal.pone.0017820.
 Begley CG, Ellis LM. Drug development: Raise standards for preclinical cancer research. Nature. 2012; 483:531–3.View ArticlePubMedGoogle Scholar
 Plant AL, Locascio LE, May WE, Gallagher PD. Improved reproducibility by assuring confidence in measurements in biomedical research. Nat Methods. 2014; 11(9):895–9.View ArticlePubMedGoogle Scholar
 Begley CG. Reproducibility: Six red flags for suspect work. Nature. 2013; 497:433–4.View ArticlePubMedGoogle Scholar
 McNutt M. Reproducibility. Science. 2014; 343(6168):229–9. doi:http://dx.doi.org/10.1126/science.1250475. http://science.sciencemag.org/content/343/6168/229.full.pdf.
 Stodden V. Reproducing statistical results. Annu Rev Stat Appl. 2015; 2:1–19.View ArticleGoogle Scholar