Kerfdr: a semi-parametric kernel-based approach to local false discovery rate estimation

Background The use of current high-throughput genetic, genomic and post-genomic data leads to the simultaneous evaluation of a large number of statistical hypothesis and, at the same time, to the multiple-testing problem. As an alternative to the too conservative Family-Wise Error-Rate (FWER), the False Discovery Rate (FDR) has appeared for the last ten years as more appropriate to handle this problem. However one drawback of FDR is related to a given rejection region for the considered statistics, attributing the same value to those that are close to the boundary and those that are not. As a result, the local FDR has been recently proposed to quantify the specific probability for a given null hypothesis to be true. Results In this context we present a semi-parametric approach based on kernel estimators which is applied to different high-throughput biological data such as patterns in DNA sequences, genes expression and genome-wide association studies. Conclusion The proposed method has the practical advantages, over existing approaches, to consider complex heterogeneities in the alternative hypothesis, to take into account prior information (from an expert judgment or previous studies) by allowing a semi-supervised mode, and to deal with truncated distributions such as those obtained in Monte-Carlo simulations. This method has been implemented and is available through the R package kerfdr via the CRAN or at .


Background
Multiple-testing problems occur in many bioinformatic studies where we considere a large set of biological objects (genes, SNPs, DNA patterns, etc.) and we want to test a null hypothesis H for each object. Typically, H may be 'the expression level of the gene is not affected by the treatment' or 'the pattern is as frequent as expected in the observed DNA sequence'. The control of the number of false positives, i.e. falsely rejected hypotheses, is the crucial issue in multiple testing. To this end, several error rates, such as the Family-Wise Error-Rate (FWER) or the False Discovery Rate (FDR), have emerged and various strategies to control these criteria have been developed (see [1] for a review).
In the last decade the FDR criterion introduced in [2] has received the greatest focus, due to its lower conservativeness compared to the FWER. The FDR is defined as the mean proportion of false positives among the list of rejected hypotheses. It is therefore a global criterion that cannot be used to assess the reliability of a specific hypothesis, i.e. that of a given gene, SNP or pattern.
More recently, a strong interest has been devoted to the local version of the FDR, called 'local FDR' [3] and denoted hereafter ᐍFDR. The idea is to quantify the probability for a given null hypothesis to be true. Even if many different strategies were designed to estimate the ᐍFDR, some of them based on the estimation of FDR itself [4], most of them rely on a mixture model assumption [5], which is a general and statistically convenient framework: the score (test statistics, p-values) on which the testing procedure is based follows a mixture distribution depending on the unobserved status of the hypothesis (true or false). Different approaches have been proposed: fully parametric [6][7][8][9], semi-parametric [10], Bayesian [11,12] or empirical Bayes [3].
The semi-parametric approach developed by [10] uses the knowledge of the distribution f 0 of the score under the null hypothesis, to provide a flexible non-parametric estimation of the alternative distribution (denoted f 1 ), i.e. under the alternative hypothesis. However, some important questions remain partially or not addressed in this reference.
In this paper we provide an implementation of the method with several important and practical generalizations. The Results and Discussion Section recalls the theoretical framework underlying our method, the properties of the estimation algorithm as well as the main steps of its implementation.
Performances are then studied via simulations, and compared to other existing methods. Finally, applications to various bioinformatic data sets, such as gene expressions, DNA sequence patterns and genome-wide associations, are carried out and proposed to the reader

Results and discussion
Semi-parametric mixture model Our estimation of the local FDR (ᐍFDR) relies on the semi-parametric mixture model proposed in [10]. e have at our disposal n hypotheses {H i } i = 1,...,n we want to test.
Suppose that an unknown proportion  0 of them are true nulls. For any hypothesis, we define a random variable H i that equals 0 if it is under H 0 (true null hypothesis), and equals 1 under H 1 (false null). For each H i , we compute a score denoted by X i (a p-value for example). We assume that these scores are independent and identically distributed, with mixture distribution where  1 = 1 - 0 states for the proportion of false null hypotheses, f 0 denotes the probability density function (pdf) of scores under H 0 and f 1 is the pdf of scores under H 1 . Note that f 0 is completely specified. For instance if X i is the p-value of a Student statistic, f 0 is the uniform distribution on [0, 1]. If any transformation (probit or log) is applied, f 0 remains completely known. On the contrary, f 1 needs systematically to be estimated so as to  0 .
In our framework, ᐍFDR defined the probability that H i = 0 given the observed value x i of the score X i : This quantity may be interpreted as a measurement of how likely the hypothesis at hand could be falsely rejected.
Since f 1 is unknown, we use the following (non-parametric) kernel estimator for a given bandwidth h > 0 in which we replace the unknown H i 's by their conditional expectation These expectations are themselves thanks to where is a given estimator of the unknown proportion and . Thus, we obtain As 's and depend on each other, we alternate the computation of (3) and (4) until convergence, which is proved in [10].

Implementation
The method may require to apply a transformation to the sample of p-values (optional), to estimate the proportion of null hypotheses ( 0 ), to determine an optimal value for the bandwidth (h) used in the kernel estimator and to compute the estimation of f 1 . These technical points are further developed and discussed in the Methods section.
Moreover, the corresponding R package allows a simple and straightforward use. For instance the command try = kerfdr(pv) for a given sample of p-values (pv) returns the estimates of  0 and ᐍFDR in try$pi0 and try$localfdr respectively. In addition the running time is very fast thanks to an efficient implementation using convolution through fast Fourier transforms and a list of customizable options for more advanced users such as the choice of  0 , h or the kernel function. The complete R code and a pseudo-R code of kerfdr are available on the webpage.

Practical generalizations
Semi-supervised cases Prior information is actually available in many experiments. Among all the null hypotheses to be tested, some are known to be true (control genes in microarray experiments) while some others are known to be false (test genes in spike-in settings). Such a knowledge is taken into account in the estimation procedure described previously: known a priori the  i s are kept fixed throughout the steps of the algorithm. They contribute to the estimation of f 1 in Eq. (4), but are not updated in Eq. (3).

Truncation
Let us suppose now that we have at hand truncated data within an interval I = [a, b]. By 'truncated', we mean that the support of the p-values distribution is strictly smaller than [0,1]. For instance, if B denotes the number of simulations, p-values smaller than 1/B are often truncated to 0.0. How this will affect our method?
In order to deal with densities, the restrictions of f 0 , f 1 and f to I need to be normalized. Denoting by q 0 , q 1 and q the corresponding normalization factors, the mixture definition gives: Despite q 0 , q 1 can not be easily computed as f 1 is unknown. Fortunately, we can estimate q from a sample X 1 ,..., X n of non-truncated data using from which we derive One should note that this estimator does not necessarily belong to [0,1]. In order to overcome this, we replace its value by 0 if < 0 and by 1 if > 1.

Simulation study
A comparison with other estimation methods of ᐍFDR is provided in [10]. It shows that the semi-parametric approach we propose performs as well as the empirical Bayes approach [13] and the Gaussian mixture model [8] when the distributions f 1 and f 0 are well separated. However, it outperforms them in more difficult situations, especially in terms of stability. We focus here on the particular cases described below (semi-supervised and truncation) that are not handle by the aforementioned methods. The exponential distribution can provide values greater than one and a beta distribution as used in [6] can appear more appropriate; however it occurs very rarely with the taken value for . For each of the 4 × 2 × 2 = 16 configurations, S = 500 samples of size n = 1,000 were generated.

Simulation design
For each proportion  0 and distribution f 1 , the ᐍFDR of the i-th p-value  i has a theoretical expression that is computed. Denoting by , the local FDR estimate of the i-th p-value for the simulation s (s = 1,..., S), the performances of the method are assessed by means of the root mean square error The smaller the RMSE, the better the performances.
To see how prior information improves the estimation of ᐍFDR, we randomly select some hypotheses for which the status is known. The proportion  of these hypotheses is fixed, so that the true value of the local FDR is also known (and equal either to 0 or 1). Figure 1 shows that even a small proportion ( = 1% or 5%) of known hypotheses improves significantly the ᐍFDR estimation.

Truncation
In purpose of comparison, we truncate p-values to a given threshold p* (p* = 10 -2 , 10 -3 ) and compare the generalized method that takes account of truncation with the naive one, in terms of the RMSE criterion. In Figure 2, the orig-inal non-truncated p-values provide a reference that can not be outperformed. We see that the correction improves the quality of the estimates, especially when the truncation is severe (p* = 10 -2 ) and that the corrected estimates can be almost as good as the best achievable.

Applications
Gene expression data As a first illustration, we apply our method to the classical example of Hedenfalk [14] in which the expression levels of n = 3,226 genes are studied. The aim is to compare patients with two different breast cancers: 7 BRCA1 (7 patients) and BRCA2 (8 patients) corresponding to two different gene mutations predisposing to the disease. We Semi-supervised Figure 1 Semi-supervised. Root Mean Square Error (RMSE) between the true local FDR  and the estimates as a function of the proportion 1 - 0 (log-log scale). Proportion of known hypothesis:  = 0 (dotted), 1% (cross), 5% (asterix), 10% (circle) and 50% (square). Top: exponential shape for f 1 . Bottom: uniform shape. Left:  = 0.001. Right:  = 0.01. Variance of the RMSE lies between 1e -4 and 5e -4 with 500 simulations.
use the modified t-test statistic proposed in [15] which avoids false-positives due to bad variance estimates.
Applying our method, we obtain a proportion of null genes of = 66.4% which is consistent with the proportion estimated in [8] ( = 65%). Figure 3 displays the estimated densities: although the proportion of modified genes is quite high (1 -= 33.6%), the local FDR is lower than 1% for only 5 genes; it is below 5% for only 69. This shows that the local FDR is an efficient tool to reduce the type-I error-rate in difficult cases.
The choice of the bandwidth is known to be a crucial step in density estimation problems. In this example, we selected a bandwidth of 0.27. To check to influence of this choice on the results, we tried several values of h between 0.20 and 0.35. Figure 4 shows that the estimated local FDR is not sensitive to this choice.

DNA sequence patterns
It is well known that most biological patterns in DNA sequences have unusual frequencies due to selection mechanisms. It is hence natural to search for new functional patterns among those whose number of occurrences is statistically significant. In order to do so, it is classical to adopt a test framework where the null hypoth-  Genes expression: estimated densities for the Hedenfalk dataset. The expression levels of n = 3,226 genes for 7 BRCA1 and 8 BRCA2 patients (corresponding to two different gene mutations predisposing to the disease) are studied [14]; pvalues are computed by using the modified t-test statistic proposed in [15].

kerfdr(): pi1 = 0.336 and bw = 0.269
where N obs is the observed frequency of the oligomer in the genome.
Thanks to a simple CLT argument, we get that the distribution of Z is approximately a standard Gaussian under the null hypothesis. It is hence possible to use this approximation either by working directly with the z-score or by computing the two-sided p-value associated to each observation: The natural approach is to estimate the densities from the p-values ( Figure 5) where all the 'exceptional' oligomers (under and over-represented) accumulate on the left side of the resulting density. But the flexibility of our method allows us to make the estimations directly on the basis of the z-scores ( Figure 6) by taking into account their bimodal distribution under H 1 and distinguishing the oligomers that are under-represented (on the left side of the resulting density) from those that are over-represented (on the right side). If both strategies provide the same estimation for the proportion of 'null' oligomers ( = 57.3%), ᐍFDR estimations are sensibly different in particular for the ligomers that are over-represented (data not shown).

Quality control in genome-wide association studies
In association studies, deviations from Hardy-Weinberg equilibrium (HWE) can be due to inbreeding, population stratification or selections. They can also be a symptom of lack of quality in genotyping because of a tendency to misscall heterozygous genotypes as homozygous for instance [16]. As a result, testing for HWE has often been proposed as a data quality check with the aim to discard loci that deviate from the equilibrium. Testing for deviations from HWE can be carried out using the Pearson chi-square statistic (X HW ) that quantifies the distance between the observed genotype proportions and the ones expected under the equilibrium.
Here, the HWE test is applied to controls of genome-wide case-control data on the multiple sclerosis from France (Rennes). The data set consists in 74,067 Single Nucleotide Polymorphisms (SNPs). Since the usual chi-square approximation can be poor when there are low genotype counts, p-values are computed via Monte-Carlo simulations (number of simulations B = 10,000) which represents a typical case of truncation of p-values for those that are below the level of precision given by the number of simulations.
Applying our method, we obtain a proportion of null SNPs of = 99.44%. Figure 7 displays the estimated densities, showing a large overlap between the two distributions f 0 and f 1 . By considering a threshold of 1%, then 29 SNPs would be declared to deviate from HWE, and up to 537 for a threshold of 5%. These quantities come down Genes expression: sensitivity of local FDR estimates to the choice of the bandwidth to 454 and 576 respectively when local FDR are estimated in the naive way (not accounting for the truncation). Consequently and in addition to our simulations, this application underlines an inflation of excluded SNPs when the information about a truncation, when it exists, is not taken into account in the estimation procedure.

Conclusion
A simple computational approach to local FDR considers a two-components normal mixture model for modeling the observed empirical distribution (f) where the null distribution (f 0 ) is the standard normal and the alternative distribution (f 1 ) is a normal density with unspecified mean and variance. But the reliability of this approach obviously depends on how well the proposed two-components normal mixture model approximates the real distribution.
Our semi-parametric approach does not assume any constrained alternative distribution and is hence much more flexible. Nonetheless it requires a complete specification of the null distribution, the a priori proportion of true null hypotheses ( 0 ), as well asthe bandwidth (h) for which efficient estimation methods have been developed. The performances of the approach compared to existing methods were assessed in a preceding publication [10] which showed its advantages in difficult situations where the distributions f 0 and f 1 are not well separated. We focused here Patterns in DNA sequences: estimated densities for all 4,096 oligomers of size 6 using p-values on the implementation of the approach, and on two interesting extensions such as the possibility to use prior information in the estimation procedure (semi-supervised) and the ability to handle truncated distribution such as those generated by Monte-Carlo estimation of p-values. Our simulation showed that these informations can significantly improve the quality of estimates. As an illustration, we analyzed three high-throughput biological dataset concerning genes expressions, DNA sequence patterns, and genome-wide association studies. The corresponding R package available at http:// stat.genopole.cnrs.fr/software/kerfdr is fast, thanks to fast Fourier transforms, straightforward to use and propose customizable options to advanced users.
Finally, most of the local FDR estimation procedures derived from the Benjamini and Hochberg framework, including our approach, assume that p-values testing true null hypotheses are independent observations. If it may well be the case for patterns, in practice this assumption does not hold for all the genes or SNPs. A proposed solution is to cluster highly correlated genes (or SNPs) together, and to represent a cluster by a single gene or a linear combination of the associated genes [8]. Theses approaches also generally assume that p-values testing true null hypotheses are continuous and uniform over [0,1]. These issues are likely to be alive fields of research in the near future.
Patterns in DNA sequences: estimated densities for all 4,096 oligomers of size 6 using z-scores Figure 6 Patterns in DNA sequences: estimated densities for all 4,096 oligomers of size 6 using z-scores. This is the same dataset than Figure 5 with the difference that Local FDR is estimated from the z-scores directly instead of p-values. It results in a bimodal density for f 1 .

Probit or logarithm transformations
While it is obviously possible to work directly with a sample of p-values (in this case, f 0 is simply the uniform density over [0, 1]) this option is seldom used in practice. This comes from the fact that most H 1 p-values are concentrated near 0 while H 0 ones are uniformly distributed between 0 and 1. Working with the rough p-values will hence favor estimation of f 0 over f 1 which is precisely our opposite goal. In order to overcome this problem it is then classical to introduce a transformation that will allow us to "zoom" on the interesting part of the distribution. We propose here to consider two such transformations: Probit transformation X = probit(P) =  -1 (P) where P is a p-value and F is the cumulative distribution function of the normal distribution. If P ~ ([0, 1]), X follows a normal distribution and Logarithmic transformation X = log 10 (P) If P ~ ([0, 1]) the -log(10) × X has an exponential distribution and we easily get that Association studies: estimated densities for the Hardy-Weinberg test applied to a set of 74,067 SNPs. DNA were genotyped using a 100 K Affymetrix chip. The algorithm used for making genotype calls has been previously described by Affymetrix. Local FDR is computed from the p-values resulting from an Hardy-Weinberg equilibrium test applied to each SNP.
Note that f 0 is almost perfectly overlapping f since  0 is close to 1.
Two assets of this transformation are to give more weight to small p-values and to be easier to interpret than the probit transformation (X = -2 correspond to P = 10 -2 , X = -5 to P = 10 -5 ).

Estimation of  0
For all 0 §  § 1 we have where T is either the probit or the log 10 function. We hence get We have q 0 = 1 - but q 1 is unknown. We notice that the higher , the closer to 0 q 1 will be. As we can estimate q from a sample X 1 ,..., X n by we obtain the following (conservative) estimator: It is therefore necessary to find a tradeoff between the magnitude of the error O(q 1 ) (lowest for  = 1.0) and the quality of the estimation (best for  = 0.0).
Storey [17] first proposed to use  = 0.5 which appears to be a good choice in most cases.

Determination of the bandwidth
About the choice of the bandwidth, our first approach consists in selecting h as if we were applying a kernel estimation over the whole sample.
For that matter, the literature proposes many methods already implemented in R: biased and unbiased cross-validation estimations (bcv and ucv), method using estimation of derivatives from [18] (sj-ste for solve-the-equation and st-dpi for direct-plugin) and, in two simple heuristics in the special case of Gaussian kernels: nrd0 from [19] (page 48) and nrd from [20].

Estimation of f 1 : Convolution and Fast Fourier Transforms
If we have an observed sample x 1 ,..., x n with weights  1 ,...,  n we get for all x  ‫ޒ‬ where  =  i  i and K states for the kernel function.
The naive computation of all (x i ) requires a quadratic complexity. Fortunately, [21] introduced an algorithm (later modified by [22]) based on Fast Fourier Transform (FFT, see [23] chapter 12) allowing to perform the same computation with a far more efficient linear complexity (see [23] chapter 13 for more details on fast discrete convolution through FFT).

kerfdr and discrete p-values
In developing their original FDR-control procedure, Benjamini and Hochberg [2] assumed that p-values testing true null hypotheses are independent observations from a continuous uniform distribution over [0,1]. A large family of succeeding methods requires the same conditions, to which kerfdr belongs. However, how the performance of these methods are affected when the assumption of continuity or uniformity are violated has not been often considered, contrary to the assumption of independence (see [24] and [25] for instance). Discrete p-values that become more frequently encountered in practice as categorical genomic data, such as Single-Nucleotide-Polymorphisms, Comparative-Genomic-Hybridation and Copy-Number-Variation become more widely available, clearly violate the assumption of uniformity and introduces instability into FDR-like and local FDR estimates.
In kerfdr,  0 and the shape of f 0 are parameters of the method. Since with discrete p-values, correct estimators of  0 and f 0 are tricky to obtain with classical methods included in the package, it is still feasible to use methods more adapted to each situation, such as those proposed by [26][27][28][29], in order to pre-compute  0 and/or f 0 before running kerfdr and to minimize the problems generated by discrete p-values. However, how our algorithm behaves exactly in this context has still to be considered along with its extension dependent data.
For instance in Figure 7, the short decrease in local FDR observed for the p-values near 1 should be interpreted as a nuisance effect that can happen due to a more severe discreteness of p-values near 1 (here computed by Monte-Carlo simulations) and hence should be ignored by the user.