 Research
 Open access
Modifying the false discovery rate procedure based on information theory under arbitrary correlation structure and its performance in high-dimensional genomic data
BMC Bioinformatics volume 25, Article number: 57 (2024)
Abstract
Background
Controlling the False Discovery Rate (FDR) in Multiple Comparison Procedures (MCPs) has widespread applications in many scientific fields. Previous studies show that the correlation structure between test statistics increases the variance and bias of the FDR. The objective of this study is to modify the effect of correlation in MCPs based on information theory. We proposed three modified procedures (M1, M2, and M3), under strong, moderate, and mild assumptions, based on the conditional Fisher information of consecutive sorted test statistics, for controlling the false discovery rate under an arbitrary correlation structure. The performance of the proposed procedures was compared with the Benjamini–Hochberg (BH) and Benjamini–Yekutieli (BY) procedures in a simulation study and on real high-dimensional data of colorectal cancer gene expressions. In the simulation study, we generated 1000 differential multivariate Gaussian features with different levels of correlation structure and screened the significant features with the FDR controlling procedures, with strong control of the Family-Wise Error Rate.
Results
When there was no correlation between the 1000 simulated features, the performance of the BH procedure was similar to the three proposed procedures. In low to medium correlation structures, the BY procedure was too conservative and the BH procedure was too liberal, and the mean number of screened features was constant at the different levels of correlation between features. The mean number of features screened by the proposed procedures lay between those of the BY and BH procedures and decreased as the correlations increased. Where the features were highly correlated, the number of features screened by the proposed procedures reached that of the Bonferroni (BF) procedure, as expected. In the real data analysis, the BY, BH, M1, M2, and M3 procedures were applied to screen gene expressions of colorectal cancer. To fit a predictive model based on the screened features, the Efficient Bayesian Logistic Regression (EBLR) model was used. The EBLR models fitted on the features screened by the M1 and M2 procedures have minimum entropies and are more efficient than those based on the BY and BH procedures.
Conclusion
The modified proposed procedures based on information theory are much more flexible than the BH and BY procedures with respect to the amount of correlation between test statistics. The modified procedures avoid screening the non-informative features, and so the number of screened features decreases as the level of correlation increases.
Introduction
Controlling the Family-Wise Error Rate (FWER) under the nominal level α in large-scale multiple testing is an important issue in statistical inference. The simplest method for controlling the FWER is the Bonferroni (BF) correction, which can be defined as a modification of the rejection threshold for individual p-values. The BF procedure compares all the p-values of K simultaneous hypotheses with α/K. This procedure is very conservative; it provides strong control of the FWER but leads to an increase in the type II error rate. In most studies, researchers accept the hazard of false discoveries in order to find any possible significant difference [1]. So, False Discovery Rate (FDR) procedures were proposed and developed. The Benjamini–Hochberg (BH) procedure, which compares the p-values with a fixed increase in threshold, is used in most recent scientific research [2].
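As a quick illustration of the BF cutoff α/K, consider the following sketch; the p-values are hypothetical, not data from this study:

```python
# Hypothetical p-values for K = 10 simultaneous tests (illustrative numbers only).
alpha = 0.05
pvals = [0.001, 0.004, 0.012, 0.020, 0.031, 0.040, 0.044, 0.048, 0.30, 0.62]
K = len(pvals)

# Bonferroni: reject H0_i only when p_i <= alpha / K = 0.005 here.
bf_reject = [p <= alpha / K for p in pvals]
print(sum(bf_reject))  # → 2; only the two smallest p-values survive the strict threshold
```

Eight of the ten p-values fall below 0.05, but only two survive the Bonferroni threshold, which is exactly the conservatism (inflated type II error) described above.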
The BH procedure is one of the most important methodological advances in testing multiple hypotheses and has been widely used for screening large genomic data sets to identify a favorable number of important features. This procedure has an essential assumption of independence between test statistics. However, when dealing with high-dimensional data such as microarray data, genes are usually correlated for biological or technical reasons [1, 2].
So, Benjamini and Yekutieli (BY) proposed a simple correction of the BH procedure for arbitrary correlation structure. As they reported, this corrected procedure is very conservative [4]. Considering correlation in estimating the FDR has been suggested in several studies [5,6,7,8,9,10,11,12,13,14,15,16,17], but to the best of our knowledge, few studies provide an applicable modification of FDR procedures based on an arbitrary correlation structure.
Initial research on testing multiple hypotheses and controlling the FDR largely ignored the dependence structure among the hypotheses, which is often treated as a nuisance parameter and overshadowed by the assumption of independence [3, 4].
Correlation may lead to more liberal or more conservative test methods; therefore, it should be considered when deciding which hypotheses should be reported as alternative hypotheses [5]. Also, correlation may greatly increase (inflate) the variance of the number of false discoveries and of estimators of the common discovery rate [6, 7]. Ignoring the dependence between hypotheses may lead to loss of efficiency and bias in decision-making. On the other hand, errors in the non-null distribution can lead to false-positive and false-negative errors [3]. Consequently, correlation can significantly worsen the performance of many FDR methods [8], and the FDR can be highly variable when there is strong correlation [5, 9].
Controlling the FDR under dependency is a major problem that requires much research. The key issue is how to incorporate the dependency structure correctly into the inference. Currently, researchers have focused on the development of multiple comparison methods for dependent hypotheses. Benjamini and Yekutieli were the first to show that the BH procedure controls the FDR at level α under certain forms of dependence between p-values; in practice, this approach is very conservative. They also introduced the concept of positive regression dependence on subsets (PRDS) and proved that the BH procedure controls the FDR for p-values with this property [10].
Qiu and Yakovlev demonstrated the effect of strong correlation on the FDR through simulation [7]. Storey et al., Wu, and Clarke and Hall showed that, in the asymptotic sense, the BH procedure is valid under weak-dependence models, linear processes, and Markov dependency [11,12,13]. Owen and Finner et al. showed that the expected value and variance of the number of false positives may behave differently under dependence, but their results did not provide an FDR guarantee, indicating that the BH procedure is vulnerable under severe dependence and variation [6, 14].
Efron and Schwartzman and Lin showed that strong correlations reduce the accuracy of estimation and testing [2, 5]. Specifically, positive or negative correlations affect the empirical null distribution of the Z-values, which has a significant effect on the subsequent analysis.
The studies by Sun and Cai, Sun and Wei, and Benjamini and Heller showed that incorporating functional, spatial, and temporal correlations in inference can improve the power and interpretability of existing methods. However, these methods do not apply to general dependency structures [4, 15, 16]. Also, Leek and Storey and Friguet et al. studied multiple testing under factor models [4, 17, 18]. For a general class of dependent models, Leek and Storey, Friguet et al., Fan et al., and Fan and Han showed that overall dependence can be greatly weakened by removing the common factors, and the modified p-values can be used to build more powerful FDR methods [17,18,19,20]. The studies by Hall and Jin, and Li and Zhong showed that covariance structures can be exploited in multiple testing through transformation of the test statistics, and their results indicated the beneficial effects of dependence [21,22,23].
However, the above methods rely heavily on the accuracy of the estimated models and on asymptotic assumptions about the test statistics. Under small-sample conditions, poor estimates of the model parameters or violation of the independence assumptions may lead to less powerful or invalid FDR methods. Risser developed a Bayesian decision-theoretic approach for multiple dependent tests and a nonparametric hierarchical statistical model that controls the FDR and is robust for identifying false discoveries. Du et al. created a class of distribution-free multiple testing procedures for controlling the FDR under general dependency by considering a sequence of symmetric ranking statistics [20, 21].
In many cases, especially in high-dimensional data, consecutive test statistics have a moderate or strong correlation [24,25,26]. Although techniques such as machine learning or graph representation learning have been developed to handle the complex structures between features in high-dimensional and fused high- and low-order biological data, feature selection by MCPs before applying these techniques can improve their results [27,28,29].
Benjamini and Tille and Clarke and Hall argued that some dependency settings in multiple testing are asymptotically equivalent to independence [10, 11]. But general dependency structures in multiple testing remain a very challenging and important problem. Efron noted that dependence should be considered in deciding which null hypotheses are important, because the accuracy of FDR techniques is compromised in high-correlation situations [9]. Moreover, even if procedures are valid under specific dependency structures, they will continue to suffer reduced performance if the real dependency information is disregarded.
Due to the widespread use of the BH procedure, considering the effect of correlation in practical analyses is important. Previous studies evaluated two types of correlation structure: correlation among features and correlated samples. The studies by Storey et al., Hall and Jin, Sun and Cai, and Li and Zhong focused on correlated samples [1, 10, 19]. In the present study, we consider the correlation between features, which leads to dependent test statistics; so, to modify the BH procedure, we accommodate the correlation between features sorted by the absolute values of the corresponding test statistics. For correct inference, this study modified the FDR procedure according to an arbitrary correlation structure and proposed three modified procedures based on the conditional Fisher information of consecutive sorted test statistics for controlling the false discovery rate.
In the present study, we proposed three modifications of the FDR procedure that can counteract the correlation between sorted features, based on the conditional Fisher information between consecutive sorted test statistics, and applied them to high-dimensional hypothesis testing. Our proposed methods are suggested for simultaneous hypothesis testing in two major settings:

1. Simultaneous comparison of P features between two groups: for example, in genomic data on a specific disease we have thousands of features for two groups (case/control), so we must perform P hypothesis tests to find the feature(s) with a significant difference between the groups.
2. Pairwise comparison of a single feature among k independent groups: for example, in post hoc tests after ANOVA, we must perform k(k−1)/2 hypothesis tests to find the group(s) with significant differences.
In both categories, a correlation structure between the test statistics exists and obviously cannot be ignored. We applied our modified procedures to the first category of simultaneous high-dimensional hypothesis testing, but they can be applied straightforwardly to the second category as well.
Results
Results of the simulation study
Table 1 compares the mean and standard deviation (SD) of the number of screened features without adjustment of the p-values, and with adjustment by the Bonferroni (BF), Benjamini–Hochberg (BH), Benjamini–Yekutieli (BY), and three proposed modified procedures under mild (M3), moderate (M2), and strong (M1) assumptions, according to the level of the correlation coefficient (ρ = 0, 0.2, 0.4, 0.5, 0.6, 0.8, 0.9, 0.95, 0.99) between consecutive test statistics sorted by their p-values.
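The simulation design can be sketched in a few lines. This is a minimal illustration assuming an equicorrelated Gaussian structure generated through a shared latent factor; the sample sizes, effect size, and seed are illustrative assumptions, not the paper's exact settings:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
P, n1, n2, rho = 1000, 15, 15, 0.4   # illustrative settings, not the paper's exact ones

def correlated_gaussian(n, n_feat, corr, shift=0.0):
    """Equicorrelated Gaussian features built from a shared latent factor."""
    common = rng.standard_normal((n, 1))          # shared factor -> pairwise corr = `corr`
    unique = rng.standard_normal((n, n_feat))
    return shift + np.sqrt(corr) * common + np.sqrt(1 - corr) * unique

cases = correlated_gaussian(n1, P, rho, shift=1.0)     # shifted mean: true differences
controls = correlated_gaussian(n2, P, rho, shift=0.0)

# Feature-wise two-sample t-tests, one p-value per feature.
_, pvals = stats.ttest_ind(cases, controls, axis=0)

# Unadjusted screening at alpha = 0.05.
print((pvals < 0.05).sum())
```

Repeating this over a grid of ρ values and over the different adjustment procedures, then averaging the screened counts across replicates, yields the kind of summary reported in Table 1.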
When the correlation coefficient between all features is zero, the number of features with p-value less than 0.05 had a mean and standard deviation of 353.35 and 11.80, respectively. Also, the mean number of features screened by the BH procedure is approximately equal to that of all three modified procedures M1, M2, and M3. However, the mean number of features screened by the BY procedure is considerably less than that of all other procedures except BF. The mean number of features screened by the BY procedure reaches that of the M1, M2, and M3 procedures when the correlations are 0.95, 0.9, and 0.8, respectively; that is, under high levels of correlation the BY procedure performs approximately the same as the modified procedures. As shown in Table 1, as the correlation coefficient increases, the mean number of features screened without adjustment of their p-values is approximately constant, but the standard deviation increases with ρ. This pattern also holds for the BF, BY, and BH adjustment procedures. But for the M1, M2, and M3 procedures, both the means and the standard deviations change according to the correlation between features: the mean number of screened features decreased as the level of correlation increased in the three proposed methods, while their standard deviations increased.
As expected, the number of features screened by the strong modification M1 is less than the number screened by the moderate modification M2, which in turn is less than the number screened by the mild modification M3. The standard deviations of the number of screened features increase with the level of correlation in all proposed procedures. As shown in Fig. 1, when \(\rho =0\) the distributions of the number of screened features are symmetric, but the kurtosis of the BF and BY procedures is higher than that of the normal density. The distributions of features screened by the BH, M1, M2, and M3 procedures are approximately identical and normal. As \(\rho\) increases, the distributions of the numbers of features screened by all procedures become skewed to the right, and the skewness increases with \(\rho\). The box plots of the number of screened features, comparing the median, interquartile range (IQR), and outliers, are presented in Fig. 2. From these plots, we can observe that, despite the increase in outliers with increasing \(\rho\), the IQR of the modified procedures is smaller than that of the BH and BY procedures. The distances of the first and third quartiles from the median for the modified procedures are approximately equal and symmetric in comparison to the BY and BH procedures. More descriptive statistics of the screened features in the simulation study are presented in Additional file 1: S1. Also, the results of the other simulation study, in which the sample size in each group is equal to 30, are presented in Additional file 2: S2.
Results of the real study
Based on the p-values of the t-tests on P = 22,277 gene expressions, at the level α = 0.05, 8465 gene expressions were significant, but most of these genes are not involved in cancer. Since the type I error rate α was not prespecified in this study, we first determined the power (1 − β) based on different values of the effect size and the type I error rate, using the following formula,
where Φ is the cumulative Gaussian distribution function, n = 55 is the sample size in the healthy tissue group, δ = 0.75 is the midpoint between 0.5 and 1, i.e., between a moderate and a large effect size, and α = 5 × 10^{–12} and 5 × 10^{–10}. The calculated powers for the individual tests are 1 − β = 98.9% and 99.9%, respectively.
Due to the high dimensionality of the data, the main concern when performing these hypothesis tests is to keep the trade-off between controlling the type I error (i.e., keeping the family-wise type I error rate at its nominal level α, as in the BF procedure) and preserving the power of the study to screen the significant features; the FDR procedures are therefore used to screen the gene expressions most relevant to colorectal cancer. We compared the performance of the BH, BY, and three proposed modified procedures, M1, M2, and M3, in Table 2.
Firstly, we show the distribution of the 22,277 t-values and the bivariate correlations between features sorted by their p-values in Fig. 3. As shown in these histograms, the distributions of the t-values and of the correlations are symmetric around zero. For further exploration, we also show in Fig. 3 the distribution of the first 200 t-values and the bivariate correlations between the corresponding sorted features. As expected, the first 200 t-values have a bimodal distribution concentrated on the two tails of the histogram of t-values for all 22,277 features. The histogram of correlations, however, is more striking: the correlations between the first 200 features are high and also have a bimodal distribution on the tails of the histogram of bivariate correlations between the 22,277 consecutive sorted gene expressions. So, the independence assumption of the BH procedure is violated, and it is necessary to consider the correlation structure in the FDR procedures.
Table 3 shows the number of features screened by the six adjustment procedures at two levels of α, together with the entropy and AUC of the corresponding EBLR models. As shown in Table 3, the numbers of features screened by the BF and BY procedures are equal, as are the numbers screened by the M1 and M2 procedures. At α = 5 × 10^{–12}, the numbers of features screened by all six procedures are small and the entropies are approximately equal. So, we prefer α = 5 × 10^{–10}, which gains more power, and compare the performance of the FDR procedures at this level of the type I error rate. At this level of α, the number of screened features increased considerably for all adjustment procedures, and there are considerable differences among the procedures. By fitting the EBLR model on the features screened by each of the six adjustment procedures, the entropies and AUCs were calculated. As seen in this table, all the AUCs are 1 (perfect fit) except for the BF procedure. The entropies are close to each other, but the entropy of the EBLR model on the 94 features screened by the BH procedure is 0.82, while the entropy of the EBLR model on the 61 features screened by the M1 procedure is 1.19; that is, with a saving of 94 − 61 = 33 degrees of freedom the increase in entropy is only 1.19 − 0.82 = 0.37, so the M1 procedure is more efficient than the BH procedure. Also, the difference between the entropies of the EBLR models fitted on the features screened by the M2 and M1 procedures is negligible compared with the difference in degrees of freedom. The box plots in Fig. 4 show that the predicted probabilities of the EBLR model are completely separated between cancerous and healthy tissue for the M1, M2, M3, and BY procedures, but not for the BF and BH procedures.
Although the box plot of the BY procedure shows a perfect fit, the entropies of the EBLR models fitted on the 61 features from the M1 procedure, the 71 features from the M2 procedure, and the 81 features from the M3 procedure are all less than the entropy of the EBLR model fitted on the 59 features screened by the BY procedure. So, the M1, M2, and M3 procedures are more efficient than the BY procedure. Finally, the M1 procedure with 61 screened features is the most efficient procedure for feature screening in the colorectal cancer data, according to its lower entropy with less loss in degrees of freedom.
Discussion
The BH procedure for feature screening based on controlling the false discovery rate rests on the substantial assumption of independent test statistics. In large-scale multiple testing, the assumption of independence between test statistics is unrealistic. Many studies reported that the dependency structure between test statistics causes overdispersion in the distribution of the FDR [4,5,6,7,8]. In the present study, we observed that the overdispersion and right-skewness in the distribution of the number of features screened by the BF, BH, BY, and proposed procedures increase with the level of correlation. However, as shown in Fig. 1, the skewness in the density of the proposed procedures is less than that of the BH procedure, and the interquartile ranges of the boxplots in Fig. 2 are narrower than those of the BH and BY procedures.
When there was no correlation between the 1000 simulated features, the performance of the BH procedure was similar to the three proposed procedures, but the BY procedure was very conservative, as previously reported [4]. In low to medium correlation structures, the BY procedure is too conservative and the BH procedure is too liberal. The mean numbers of features screened by the BH and BY procedures were constant at the different levels of correlation between features. The mean numbers of features screened by our proposed procedures lay between those of the BY and BH procedures and decreased as the level of correlation increased. Where the correlations between features were high (ρ > 0.8), the number of features screened by the proposed procedures reached that of the BF procedure, as expected. We reduced the acceleration of the increase in the number of false discoveries by modifying the BH procedure according to the amount of extra information carried by each new feature, resulting in a more precise procedure for screening the important features in the presence of a dependence structure between the features.
Then, we compared the performance of the three proposed procedures with the BF, BH, and BY procedures for screening in a high-dimensional genomic dataset, with 22,277 gene-expression comparisons between the healthy and cancerous tissue groups. By allowing two different levels for the nominal type I error rate α, the significant genes were screened by the six procedures. The Efficient Bayesian Logistic Regression (EBLR) model was used to fit a predictive model based on the screened features. The EBLR models based on the features screened by the M1 and M2 procedures have minimum entropies and were more efficient than those based on the BY and BH procedures. In a study on this data set, twenty machine learning approaches were used to fit predictive models based on the screened features; the maximum AUC of 0.94 was obtained by a Deep Neural Network (DNN) and a Logistic Model Tree (LMT) [27].
Leek and Storey developed an approach that addresses arbitrarily strong dependence in large-scale (high-dimensional) multiple testing at the level of the original data, before calculating the test statistics or p-values. To address the dependency problem of multiple testing, based on kernel dependency estimation, they presented a small set of vectors that entirely define the dependency structure in any high-dimensional data set. They showed that hypothesis tests can be made independent by conditioning on a dependence kernel. This generalizes the results of error rate control under independence to the general dependency setting. It can also estimate dependence at the data level, which is more useful than estimating dependence at the level of p-values or test statistics [23]. Compared with the proposed procedures, this method is blind and based on random correlation structures, whereas our modifications are based on the ordered information of the whole data set.
Although some efficient methods for low- to high-correlated features have been proposed and used, our proposed procedures are the first to modify the thresholds of the FDR procedure based on information theory. So, according to the results of the simulation study and the real data study, the number of screened features is optimized.
Conclusion
The modified proposed procedures based on information theory are much more flexible than the BH and BY procedures with respect to the amount of correlation between test statistics. Our modified procedures avoid screening the non-informative features, and so the number of screened features decreases as the level of correlation increases.
The three proposed modified procedures for feature screening are readily applicable for arbitrary positive or negative, and low or high, correlation structures between sorted test statistics. These modifications are based on information theory and lead to a small set of significant features with sufficient information, given the correlation between the sorted features, so that the remaining features carry no extra information.
Methods
First, we describe the Benjamini–Hochberg (BH) and Benjamini–Yekutieli (BY) procedures, and then introduce our proposed modified procedures.
Benjamini–Hochberg procedure (BH Procedure)
When the test statistics are independent under the null hypothesis, the BH procedure controls the FDR at the level α. The BH procedure is as follows:
1. Sort the observed p-values in ascending order, \({p}_{(1)}\le \dots \le {p}_{(P)}\).
2. Calculate \(k={\text{max}}\{1\le {\text{i}}\le {\text{P}}:{p}_{\left(i\right)}\le \frac{{l}_{i}}{{\text{P}}}\alpha \}\), where \({l}_{i}=i\) for \(i=1, 2,\dots ,P\).
3. If such a k exists, all the null hypotheses corresponding to \({p}_{(1)}\le \dots \le {p}_{\left(k\right)}\) are rejected.
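A compact implementation of these three steps might look like the following sketch; `bh_screen` is our name, not from the paper:

```python
import numpy as np

def bh_screen(pvals, alpha=0.05):
    """Benjamini-Hochberg step-up: find k = max{i : p_(i) <= (i/P) alpha} and
    reject the hypotheses with the k smallest p-values; returns a boolean mask."""
    p = np.asarray(pvals, dtype=float)
    P = p.size
    order = np.argsort(p)                          # indices of sorted p-values
    thresholds = (np.arange(1, P + 1) / P) * alpha # (i/P) * alpha for i = 1..P
    below = p[order] <= thresholds
    reject = np.zeros(P, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0]) + 1       # largest i with p_(i) below threshold
        reject[order[:k]] = True
    return reject

# Illustrative p-values (hypothetical):
pv = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.31, 0.9]
print(bh_screen(pv, 0.05).sum())  # → 2
```

Note the step-up character: once the largest qualifying index k is found, all k smallest p-values are rejected, even those individually above their own threshold.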
Benjamini–Yekutieli procedure (BY procedure)
Benjamini and Yekutieli proposed a procedure for controlling the false discovery rate under arbitrary dependency (test statistics with positive or negative correlations). They modified the threshold of the BH procedure using the constant \(C\left(P\right)=\sum_{i=1}^{P}\frac{1}{i}\), rejecting all the null hypotheses corresponding to \({p}_{(1)}\le \dots \le {p}_{\left(k\right)}\) where \(k={\text{max}}\{1\le i\le P:{p}_{(i)}\le \frac{i}{P\,C(P)}\alpha \}\).
But in situations where the test statistics are independent or positively correlated, they suggested C(P) = 1, as in the ordinary BH procedure.
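The BY correction only deflates the BH thresholds by C(P). A self-contained sketch (`by_screen` is our name):

```python
import numpy as np

def by_screen(pvals, alpha=0.05):
    """Benjamini-Yekutieli: BH step-up with thresholds deflated by
    C(P) = sum_{i=1}^P 1/i, controlling the FDR under arbitrary dependence."""
    p = np.asarray(pvals, dtype=float)
    P = p.size
    c_p = np.sum(1.0 / np.arange(1, P + 1))                 # C(P), harmonic number
    order = np.argsort(p)
    thresholds = (np.arange(1, P + 1) / (P * c_p)) * alpha  # (i / (P C(P))) * alpha
    below = p[order] <= thresholds
    reject = np.zeros(P, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0]) + 1
        reject[order[:k]] = True
    return reject

pv = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.31, 0.9]
print(by_screen(pv, 0.05).sum())  # → 1
```

On the same illustrative p-values, BY rejects fewer hypotheses than BH, reflecting the conservatism discussed above.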
Proposed modified procedures
Consider the P simultaneous hypotheses: \({H}_{0i}:{\delta }_{i}=0 \quad vs \quad {H}_{1i}:{\delta }_{i}>0, \quad for\; i=1, 2, \dots , P \qquad (1)\)
where \({\delta }_{i}=\left|{\mu }_{1i}-{\mu }_{2i}\right|\) is the absolute mean difference between the two groups for the ith feature; \({\mu }_{1i}\) is the mean of the ith feature in the first (case) group, and \({\mu }_{2i}\) is the mean of the ith feature in the second (control) group.
If we assume that all features are independent and follow a multivariate Gaussian distribution with mean \({\varvec{\delta}}=({\delta }_{1},{\delta }_{2},\dots ,{\delta }_{P})\) and diagonal covariance matrix Σ, we can scale each \({\delta }_{i}\) by dividing by its standard error: \({\tau }_{i}=\frac{{\delta }_{i}}{\sqrt{\frac{{\sigma }_{1i}^{2}}{{n}_{1}}+\frac{{\sigma }_{2i}^{2}}{{n}_{2}}}}\)
where \({\sigma }_{1i}^{2}\) is the variance of the ith feature in the first (case) group, \({\sigma }_{2i}^{2}\) is the variance of the ith feature in the second (control) group, and \({n}_{1}\) and \({n}_{2}\) are the sample sizes of the first and second groups, respectively.
So we rewrite the hypotheses (1) as follows: \({H}_{0i}:{\tau }_{i}=0 \quad vs \quad {H}_{1i}:{\tau }_{i}>0, \quad for\; i=1, 2, \dots , P \qquad (2)\)
The t-test statistic for (2) is as follows: \({t}_{i}=\frac{{\overline{X} }_{1i}-{\overline{X} }_{2i}}{{S}_{i}\sqrt{\frac{1}{{n}_{1}}+\frac{1}{{n}_{2}}}}\)
where \({\overline{X} }_{1i}\) is the sample mean of the ith feature in the first (case) group, \({\overline{X} }_{2i}\) is the sample mean of the ith feature in the second (control) group, \({S}_{1i}^{2}\) and \({S}_{2i}^{2}\) are the sample variances of the ith feature in the first (case) and second (control) groups, respectively, \({S}_{i}= \sqrt{\frac{\left({n}_{1}-1\right){S}_{1i}^{2}+\left({n}_{2}-1\right){S}_{2i}^{2}}{{n}_{1}+{n}_{2}-2}}\) is the pooled standard deviation of the ith feature over both groups, and \({n}_{1}\) and \({n}_{2}\) are the sample sizes of the first and second groups, respectively. If \(n_{1}\) and \(n_{2}\) are large enough, \(\left({n}_{1}+{n}_{2}-2\right)\ge 30\), the \({t}_{i}\)'s follow a Gaussian distribution with mean \({\varvec{\tau}}=({\tau }_{1},{\tau }_{2},\dots ,{\tau }_{P})\) and covariance matrix I. So we use \(Z_{i}\) instead of \(t_{i}\).
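The feature-wise pooled t-statistic defined above can be computed vectorially; a sketch with simulated data (`pooled_t` is our name, and the data are illustrative):

```python
import numpy as np

def pooled_t(x1, x2):
    """Feature-wise two-sample t-statistics with pooled variance,
    matching the definitions of S_i and t_i above. x1, x2: (n, P) arrays."""
    n1, n2 = x1.shape[0], x2.shape[0]
    s2 = ((n1 - 1) * x1.var(axis=0, ddof=1)
          + (n2 - 1) * x2.var(axis=0, ddof=1)) / (n1 + n2 - 2)  # pooled variance S_i^2
    return (x1.mean(axis=0) - x2.mean(axis=0)) / np.sqrt(s2 * (1 / n1 + 1 / n2))

rng = np.random.default_rng(1)
x1 = rng.standard_normal((20, 5)) + 1.0   # illustrative "case" samples, shifted mean
x2 = rng.standard_normal((20, 5))         # illustrative "control" samples
t = pooled_t(x1, x2)
print(t.shape)                            # one t-statistic per feature
```

For n1 + n2 − 2 ≥ 30, as noted above, these statistics can be treated as approximately standard Gaussian under the null.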
According to information theory, when the \(\left|{\overline{X} }_{1i}-{\overline{X} }_{2i}\right|\)'s are independent multivariate Gaussian random variables, the Fisher information of \({\delta }_{i}\), conditional on \({\delta }_{i-1}\), is as follows;
Also, the \({Z}_{i}\)'s are independent multivariate standard Gaussian random variables, so the Fisher information of \({\tau }_{i}\), conditional on \({\tau }_{i-1}\), is \({I}_{({\tau }_{i}|{\tau }_{i-1})}({Z}_{i}|{Z}_{i-1})=1,\, for\, i=2, 3,\) …, P. Also, the Fisher information of P independent Gaussian features is equal to \(I({\tau }_{1},{\tau }_{2}, \dots , {\tau }_{P})=P\). So,
In the BH procedure, according to the independence assumption, the step-up conditional thresholds increase by \(\frac{1}{P}\). But when the features are correlated, if \(Corr\left({X}_{(i)},{X}_{(i-1)}\right)={\rho }_{i}\) we have:
So, the Fisher information of \({\delta }_{i}\), conditional on \({\delta }_{i-1}\), is as follows,
And the Fisher information of \({\tau }_{i}\), conditional on \({\tau }_{i-1}\), is as follows: \({I}_{({\tau }_{i}|{\tau }_{i-1})}({Z}_{i}|{Z}_{i-1})=1-{\rho }_{i}^{2} \qquad (3)\)
So, under the mild assumption we propose that the conditional thresholds increase by (3).
As \(\left(1-{\rho }_{i}^{2}\right)\le 1\), the information of \({\tau }_{i}\) conditional on \({\tau }_{i-1}\) decreases when the two variables are correlated. This is clear, because when two variables are correlated, part of the information of the second variable is already contained in the first. As \(Corr\left({Z}_{\left(i\right)},{Z}_{(i-1)}\right)={\rho }_{i}\), we can define two independent consecutive sorted standardized Gaussian test statistics as \({Z}_{(i)}^{*}=\frac{{Z}_{(i)}-{\rho }_{i}{Z}_{(i-1)}}{\sqrt{1-{\rho }_{i}^{2}}}\).
In genomic high-dimensional datasets, the features are measured on a single source (patient), so we can make the strong assumption that all effects (absolute mean differences) are identical with a Gaussian distribution, while the correlations between features differ. As a result,
So, the conditional Fisher information under the strong assumption is \({I}_{({\tau }_{i}|{\tau }_{i-1})}=\frac{1-\left|{\rho }_{i}\right|}{1+\left|{\rho }_{i}\right|} \qquad (4)\)
So, under the strong assumption we propose that the conditional thresholds increase by (4).
Also, we can write: \(1-\left|{\rho }_{i}\right|=\frac{1-{\rho }_{i}^{2}}{1+\left|{\rho }_{i}\right|}\)
As \(\left(1+\left|{\rho }_{i}\right|\right)\ge 1\), we propose a moderate modification between the strong and mild modifications: \(\left(1-\left|{\rho }_{i}\right|\right) \qquad (5)\)
As a result, under the moderate assumption we propose that the conditional thresholds increase by (5).
The step-down procedure works after sorting the absolute values of \(Z_{i}\) in descending order. Suppose that \(Z_{(i)}\) is the ith sorted test statistic and \(Corr\left({Z}_{\left(i-1\right)},{Z}_{\left(i\right)}\right)={\rho }_{i}\) for i = 2, 3, …, P.
In the case of \({\rho }_{i}\ne 0\), the FDR procedure should be modified based on this correlation coefficient. The Pearson correlation coefficient \(r_{i}\), as an estimator of \({\rho }_{i}\), between consecutive features sorted by their p-values, was used for the modifications of the FDR procedure.
We propose three (strong, moderate, and mild) modifications of the thresholds of the BH procedure. The thresholds \(l_{i}\), based on the conditional Fisher information under the mild, moderate, and strong assumptions, are as follows:

1. Mild modification: \(\dddot l_{i} = \dddot l_{i-1} + \left( {1 - \left| {r_{i} } \right|^{2} } \right)\)
2. Moderate modification: \({{\ddot{l}}_{i}=\ddot{l}}_{i-1}+\left(1-\left|{r}_{i}\right|\right)\)
3. Strong modification: \({{\dot{l}}_{i}=\dot{l}}_{i-1}+\left[\frac{\left(1-\left|{r}_{i}\right|\right)}{\left(1+\left|{r}_{i}\right|\right)}\right]\)
For \(i=2, 3, \dots , P\), and we define \(\dot{l}_{1} = \ddot{l}_{1} = \dddot l_{1} = 1\).
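The three threshold sequences can be generated directly from the consecutive correlations. A sketch (`modified_thresholds` is our name); under independence every increment is 1 and the thresholds reduce to the BH sequence 1, 2, …, P:

```python
import numpy as np

def modified_thresholds(r, kind="strong"):
    """Cumulative thresholds l_1..l_P from consecutive correlations r_2..r_P.
    l_1 = 1; increments: mild 1-|r|^2, moderate 1-|r|, strong (1-|r|)/(1+|r|)."""
    r = np.abs(np.asarray(r, dtype=float))
    if kind == "mild":
        inc = 1 - r**2
    elif kind == "moderate":
        inc = 1 - r
    elif kind == "strong":
        inc = (1 - r) / (1 + r)
    else:
        raise ValueError(f"unknown kind: {kind}")
    return np.concatenate(([1.0], 1.0 + np.cumsum(inc)))

r = [0.9, 0.5, 0.0]                       # illustrative consecutive correlations
print(modified_thresholds(r, "mild"))     # grows more slowly than BH's 1, 2, 3, 4
```

Note the ordering: for any |r| the strong increment is smallest and the mild increment is largest, which is why M1 screens the fewest features and M3 the most.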
So, our procedure works as follows:

1. Sort the observed p-values in ascending order, \({p}_{(1)}\le \dots \le {p}_{(P)}\).
2. Calculate \(Corr\left({X}_{(i)},{X}_{(i-1)}\right)={r}_{i}\), for \(i=2, 3, \dots , P\).
3. Calculate the \({l}_{i}\)'s, for \(i = 1, 2, \ldots , P\).
4. Calculate \(k={\text{max}}\{1\le {\text{i}}\le {\text{P}};{p}_{\left(i\right)}\le \frac{{l}_{i}}{{\text{P}}}\alpha \}\).
5. If such a k exists, all the null hypotheses corresponding to the first k sorted p-values are declared significant.
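Putting the five steps together, a sketch of the full procedure (the function name is ours; the consecutive correlations are estimated from the data matrix, as described above):

```python
import numpy as np

def modified_fdr_screen(X, pvals, alpha=0.05, kind="strong"):
    """Sketch of the proposed procedure: sort features by p-value, estimate the
    Pearson correlation r_i between consecutive sorted features, build the
    modified thresholds l_i, and apply the step-up rule p_(i) <= (l_i / P) alpha.
    X: (n, P) data matrix; pvals: length-P vector of feature-wise p-values."""
    p = np.asarray(pvals, dtype=float)
    P = p.size
    order = np.argsort(p)
    Xs = X[:, order]
    # r_i between the (i-1)th and ith sorted feature, i = 2..P
    r = np.abs([np.corrcoef(Xs[:, i - 1], Xs[:, i])[0, 1] for i in range(1, P)])
    if kind == "mild":
        inc = 1 - r**2
    elif kind == "moderate":
        inc = 1 - r
    else:                                    # "strong"
        inc = (1 - r) / (1 + r)
    l = np.concatenate(([1.0], 1.0 + np.cumsum(inc)))   # l_1 = 1
    below = p[order] <= (l / P) * alpha
    reject = np.zeros(P, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0]) + 1
        reject[order[:k]] = True
    return reject
```

When the estimated correlations are near zero the thresholds approach the BH sequence, and when they are near one the thresholds barely grow, approaching BF, mirroring the two limiting cases discussed below.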
If all the sorted features are perfectly linearly correlated, we have \(\left|r_{i}\right|=1\) and hence \(l_{i}=l_{i-1}=1\) for all \(i\). This means that all sorted test statistics carry the same information within the class of linear estimators of \(\tau_{i}\), so the thresholds of our proposed procedures do not increase across consecutive tests, and the modified FDR procedures behave similarly to the BF procedure.
If the test statistics are independent, the pairwise correlation coefficients between all features are zero, so \(l_{i}=l_{i-1}+1=i\) and the threshold for the ith sorted p-value is \(\frac{i}{P}\alpha\). This means that, when all sorted test statistics are independent, the three proposed procedures behave similarly to the BH procedure.
We compared the adjusted thresholds and the adjusted p-values of the BF, BH, BY, and three proposed procedures, strong (M1), moderate (M2), and mild (M3), by the rank of the sorted p-values in Table 4. Except for the BY procedure, the first p-value is compared with \(\frac{1}{P}\alpha\) in all procedures. The thresholds of the BF procedure are fixed and do not increase with the rank of the sorted p-values. Both the BH and BY thresholds increase linearly with the rank of the sorted p-values, as \(\frac{k}{P}\alpha\) and \(\frac{k}{P\times C(P)}\alpha\), respectively. The thresholds of M1, M2, and M3 increase with the rank of the sorted p-values, but in proportion to the level of correlation between the sorted test statistics. The modified thresholds increase more slowly than those of the BH procedure, so the number of features screened by the modified procedures is expected to be smaller than for the BH procedure. Since \(C(P)>1\) for \(P>1\), the first threshold of the BY procedure is less than that of the BF procedure, so the BY procedure can be more conservative than the BF procedure because of its first threshold value.
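The threshold sequences compared in Table 4 can be generated side by side. A minimal sketch, assuming the BY constant is the harmonic sum \(C(P)=\sum_{i=1}^{P} 1/i\) and using M2 as the representative modified procedure; thresholds are reported as multiples of \(\alpha\):

```python
def thresholds_table(P, r):
    """k-th comparison values t_k (reject p_(k) if p_(k) <= t_k * alpha)
    for BF, BH, BY, and the moderate modification M2."""
    C = sum(1.0 / i for i in range((1), P + 1))  # BY harmonic constant C(P)
    bf = [1.0 / P] * P                           # fixed at 1/P
    bh = [k / P for k in range(1, P + 1)]        # linear growth k/P
    by = [k / (P * C) for k in range(1, P + 1)]  # slower linear growth
    l, m2 = 1.0, []
    for k in range(P):
        if k > 0:
            l += 1 - abs(r[k])                   # M2 recursion on thresholds
        m2.append(l / P)
    return bf, bh, by, m2
```

With \(|r_i|=1\) everywhere, M2 coincides with BF; with \(r_i=0\) everywhere, it coincides with BH, and the BY first threshold is always below the BF one, as noted above.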
Illustration example
To demonstrate how we estimate the thresholds and adjusted p-values, we constructed an artificial example. Suppose we performed eight individual hypothesis tests to find significant differences for eight features between two groups and sorted their p-values as follows,
We also computed the Pearson correlation coefficient between consecutive features sorted by their p-values, as follows,
The purpose of this example is the simultaneous comparison of eight features between two groups, so an adjustment procedure must be used to control the FDR. We compared the performance of six adjustment procedures in finding simultaneous differences at the significance level \(\alpha = 0.1\) with two approaches: first, we calculated the adjusted p-values and compared them with \(\alpha\); second, we calculated the adjusted thresholds and compared the sorted p-values with them (Table 5). As shown in Table 5, both approaches lead to the same result.
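The paper's actual eight p-values and correlations appear in Table 5 and are not reproduced here; the sketch below runs the same two approaches on invented values (all numbers hypothetical) to show that they select the same number of features. The adjusted p-value used is \(q_{(i)}=\min_{j\ge i} P\,p_{(j)}/l_{j}\), the natural step-up analogue of the BH adjustment.

```python
alpha = 0.10
# hypothetical sorted p-values and consecutive correlations (NOT Table 5)
p = [0.002, 0.009, 0.020, 0.041, 0.055, 0.210, 0.440, 0.800]
r = [0.0, 0.9, 0.8, 0.3, 0.1, 0.0, 0.5, 0.2]
P = len(p)

# thresholds under the moderate (M2) rule
l = [1.0]
for i in range(1, P):
    l.append(l[-1] + (1 - abs(r[i])))

# approach 1: compare sorted p-values with adjusted thresholds l_i*alpha/P
K_thresh = max([i + 1 for i in range(P) if p[i] <= l[i] * alpha / P], default=0)

# approach 2: adjusted p-values q_i = min_{j>=i} P*p_j/l_j, compare with alpha
q = [min(P * p[j] / l[j] for j in range(i, P)) for i in range(P)]
K_adj = sum(1 for qi in q if qi <= alpha)

assert K_thresh == K_adj  # both approaches agree, as in Table 5
```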
Simulation study
We set the dimension to P = 1000 features in two independent groups of equal size \(n_{1} = n_{2} = 100\) and generated observations for these features according to the following scheme, with 1000 replications.

1
Simulate δ_{i}: \({\varvec{\delta}}=\left({\delta }_{1},{\delta }_{2},\dots ,{\delta }_{P}\right)\sim MVNorm\left(0,{\upsigma }^{2}{\varvec{I}}\right)\) with \({\upsigma }^{2}=0.0678\)

2
Simulate Z_{i}:
$$Z_{i}\mid \text{group}=1: {\varvec{Z}}=\left(Z_{1},Z_{2},\dots ,Z_{P}\right)\sim MVNorm\left({\varvec{\delta}},(1-\rho ){\varvec{I}}+\rho {\varvec{J}}\right)$$$$Z_{i}\mid \text{group}=2: {\varvec{Z}}=\left(Z_{1},Z_{2},\dots ,Z_{P}\right)\sim MVNorm\left(0,(1-\rho ){\varvec{I}}+\rho {\varvec{J}}\right)$$
with \(\rho = 0, 0.2, 0.4, 0.5, 0.6, 0.8, 0.9, 0.95, 0.99\).
We set the variance of \(\delta\) to 0.0678 to have strong control of the Family Wise Error Rate (FWER); that is, we bounded the number of features screened by the BF adjustment method to less than 5% of the total features, regardless of which null hypotheses are true and which are false.
At each replication we conducted P independent-samples t-tests, sorted their p-values, and then calculated the adjusted p-values according to the BH procedure, our proposed procedures, and the BF procedure. We then set \(\alpha = 0.05\) and counted the rejected null hypotheses (screened features) without adjustment (p-value < \(\alpha\)) and with adjustment (adjusted p-value < \(\alpha\)) for each procedure. The mean and standard deviation of the number of discoveries over all (r = 1000) replications were calculated separately for each value of \(\rho\).
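One replication of the scheme above can be sketched as follows, reduced to P = 50 features for speed. The values of `sigma2` and `n` follow the paper; the shared-factor construction (exact for the compound-symmetry covariance \((1-\rho)\boldsymbol{I}+\rho\boldsymbol{J}\) when \(\rho\ge 0\)) and the large-sample normal approximation for the t-test p-values are implementation assumptions of this sketch, not the authors' R code.

```python
import math, random

random.seed(1)
P, n, rho, sigma2, alpha = 50, 100, 0.4, 0.0678, 0.05

# step 1: simulate the mean shifts delta ~ MVNorm(0, sigma2 * I)
delta = [random.gauss(0, math.sqrt(sigma2)) for _ in range(P)]

def draw_subject(mean):
    """One subject under covariance (1-rho)I + rho*J via a shared factor."""
    w = random.gauss(0, 1)
    return [m + math.sqrt(rho) * w + math.sqrt(1 - rho) * random.gauss(0, 1)
            for m in mean]

g1 = [draw_subject(delta) for _ in range(n)]      # step 2, group 1 (shifted)
g2 = [draw_subject([0.0] * P) for _ in range(n)]  # step 2, group 2 (null)

pvals = []
for j in range(P):                                # P two-sample tests
    x = [row[j] for row in g1]
    y = [row[j] for row in g2]
    mx, my = sum(x) / n, sum(y) / n
    vx = sum((v - mx) ** 2 for v in x) / (n - 1)
    vy = sum((v - my) ** 2 for v in y) / (n - 1)
    t = (mx - my) / math.sqrt(vx / n + vy / n)
    # two-sided p-value, normal approximation to the t distribution
    pvals.append(math.erfc(abs(t) / math.sqrt(2)))

n_raw = sum(p < alpha for p in pvals)             # unadjusted discoveries
n_bf = sum(p < alpha / P for p in pvals)          # Bonferroni discoveries
```

Repeating this for each \(\rho\) and each procedure, and averaging the discovery counts over replications, reproduces the structure of the simulation summaries in Additional files 1 and 2.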
Real data application: gene expression data from colon cancer patient tissues
In this section, we evaluate the performance of the proposed procedures in an analysis of a real data set. From the GSE44861 data set of colorectal cancer, we used 111 samples of microarray tests with 22,277 gene expression levels and a binary status feature, including 56 samples of cancer tissue (Y = 1) and 55 samples of healthy tissue (Y = 0). These data were generated using the Affymetrix GeneChip platform and have been preprocessed, with the gene expression levels presented as fragments per kilobase million (FPKM). The normalization process was done using the “edgeR” package in R. This data set is freely available for researchers to investigate gene expression patterns in colon tumors and identify potential biomarkers of colorectal cancer. These data were registered in the GEO database in 2013 and updated in 2017.
If the difference in expression of a specific gene between cancerous and non-cancerous cells is significant, it can be concluded that the gene is related to colorectal cancer. We used a t-test to find genes associated with colon cancer and to select the significant gene expressions. The hypotheses of this test are \(H_{0i}: \mu_{1i}=\mu_{2i}\) versus \(H_{1i}: \mu_{1i}\ne \mu_{2i}\),
where \(\mu_{1i}\) is the mean expression of the ith gene in group 1 (cancerous tissue), \(\mu_{2i}\) is the mean expression of the ith gene in group 2 (healthy tissue), and P = 22,277. In this way, the p-values of the t-tests for all features are determined. We then sorted the p-values in ascending order and estimated the bivariate correlation between consecutive sorted test statistics by calculating the bivariate correlation between the corresponding sorted features. Finally, the adjusted p-values based on the BF, BH, BY, and the three proposed procedures M1, M2, and M3 were calculated and compared with \(\alpha\).
To assess the efficiency of the features screened by the different procedures, a multiple logistic regression model was used. Due to quasi-complete separation and the small sample size, the ordinary maximum likelihood approach did not converge. Therefore, the Efficient Bayesian Logistic Regression (EBLR) model, developed under a highly efficient Ultimate Polya Gamma Markov Chain Monte Carlo (MCMC) algorithm, was used. The “UPG” package under R 4.3.1 was used to fit the EBLR model on the screened features.
To compare the results of the EBLR model on screened features by different procedures; BF, BY, BH, M1, M2, and M3, we use three approaches:

1.
Estimating the entropy (negative log-likelihood) of the models

2.
Estimating the Area Under the ROC Curve (AUC) to show the predictive power of models

3.
Drawing the box plot of the predicted probability of allocation to the cancerous tissue group (Y = 1) versus the true status (Y = 1/0, cancerous or healthy tissue group) for all models
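The paper computes these summaries from EBLR models fitted in R with the UPG package; the sketch below shows the first two measures computed from predicted probabilities in a model-agnostic way. The inputs `y` (true labels) and `p_hat` (predicted probabilities) are assumptions of the example.

```python
import math

def entropy(y, p_hat):
    """Negative log-likelihood of a binary model's predictions:
    smaller values indicate a more efficient fit."""
    return -sum(math.log(p if yi == 1 else 1 - p)
                for yi, p in zip(y, p_hat))

def auc(y, p_hat):
    """Area under the ROC curve via the Mann-Whitney statistic: the
    probability that a random cancerous case receives a higher predicted
    probability than a random healthy case, counting ties as 1/2."""
    pos = [p for yi, p in zip(y, p_hat) if yi == 1]
    neg = [p for yi, p in zip(y, p_hat) if yi == 0]
    wins = sum((a > b) + 0.5 * (a == b) for a in pos for b in neg)
    return wins / (len(pos) * len(neg))
```

Comparing these two numbers across the feature sets screened by BF, BY, BH, M1, M2, and M3 mirrors approaches 1 and 2 above; the box plot in approach 3 is simply the distribution of `p_hat` split by `y`.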
Change history
23 February 2024
A Correction to this paper has been published: https://doi.org/10.1186/s12859-024-05696-8
References
Storey JD, Taylor JE, Siegmund D. Strong control, conservative point estimation and simultaneous conservative consistency of false discovery rates: a unified approach. J Roy Stat Soc B. 2004;66(1):187–205.
Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc Ser B (Methodol). 1995;57(1):289–300.
Qian HR, Huang S. Comparison of false discovery rate methods in identifying genes with differential expression. Genomics. 2005;86(4):495–503.
Benjamini Y, Yekutieli D. The control of the false discovery rate in multiple testing under dependency. Ann Stat. 2001;29(4):1165–88.
Fan J, Han X, Gu W. Estimating false discovery proportion under arbitrary covariance dependence. J Am Stat Assoc. 2012;107(499):1019–35.
Fan J, Han X. Estimation of the false discovery proportion with unknown dependence. J R Stat Soc Ser B (Stat Methodol). 2017;79(4):1143–64.
Zhang J, Coombes KR. Sources of variation in false discovery rate estimation include sample size, correlation, and inherent differences between groups. BMC Bioinf. 2012;13(13):1–11.
Schwartzman A, Lin X. The effect of correlation in false discovery rate estimation. Biometrika. 2011;98(1):199–214.
Wang X, Shojaie A, Zou J. Bayesian hidden Markov models for dependent large-scale multiple testing. Comput Stat Data Anal. 2019;136:123–36.
Sun W, Tony Cai T. Large-scale multiple testing under dependence. J R Stat Soc Ser B. 2009;71(2):393–424.
Efron B. Correlation and large-scale simultaneous significance testing. J Am Stat Assoc. 2007;102(477):93–103.
Owen AB. Variance of the number of false discoveries. J R Stat Soc B. 2005;67(3):411–26.
Qiu X, Yakovlev A. Some comments on instability of false discovery rate estimation. J Bioinf Comput Biol. 2006;4(05):1057–68.
Qiu X, Klebanov L, Yakovlev A. Correlation between gene expression levels and limitations of the empirical Bayes methodology for finding differentially expressed genes. J Bioinf Mol Biol. 2005;4(1):1.
Clarke S, Hall P. Robustness of multiple testing procedures against dependence. Ann Stat. 2009;37(1):332–58.
Wu WB. On false discovery control under dependence. Ann Stat. 2008;36(1):364–80.
Finner H, Dickhaus T, Roters M. Dependency and false discovery rate: asymptotics. Ann Stat. 2007;35(4):1432–55.
Li J, Zhong PS. A rate optimal procedure for recovering sparse differences between high-dimensional means under dependence. Ann Stat. 2017;45(2):557–90.
Sun W, Wei Z. Multiple testing for pattern identification, with applications to microarray time-course experiments. J Am Stat Assoc. 2011;106(493):73–88.
Du L, et al. False discovery rate control under general dependence by symmetrized data aggregation. J Am Stat Assoc. 2021;118:1–34.
Risser MD, Paciorek CJ, Stone DA. Spatially dependent multiple testing under model misspecification, with application to detection of anthropogenic influence on extreme climate events. J Am Stat Assoc. 2019;114(525):61–78.
Benjamini Y, Heller R. False discovery rates for spatial signals. J Am Stat Assoc. 2007;102(480):1272–81.
Leek JT, Storey JD. A general framework for multiple testing dependence. Proc Natl Acad Sci. 2008;105(48):18718–23.
Friguet C, Kloareg M, Causeur D. A factor model approach to multiple testing under dependence. J Am Stat Assoc. 2009;104(488):1406–15.
Hall P, Jin J. Innovated higher criticism for detecting sparse signals in correlated noise. Ann Stat. 2010;38(3):1686–732.
Farcomeni A. A review of modern multiple hypothesis testing, with particular attention to the false discovery proportion. J Stat Methods Med Res. 2008;17(4):347–88.
Nazari E, et al. Machine learning approaches for classification of colorectal cancer with and without feature selection method on microarray data. Gene Rep. 2021;25: 101419.
Zhao BW, et al. Fusing higher- and lower-order biological information for drug repositioning via graph representation learning. IEEE Trans Emerg Top Comput. 2023;1:1.
Zhao BW, et al. GRLDTI: an improved graph representation learning method for predicting drug–target interactions over heterogeneous biological information network. Bioinformatics. 2023;39(8):451.
Funding
The authors declare no funding support for present study.
Author information
Authors and Affiliations
Contributions
R.S.: Statistical analysis and R codes, and wrote the introduction. S.A.: Conceptualized and designed the procedures, improved the R codes, and wrote the results and the discussion. T.H.: Statistical analysis, and wrote the methods. All authors reviewed the manuscript.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Availability of data and materials
All data used in this study are from open-access sources: the GSE44861 data set of colorectal cancer is attached as Additional file 3: S3. The R codes written for this study are available in Additional file 4: S4.
Competing interests
The authors declare no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
The original version of this article was revised: the author’s name has been updated.
Supplementary Information
Additional file 1:
S1. Descriptive statistics of the number of screened features by BF, BH, BY, M1, M2, and M3 procedures at the different levels of correlation (ρ) in the simulation study for n1=n2=100.
Additional file 2:
S2. Descriptive statistics of the number of screened features by BF, BH, BY, M1, M2, and M3 procedures at the different levels of correlation (ρ) in the simulation study for n1=n2=30.
Additional file 3:
S3. 22,278 Features (Gene Expressions) of Colorectal Cancer Data.
Additional file 4
S4. R Codes.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Rastaghi, S., Saki, A. & Tabesh, H. Modifying the false discovery rate procedure based on the information theory under arbitrary correlation structure and its performance in highdimensional genomic data. BMC Bioinformatics 25, 57 (2024). https://doi.org/10.1186/s12859-024-05678-w
DOI: https://doi.org/10.1186/s12859-024-05678-w