A multiple testing procedure for multi-dimensional pairwise comparisons with application to gene expression studies

Background Often researchers are interested in comparing multiple experimental groups (e.g. tumor size) with a reference group (e.g. normal tissue) on the basis of thousands of features (e.g. genes) and determine if a differentially expressed feature is up or down regulated in a pairwise comparison. There are two sources of false discoveries, one due to multiple testing involving several pairwise comparisons and the second due to falsely declaring a feature to be up (or down) regulated when it is not (known as directional error). Together, the total error rate is called the mixed directional false discovery rate (mdFDR). Results We develop a general powerful mdFDR controlling testing procedure and illustrate the methodology by analyzing uterine fibroid gene expression data (PLoS ONE 8:63909, 2013). We identify several differentially expressed genes (DEGs) and pathways that are specifically enriched according to the size of a uterine fibroid. Conclusions The proposed general procedure strongly controls mdFDR. Several specific methodologies can be derived from this general methodology by using appropriate testing procedures at different steps of the general procedure. Thus we are providing a general framework for making multiple pairwise comparisons. Our analysis of the uterine fibroid growth gene expression data suggests that molecular characteristics of a fibroid changes with size. Our powerful methodology allowed us to draw several interesting conclusions regarding the molecular characteristics of uterine fibroids. For example, IL-1 signaling pathway (Sci STKE 2003:3, 2003), associated with inflammation and known to activate prostaglandins that are implicated in the progression of fibroids, is significantly enriched only in small tumors (volume < 5.7 cm3). It appears that the molecular apparatus necessary for fibroid growth and development is established during tumor development. A complete list of all DEGs and the corresponding enriched pathways according to tumor size is provided for researchers to mine these data. Identification of these DEGs and the pathways may potentially have clinical implications. Electronic supplementary material The online version of this article (doi:10.1186/s12859-016-0937-5) contains supplementary material, which is available to authorized users.

Supplementary Materials: A multiple testing procedure for multi-dimensional pairwise comparisons with application to gene expression studies Anjana Grandhi, Wenge Guo, Shyamal D. Peddada

S1 Notations and Definitions
Let "m" denote the number of genes in the data that has gene expressions for each gene on "p" categories (tumor sizes and normal tissue). Let µ ij denote the mean response corresponding to the i-th category in j-th gene, i = 1, ..., p, j = 1, ..., m. The problem of biological interest we discuss in the context of uterine fibroid data is to detect genes that are differentially expressed in a tumor size category compared to normal sample. Thus, if the last category, "p", corresponds to normal sample, we need to test the pairwise differences θ ij = µ ij − µ pj , i = 1, 2, ..., q and j = 1, 2, ..., m, are equal to zero, where, q = p−1. For each gene j, the pairwise null and alternative hypotheses are, H j 0i : θ ij = 0 against H j 1i : θ ij = 0, i = 1, . . . , q. (S1) For each gene j, we have a vector of parameters θ j = (θ 1j , θ 2j , ..., θ qj ). We first need to find out the genes that are differentially expressed in at least one tumor sample compared to normal samples. Thus, we define the null and alternative "screening hypotheses" to test the significance of each gene as, H j 0screen : θ j = 0 against H j 1screen : θ j = 0, j = 1, . . . , m, for testing whether all parameters θ ij , i = 1, ..., q are simultaneously 0 or not, equivalently, whether all µ ij , i = 1, ..., p are equal or not. These hypotheses give rise to m families of hypotheses corresponding to m genes with each family having a screening hypothesis and"q" pairwise hypotheses. Figure S1 shows a simple graphical representation of the structure of hypotheses in our formulation. Let x k ij denote the k th observed gene expressions of the j th gene in i th group, k = 1, . . . , n i with n i being the sample size for i th group, j = 1, . . . , m, i = 1, . . . , p. Let T ij and P ij , i = 1, . . . , q, j = 1, . . . , m, denote the test statistics and the p-values respectively for testing H j 0i . The test statistics for testing the screening hypotheses H j 0screen are obtained as a function of T ij , i = 1, . . . , q, for instance, the highest order statistic of T ij , i = 1, . . . , q. We denote the p-values for testing screening hypotheses as P j screen . For each family j we denote a vector of p-values, P j = (P 1j , P 2j , ..., P qj ) based on the test statistics T j = (T 1j , T 2j , ..., T qj ), for testing the pairwise hypotheses in (S1). If H j 0i is rejected we conclude on direction, i.e., Given the screening p-values P j screen , for every j = 1, 2, ..., m, to carry out the simultaneous testing of the screening hypotheses in (S2), we use the BH-procedure [1], that controls the FDR at a given level α. This is a step-up procedure as follows: given the ordered screening p-values P screen (1) ≤ P screen (2) ≤ · · · ≤ P screen (m) and the corresponding screening null hypotheses H 0screen (1), H 0screen (2), ..., H 0screen (m), find, R = max {1 ≤ j ≤ m : P screen (j) ≤ jα/m} and reject H 0screen (1), . . . , H 0screen (R), provided the maximum exists, otherwise, accept all the screening hypotheses.
When an H j 0screen : θ j = 0 is rejected, further decisions are made on the pairwise hypotheses in (S1) and on rejection, directional decisions are made on the signs of the component θ ij . A Type I error might occur due to wrongly rejecting H j 0screen or correctly rejecting H j 0screen but wrongly rejecting H j 0i for some i = 1, . . . , q. A Type II error might occur due to failing to reject a false null hypothesis H j 0screen or correctly rejecting a false H j 0screen but failing to reject a false null pairwise hypothesis H j 0i for some i = 1, . . . , q. A directional error (Type III error) might occur due to correctly rejecting H j 0screen but wrong assignment of the sign of θ ij while correctly rejecting H j 0i : θ ij = 0. The general practice in any multiple testing problem is to find a procedure that controls the Type I errors and minimizes the Type II errors. Here, we need to control Type I as well as Type III errors. A practical way of doing that would be to use an error rate combining both Type I and Type III errors in the FDR framework and make sure that it is controlled.
An error rate that combines Type I errors and Type III errors in FWER setup is mdFWER [2,3], which is the probability of making at least one Type I error or Directional error. Heller et al. [4] used the Overall False Discovery Rate (OFDR), which is the expected proportion of falsely discovered gene sets out of all discovered gene sets, as an appropriate error measure to control, in their two-stage procedure for identifying differentially expressed genes and gene sets. The concept of OFDR was introduced by Benjamini and Heller [5] in the context of testing partial conjunction hypotheses. Inspired by Heller et al. [4] and Shaffer [2], Guo et al. [6] define the mixed directional False Discovery Rate (mdFDR) defined below.
Let V (j) denote the indicator function of at least one Type I error or Directional Error committed while testing family j and the pairwise hypotheses in it, i.e., V (j) is 1 if either H j 0screen is falsely rejected or H j 0screen is correctly rejected but at least one Type I error or Directional error occurs while testing pairwise hypotheses in the family j; V (j) is 0 otherwise. Let, R denote the number of screening hypotheses rejected by a multiple testing procedure, that is, R = max {1 ≤ j ≤ m : P screen (j) ≤ jα/m}. Then, mdFDR is formally defined as follows. Definition 1: mdFDR -mixed directional False Discovery Rate. The expected proportion of Type I and Directional errors among all discovered families,
First consider the second term in (S5). By the assumption of independence of p-value vectors we can write it as follows: The inequality in (S6) follows as we use an mdFWER controlling procedure at level (rα/m) for each significant family and as j ∈ I 1 , the probability of making at least one Type I error or directional error in family j is ≤ rα/m. Summing over all values of r, the equality in (S7) follows by noting that m r=1 P r R (−j) = r − 1 = 1. Next consider the first term in (S5). By the assumption of independence of p-value vectors we can write it as follows: The inequality in (S8) follows due to the fact that the true null p-values are stochastically larger than or equal to U (0, 1). Summing over all values of r, the equality in (S9) follows by noting that m r=1 P r R (−j) = r − 1 = 1. The result follows by combining (S7) and (S9).
In the theorem we only assume that the p-value vectors are independent and we do not discuss about the component pairwise p-values. This implies that the p-values within a gene across tumor samples may have any dependence structure. The mdFWER controlling procedure used will tell us under what kind of dependence structures of p-values within genes is the procedure valid. For example, if we use Holm's mdFWER controlling procedure, which is proved to control the md-FWER when the p-values are independent, then this theorem is valid under the additional assumption that the pairwise p-values P ij , i = 1, 2, . . . , q of a vector P j are independent.
The generality of this algorithm makes it a flexible procedure to apply to several practical situations where multidimensional directional decisions are required to make. Although, in the paper, we discuss testing of differential gene expression in each tumor size against the normal sample for each gene, this procedure can be applied to any type of pairwise comparison desired to be tested for each gene. For example, if it is of interest to group genes by the inequalities among the mean responses, we would want to detect the pattern of mean responses in the p categories, known as directional pattern, and see how the mean responses vary across the categories. Some common inequalities are µ 1j ≤ µ 2j ≤ · · · ≤ µ pj (monotone pattern), µ 1j ≤ · · · ≤ µ ij ≥ µ (i+1)j ≥ · · · µ pj (umbrella pattern with peak µ ij ). To test for the pattern we need to test the differences of mean response of the categories, θ ij = µ i+1j − µ ij , i = 1, 2, ..., q and j = 1, 2, ..., m and q = p − 1. If the problem of interest is testing all pairwise differences of the p categories, possibly unordered, then q = p(p − 1)/2. Based on the question we want to answer from a data, appropriate methodology can be developed from this general procedure.

S3 Details of Statistical Methodology for FGS Gene Expression Data
Dunnett P screen for Step 1: The scenario is of comparing multiple groups with a common control group and the standard method used in this situation is the Dunnett test [7]. Dunnett test is a powerful method that is designed specifically for this kind of comparison. The test assumes that the underlying distributions of the data from the different groups have same variance and the test statistics are obtained by using a pooled estimate of the variance. This assumption is valid for the Uterine fibroid data as the gene expressions are normalized to have similar means and variances for comparability. The test statistic for testing (S1) is given by, where,x ij = (1/n i )  (q, ν), is the quantile of the above q-variate t-distribution such that, or equivalently, The observed values of T . We obtain the screening P -value for testing the screening hypotheses (S2) as follows: where, the probability is obtained from the CDF of q-variate t-distribution with ν degrees of freedom and the correlation structure defined in Dunnett [7]. Let R denote the number of rejected screening hypotheses while applying the BH procedure to these screening p-values.
Dunnett mdFWER controlling procedure for Steps 2-3: We use Dunnett procedure to obtain the Dunnett-adjusted p-values,P ij Dunnett , for testing the pairwise hypotheses as follows, (S12) We reject H j 0i if the corresponding adjusted P -valueP ij Dunnett ≤ Rα m and conclude θ ij > 0 if T Dunn ij > 0 and vice versa.

Methods Used in Simulation Study
In this section we describe the different mdFDR controlling methodologies used in the simulation study. We develop methodologies by combining Dunnett screening procedure with four different mdFWER controlling procedures for steps 2 and 3 and compare the performance of the resulting four methodologies with Guo et al. [6] methodology in terms of mdFDR control and power.
Screening Procedure for Step 1: Dunnett P screen : The Dunnett method [7,8,9] is a powerful method specifically designed to test hypotheses where several treatments are compared with a common control in an unbalanced one-way layout. The multiple pairwise comparisons we make with the FGS gene expression data fit into the framework of Dunnett test [7]. The test statistics, T ij , for testing the pairwise hypotheses, are obtained as described in [7], with details given in Section S3.

Procedures for Steps 2 and 3:
Dunnett Procedure: We obtain the Dunnett-adjusted pairwise p-values [8,9], to be used in Step 2 of the algorithm and call themP ij . The details are given in section S3. The procedure rejects H j 0i ifP ij ≤ Rα qm and conclude on sign of θ ij based on the sign of the test statistic T ij .
Holm Procedure: We use Holm's step-down procedure at level Rα m within each significant gene. Order the pairwise p-values for each significant gene j as P j (1) ≤ · · · ≤ P j (q) with corresponding hypotheses denoted as H j 0 (1), ..., H j 0 (q). Let k (k = 1, . . . , q) be the maximum index such that P j (i) ≤ Rα m(q−i+1) for all i ≤ k, then reject H j 0 (1), ..., H j 0 (k), conclude on direction based on the sign of the test statistic and accept the rest of the hypotheses.
Hochberg Procedure: We use Hochberg's step-up procedure at level Rα m within each significant gene to identify significant categories. Let k (k = 1, . . . , q) be the maximum index such that P j (k) ≤ Rα m(q−k+1) , then reject H j (1), ..., H j (k), conclude on direction based on the sign of the test statistic and accept the rest of the hypotheses.
Bonferroni Procedure: Bonferroni procedure is a commonly used single step multiple testing procedure that strongly controls FWER. Reject H j 0i if P ij ≤ Rα qm and conclude on direction based on the sign of the test statistic T ij .
Guo et al. [6] Procedure: The procedure of Guo et al. [6] is a special case of the general testing procedure. They first obtain the p-values, {P 1j , ..., P qj }, for testing the pairwise hypotheses in (S1) and use the Bonferroni pooling to obtain the screening p-values as follows: P j screen = q · min {P 1j , ..., P qj } .
Note down R, the number of significant genes. For each significant gene, use the Bonferroni procedure discussed above to identify significant pairwise differences and conclude on direction.

Results for the Simulation Study
In this section we present the results from the simulation study that consider different kinds of dependencies of the gene expressions. The dependence within genes across groups means that the gene expressions from different tumor samples are dependent but given any two genes for a sample, the expressions are independent between the two genes; as several tumor samples belong to same subject, this kind of dependence structure is valid. The dependence among genes means that the gene expressions from different genes are dependent but given any two samples for a gene, the expressions are independent; as the genes belonging to same gene set have similar activity, this kind of dependence structure is also valid in the FGS microarray data.
We define the concept of average power for the three step procedure for comparing different methodologies that control the mdFDR at the same level. Let, R denote the set of indices of rejected screening hypotheses H j 0screen with |R| = R. Let, R j , j ∈ R denote the number of pairwise hypotheses H j 0i rejected for each significant gene. Let, S denote the number of false null screening hypotheses rejected and let, S j , j ∈ R denote the number of false null pairwise hypotheses rejected for each discovered gene. Then, the average power for the three step procedure is defined as follows: Dependence within genes across groups.
In this case the components Z s ij , i = 1, ..., p are dependent with Z s ij ∼ N (µ ij , 1) and have a common correlation ρ = 0.2, 0.5, 0.8. The results are summarized in Figures 5 and Figures S2-S3. All five procedures control the mdFDR at less than α = 0.05. Once again, as in the case of independence, the proposed method gains in power compared to the other methods.

Dependence among genes.
We next considered the situation where gene expressions are dependent among genes. For this simulation, the components Z s ij , j = 1, ..., m are dependent with Z s ij ∼ N (µ ij , 1) and have a common correlation ρ = 0.2, 0.5, 0.8. The results are summarized in Figure 6 and Figures S4-S5. Again, all five procedures control the mdFDR at less than α = 0.05 and as in the case of independence, the proposed method gains in power compared to the other methods.
Author details