### Notations

Suppose we are interested in comparing two experimental conditions on the basis of mean expression levels of genes belonging to *K* pre-specified sets of genes *S*_{1}, *S*_{2},…, *S*_{K}. For instance, these gene sets may represent different pathways or biological functions derived from databases such as GO, KEGG, IPA, etc. Furthermore, suppose each gene set *S*_{k}, *k*=1,2,…,*K*, is a union of *m*_{k} pre-specified subsets *S*_{k,1},…, *S*_{k,m_k} such that *S*_{k}=*S*_{k,1}∪⋯∪*S*_{k,m_k}. Note that *S*_{k,i}∩*S*_{k,j} is not necessarily an empty set for *i*≠*j*. Suppose there are a total of *G* genes on the microarray and suppose **X**_{ij} is a *G*×1 random vector corresponding to the *j*^{th} sample, *j*=1,2,…,*n*_{i}, in the *i*^{th} group, *i*=1,2, with mean vector *E*(**X**_{ij})=μ_{i} and covariance matrix *Cov*(**X**_{ij})=Σ_{i}, where μ_{i}=(*μ*_{i1},…,*μ*_{iG})′, *i*=1,2.

For set *S*_{k}, we are interested in testing the following null and alternative hypotheses: *H*_{k}: μ_{1,k}=μ_{2,k} versus *H*′_{k}: μ_{1,k}≠μ_{2,k}, where μ_{i,k}=(*μ*_{i,j}: *j*∈*S*_{k}) denotes the mean vector of genes in the set *S*_{k} for samples from the *i*^{th} group, *i*=1,2. Similarly, for genes belonging to the subset *S*_{k,j}, the hypotheses of interest are *H*_{k,j}: μ_{1,kj}=μ_{2,kj} versus *H*′_{k,j}: μ_{1,kj}≠μ_{2,kj}, where μ_{i,kj}=(*μ*_{i,l}: *l*∈*S*_{k,j}) denotes the mean vector of genes in the subset *S*_{k,j} for samples from the *i*^{th} group, *i*=1,2.

### The test statistic and its null distribution

We shall now describe the test statistic using a generic notation. Suppose, for *i*=1,2, **X**_{i1}, **X**_{i2},…, **X**_{in_i} is a random sample of *G*×1 vectors from a common population with mean vector μ_{i} and covariance matrix Σ_{i}. Let **X̄**_{i} denote the sample mean vector corresponding to the *i*^{th} population, *i*=1,2, and let **S** denote the usual pooled sample covariance matrix. Samples randomly drawn from these two populations are independent. Then, under the assumption Σ_{1}=Σ_{2}, Hotelling's *T*^{2} statistic is proportional to (**X̄**_{1}−**X̄**_{2})′**S**^{−1}(**X̄**_{1}−**X̄**_{2}). For large values of *G*, statistics such as Hotelling's *T*^{2} and Fisher's linear discriminant function can be unstable since they involve the inversion of a high-dimensional covariance matrix **S**. In the context of discriminant analysis [25], it was surprisingly found that the linear discriminant function that ignored the off-diagonal elements of **S** performed better than Fisher's linear discriminant function, which used the entire matrix **S**. In addition, in practice it may not be suitable to assume that Σ_{1}=Σ_{2}. Motivated by these reasons, we use the following test statistic for testing the null hypotheses described in the above subsection:

*T* = (**X̄**_{1}−**X̄**_{2})′[*Diag*(**S**_{1})/*n*_{1} + *Diag*(**S**_{2})/*n*_{2}]^{−1}(**X̄**_{1}−**X̄**_{2}),  (1)

where *Diag*(**S**_{i}) is a diagonal matrix containing the diagonal elements of the sample covariance matrix **S**_{i}, *i*=1,2.
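As an illustration, a statistic of the form (1) can be computed without forming any full covariance matrix, since only the diagonal elements enter. A minimal sketch, in which the function name and the samples-by-genes array layout are our own assumptions:

```python
import numpy as np

def diag_test_statistic(X1, X2):
    """Squared mean differences scaled by per-gene variances only
    (off-diagonal covariances ignored), allowing Sigma_1 != Sigma_2.
    X_i is an (n_i x G) array: rows are samples, columns are genes."""
    n1, n2 = X1.shape[0], X2.shape[0]
    d = X1.mean(axis=0) - X2.mean(axis=0)
    # Diag(S_i)/n_i: per-gene sampling variances of the group means
    v = X1.var(axis=0, ddof=1) / n1 + X2.var(axis=0, ddof=1) / n2
    return float(np.sum(d * d / v))
```

Because the denominator keeps each group's own variances, the statistic remains well defined under heteroscedasticity and avoids inverting a *G*×*G* matrix.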

Since the underlying gene expression data are not necessarily multivariate normally distributed, and the covariance matrices of the two groups are potentially unequal, the exact distribution of the above test statistic under the null hypothesis cannot be determined easily. We therefore adopt a bootstrap methodology for simulating the null distribution of the test statistic, such that the resulting methodology is not only robust to heteroscedasticity but also preserves the underlying dependence structure among genes. To do so, we draw a simple random sample (with replacement) of *n*_{i} subjects from the *i*^{th} group, *i*=1,2, and construct the bootstrap data using the residuals **ε***_{ij} = **X***_{ij} − **X̄**_{i}, *i*=1,2, *j*=1,2,…,*n*_{i}, from the resampled subject *j*. Thus the bootstrap data are given by **Y***_{ij} = **X̄** + **ε***_{ij}, *i*=1,2, *j*=1,2,…,*n*_{i}, where **X̄** = (*n*_{1}**X̄**_{1} + *n*_{2}**X̄**_{2})/(*n*_{1}+*n*_{2}), the weighted average of the two sample means, and **ε***_{ij} is the residual corresponding to the *j*^{th} subject selected. For more details regarding the residual bootstrap methodology we refer the reader to [26, 27].

It is important to recognize that the residual bootstrap methodology implemented here is different from the usual bootstrap methodology. The standard bootstrap may not honor the structure present in the data and hence may potentially result in an inflated false positive rate. We remark that our proposed test statistic resembles the test statistic of [9], in the sense that neither procedure uses the off-diagonal elements of the estimated covariance matrices. The two procedures differ, however, in the denominators used in the test statistic: the proposed test allows for unequal variances in the two populations being compared. Secondly, and more importantly, the two procedures fundamentally differ in the resampling schemes used. As noted above, the proposed methodology bootstraps the residuals and thus allows for any underlying dependence structure in the data (unknown to the user), whereas the resampling scheme used in [9] intrinsically assumes that under the null hypothesis the two populations under comparison are identically distributed, which is often not the case in practice. This is a major and important difference between the two methods.
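The residual bootstrap scheme above can be sketched as follows; `stat` stands for any two-sample statistic, such as (1), and the function name is our own:

```python
import numpy as np

def residual_bootstrap_pvalue(X1, X2, stat, B=2000, seed=None):
    """Simulate the null distribution of stat(X1, X2) by resampling
    subjects within each group, taking residuals about the group mean,
    and recentering both groups at the weighted-average mean so the
    null hypothesis of equal means holds in the bootstrap world."""
    rng = np.random.default_rng(seed)
    n1, n2 = X1.shape[0], X2.shape[0]
    xbar1, xbar2 = X1.mean(axis=0), X2.mean(axis=0)
    xbar = (n1 * xbar1 + n2 * xbar2) / (n1 + n2)  # weighted average of means
    t_obs = stat(X1, X2)
    count = 0
    for _ in range(B):
        # resample subjects with replacement, then recenter the residuals
        Y1 = xbar + (X1[rng.integers(0, n1, n1)] - xbar1)
        Y2 = xbar + (X2[rng.integers(0, n2, n2)] - xbar2)
        if stat(Y1, Y2) >= t_obs:
            count += 1
    # (count + 1)/(B + 1) is a common finite-sample correction of count/B
    return (count + 1) / (B + 1)
```

Because whole subject vectors are resampled, each group's dependence structure among genes is carried into the bootstrap samples, while the recentering imposes only the equality of means, not equality of distributions.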

### The proposed strategy

For each subset *S*_{k,j}, let the test statistic (1) be denoted by *T*_{k,j} and let the corresponding bootstrap *p*-value be denoted by *P*_{k,j}. If we have only a single gene-set *S*_{k} with *m*_{k} gene-subsets *S*_{k,1},…, *S*_{k,m_k}, then within *S*_{k} the problem of testing the significance of *S*_{k,1},…, *S*_{k,m_k} is formulated as a problem of simultaneously testing a family of *m*_{k} null hypotheses, *H*_{k,1},…, *H*_{k,m_k}, using the available *p*-values *P*_{k,j}. The gene-set *S*_{k} is declared to be significant if and only if at least one *H*_{k,j} is rejected in the above problem of multiple testing.

There are two popular notions of type I error rate when dealing with the problem of simultaneously testing multiple hypotheses: one is to control the FWER, which is the probability of falsely rejecting at least one true null hypothesis, and the other is to control the FDR, which is the expected ratio of false rejections to the total number of rejections [23]. In this article we shall only describe methods to control the FWER.

There are several FWER-controlling procedures available in the literature for testing the family of null hypotheses *H*_{k,1},…, *H*_{k,m_k}. In this paper we consider the following Bonferroni-based procedure: for a given family of null hypotheses *H*_{k,1},…, *H*_{k,m_k}, we reject *H*_{k,j} if *P*_{k,j}≤*α*/*m*_{k}. The corresponding Bonferroni-adjusted *p*-value for the family is min(1, *m*_{k}·min_{1≤*j*≤*m*_{k}} *P*_{k,j}). Similarly, if we have multiple gene-sets *S*_{k}, *k*=1,…,*K*, each having *m*_{k} gene-subsets *S*_{k,1},…, *S*_{k,m_k}, then the problem of testing the significance of all gene-subsets in the *K* gene-sets is formulated as a problem of simultaneously testing *K* families of null hypotheses, using the available *p*-values *P*_{k,1},…, *P*_{k,m_k}, *k*=1,…,*K*; a gene-set *S*_{k}, *k*=1,…,*K*, is declared to be significant if and only if at least one hypothesis *H*_{k,j} in its family is rejected.

For testing the *K* families of null hypotheses, a simple Bonferroni-based strategy is proposed as follows.

The Procedure

Step 1. Compute the raw residual-bootstrap *p*-value for each subset of genes.

Step 2. Compute the adjusted *p*-value for each set *S*_{k} (adjusting for the number of subsets within the set) as described above.

Step 3. Declare a set *S*_{k} to be significant if its adjusted *p*-value is less than *α*/*K*. A subset *S*_{k,j} within the set *S*_{k} is declared to be significant if its raw *p*-value *P*_{k,j} is less than *α*/(*K* *m*_{k}).

It is easy to see that the proposed procedure strongly controls the overall FWER, the probability of falsely rejecting at least one true null hypothesis in some family, for any dependent test statistics.
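Steps 1 to 3 can be sketched in code as follows; this is a minimal illustration under our own naming (`hierarchical_bonferroni` is not from the paper), taking the raw bootstrap *p*-values of Step 1 as input:

```python
def hierarchical_bonferroni(pvalues, alpha=0.05):
    """pvalues[k][j] is the raw bootstrap p-value of subset S_{k,j}.
    Returns one (set_significant, subset_significances) pair per set."""
    K = len(pvalues)
    results = []
    for pk in pvalues:
        mk = len(pk)
        # Step 2: Bonferroni-adjusted p-value for the set
        adj = min(1.0, mk * min(pk))
        # Step 3: set-level test at alpha/K, subset-level at alpha/(K*mk)
        set_sig = adj < alpha / K
        subset_sig = [p < alpha / (K * mk) for p in pk]
        results.append((set_sig, subset_sig))
    return results
```

Note that the set-level rule is equivalent to rejecting when min_{j} *P*_{k,j} < *α*/(*K* *m*_{k}), so a set is significant exactly when at least one of its subsets is.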

When the number of gene sets and gene subsets is large, it might be preferable to control the FDR rather than the FWER. The above testing strategy controlling the FWER can easily be modified to control the FDR by using the BH procedure in place of the Bonferroni procedure when simultaneously testing the significance of the gene sets. Such a modified strategy is very similar to a two-stage test strategy developed in [28] for controlling the overall FDR while selecting significant gene sets and their significant individual genes.
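A generic Benjamini-Hochberg step-up adjustment, which could replace the Bonferroni adjustment in such a modification, can be sketched as follows (our own illustration, not code from [28]):

```python
def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted p-values (step-up procedure).
    Rejecting hypotheses with adjusted p-value <= q controls the FDR
    at level q for independent or positively dependent statistics."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])  # ascending p-values
    adj = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):          # walk from largest to smallest
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adj[i] = running_min              # enforce monotone adjusted values
    return adj
```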

### Simulation study

We evaluate the performance of the proposed methodology, in terms of power (the probability of rejecting at least one false null hypothesis) and FWER control, by comparing it with Tsai and Chen's method [11], which uses the shrinkage estimator of the sample covariance matrix proposed in [21]. Note also that, unlike the bootstrap residuals used in the proposed methodology for deriving the null distribution of the test statistic, the resampling scheme used in [11] resembles the scheme used in [9]. Such resampling schemes do not honor the differences (if any) in the dependence structures of the two populations being compared. Thus, if the two populations have different covariance structures under the null hypothesis, then, as stated earlier in this paper, standard permutation or standard bootstrap methodology can potentially result in an inflated FWER.

### Study design

In the simulation study, we considered two values for the total number of gene sets, namely 5 and 10. Since, in practice, the number of subsets and the number of genes within a subset may be unknown a priori, we allowed the number of subsets within each set of genes to be uniformly distributed over the range 5 to 16, and the number of genes within each subset was generated according to a uniform distribution over the range 5 to 10. To understand the robustness of the two methods in terms of FWER control, we considered a variety of probability distributions for the gene expression, as follows:

(1) Multivariate normal distribution, of appropriate dimension, with mean vectors **0** (for the control group) and μ (for the treatment group), and covariance matrices Σ_{1} (for the control group) and Σ_{2} (for the treatment group), respectively. As commonly done, we assumed an intra-class correlation structure for the two covariance matrices, with variances *σ*^{2}_{1}, *σ*^{2}_{2} and correlation coefficients *ρ*_{1} and *ρ*_{2}, respectively. We considered two cases, namely *σ*_{1}=*σ*_{2}, *ρ*_{1}=*ρ*_{2} (homoscedastic, or homo.) and *σ*_{1}≠*σ*_{2}, *ρ*_{1}≠*ρ*_{2} (heteroscedastic, or hetero.). In practice, one never knows a priori whether we have homoscedasticity or heteroscedasticity. Since genes within each subset (whether in the control or the treated group) may have different variances, for each gene we let *σ*_{1} and *σ*_{2} each take one of the five values 0.1, 0.5, 1, 1.25 or 1.5 at random. Similarly, since the correlation coefficient between a pair of genes may not necessarily be constant across all subsets (whether in the control or the treated group), for each subset we let *ρ*_{1} and *ρ*_{2} each take one of the five values 0, 0.25, 0.5, 0.75 or 0.9 at random. Thus the variances and correlation coefficients vary randomly from subset to subset. For each subset of genes, we always let the mean vector be μ=**0** for the control group and μ=*δ***1** for the treatment group, where *δ* was taken to be 0.5, 1 or 1.5 and **1**=(1,1,…,1)′. For each group, we considered two sample sizes, namely 10 and 40.
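The intra-class correlation structure described above can be generated as follows; this is a sketch under our own naming, with gene-specific standard deviations and a common correlation *ρ* within a subset:

```python
import numpy as np

def intraclass_cov(sigmas, rho):
    """Covariance matrix with intra-class (compound-symmetry) correlation:
    Corr(X_a, X_b) = rho for all a != b, with gene-specific sds sigmas."""
    s = np.asarray(sigmas, dtype=float)
    p = len(s)
    R = np.full((p, p), rho)       # common off-diagonal correlation
    np.fill_diagonal(R, 1.0)
    return np.outer(s, s) * R      # Cov_ab = s_a * s_b * R_ab

def simulate_subset(n, sigmas, rho, delta=0.0, rng=None):
    """Draw n samples from MVN(delta * 1, Sigma) for one gene subset."""
    rng = np.random.default_rng(rng)
    Sigma = intraclass_cov(sigmas, rho)
    mu = np.full(len(sigmas), delta)   # 0 for control, delta for treatment
    return rng.multivariate_normal(mu, Sigma, size=n)
```

Drawing `sigmas` and `rho` afresh for each subset reproduces the random subset-to-subset variation described above.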

(2) Multivariate log-normal distribution, where the vector of natural logarithms of the components follows a multivariate normal distribution, with parameters as defined in the above multivariate normal setting, with *Σ*_{1}=*Σ*_{2}.

(3) Multivariate beta distribution. This distribution is motivated by CpG methylation data. Within each treatment group the multivariate beta vector was generated as follows. To generate a *p*-dimensional beta variable, we randomly generated *p* independent chi-square random variables *U*_{1}, *U*_{2},…, *U*_{p} with either 4 or 5 degrees of freedom, and generated an additional independent chi-square random variable *V* with either 1 or 2 degrees of freedom. The resulting multivariate beta-type random vector for a given treatment group is defined as **Z**=(*Z*_{1}, *Z*_{2},…, *Z*_{p})′, where *Z*_{i}=*U*_{i}/(*U*_{i} + *V*), *i*=1,2,…,*p*. With the above choices of degrees of freedom, the mean methylation values (commonly called "beta" values) for our simulated CpGs ranged from approximately 0.67 to about 0.83.
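The chi-square construction above can be sketched as follows (the function name is ours; the shared *V* is what induces dependence among the *p* components):

```python
import numpy as np

def multivariate_beta(n, p, df_u=4, df_v=1, rng=None):
    """Generate n correlated beta-type vectors Z with Z_i = U_i/(U_i + V),
    where the U_i are independent chi-squares and V is a single chi-square
    shared across all p components of a vector."""
    rng = np.random.default_rng(rng)
    U = rng.chisquare(df_u, size=(n, p))
    V = rng.chisquare(df_v, size=(n, 1))   # one V per vector, broadcast over p
    return U / (U + V)
```

Marginally each *Z*_{i} is Beta(df_u/2, df_v/2); for instance, df_u=4 and df_v=1 give mean 2/(2+0.5)=0.8, consistent with the reported 0.67 to 0.83 range.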

(4) Mixtures of multivariate normal random vectors. For each treatment group we generated a mixture of multivariate normally distributed data **Z** with mixing proportion *Π*=0.2. As in the case of the multivariate normally distributed data in (1), we considered homoscedastic as well as heteroscedastic covariance matrices for the normal vectors. The patterns of covariance matrices are as described in (1) above.

All our simulation results are based on a total of 1,000 simulation runs and 5,000 bootstrap samples.