A shortcut for multiple testing on the directed acyclic graph of gene ontology
- Garrett Saunders^{1, 3},
- John R Stevens^{1}Email author and
- S Clay Isom^{2}
https://doi.org/10.1186/s12859-014-0349-3
© Saunders et al.; licensee BioMed Central Ltd. 2014
Received: 16 May 2014
Accepted: 9 October 2014
Published: 1 November 2014
Abstract
Background
Gene set testing has become an important analysis technique in high throughput microarray and next generation sequencing studies for uncovering patterns of differential expression of various biological processes. Often, the large number of gene sets that are tested simultaneously require some sort of multiplicity correction to account for the multiplicity effect. This work provides a substantial computational improvement to an existing familywise error rate controlling multiplicity approach (the Focus Level method) for gene set testing in high throughput microarray and next generation sequencing studies using Gene Ontology graphs, which we call the Short Focus Level.
Results
The Short Focus Level procedure, which performs a shortcut of the full Focus Level procedure, is achieved by extending the reach of graphical weighted Bonferroni testing to closed testing situations where restricted hypotheses are present, such as in the Gene Ontology graphs. The Short Focus Level multiplicity adjustment can perform the full top-down approach of the original Focus Level procedure, overcoming a significant disadvantage of the otherwise powerful Focus Level multiplicity adjustment. The computational and power differences of the Short Focus Level procedure as compared to the original Focus Level procedure are demonstrated both through simulation and using real data.
Conclusions
The Short Focus Level procedure shows a significant increase in computation speed over the original Focus Level procedure (as much as ∼15,000 times faster). The Short Focus Level should be used in place of the Focus Level procedure whenever the logical assumptions of the Gene Ontology graph structure are appropriate for the study objectives and when either no a priori focus level of interest can be specified or the focus level is selected at a higher level of the graph, where the Focus Level procedure is computationally intractable.
Keywords
Background
Microarray technology and next generation sequencing have played an important role in discovering important associations between gene expression patterns and phenotype [1]. Such gene expression technologies have been instrumental in discoveries ranging from the retarding of aging in mice brought about by caloric restrictions in diet [2] to the identification of various types of diffuse large B-cell lymphoma in humans [3]; from characterizing the transcriptomes of in vitro maturated porcine embryos [4] to uncovering the underlying genes and pathways involved in Alzheimer’s disease [5]. While both microarray and next generation sequencing technologies allow researchers to study the differential expression of genes across conditions or treatments, each has their advantages and disadvantages [1]. However, in either case, the resulting increase in genetic knowledge has allowed researchers to group genes with common function into gene sets and test these gene sets for differential expression [6],[7].
One rich source of gene set knowledge is found in the Gene Ontology database [8]. The Gene Ontology (GO) provides a controlled vocabulary that is not specific to any particular species. This vocabulary is divided into three general ontologies, Molecular Function (MF), Cellular Component (CC), and Biological Process (BP). Individual GO Terms form the basis of these vocabularies and are structured through parent child relationships with more general terms as parents and more specific terms as children. Each GO Term typically contains a definition of its biological process (molecular function or cellular component) and other annotation as well as a mapping of all known gene products involved in its specified process (function or component). For convenience in presentation, the remainder of this work will focus on the biological process ontology.
Gene set testing allows for the quantification of the significance of activity level differences between treatment groups for specific biological processes of interest. For example, a recent study on human longevity compared the gene expression profiles corresponding to 1,808 different biological processes for nonagenarians and a control group to identify 73 biological processes associated with longevity [9]. When there are relatively few gene sets (biological processes) of a priori interest (1,808 in [9]), the impact of the multiplicity correction for the tests of differential expression (or differential activity) of the gene sets can be greatly lessened as compared to individually testing all member genes (45,164 in [9]), improving the power of the test. Even when no a priori gene set of interest can be specified, it can still be highly beneficial to test all known gene sets from a biological process database for differential expression, as the number of gene sets is still typically magnitudes smaller than the corresponding number of individual genes [6],[10].
Many methods of gene set testing have been proposed in the literature as reviewed in [11]. These can essentially be divided into two classes of gene set testing, often referred to as competitive tests and self contained tests. The competitive tests compare the expression profiles of the genes in the set to those not in the set. The self contained tests focus only on those genes within the set and compares them to some fixed standard. While the first are more popular [7],[12], the second have been shown to be more powerful [11],[13]. Further, the null hypothesis associated with the self containedtests,
${H}_{0}^{\mathit{\text{self}}}$ : no genes in the gene set are differentially expressed,
has been shown to be the more logical generalization of single gene testing (with other advantages that will be explained later on) as compared to the competitivetest null hypothesis
${H}_{0}^{\mathit{\text{comp}}}$ : the genes in the gene set are at most as often differentially expressed as the genes in the complement of the gene set.
While gene set testing methods are varied in their approach, they are alike in that they test each GO term, i.e. gene set, individually. Thus, when more than one GO term is tested simultaneously (typically hundreds or thousands are tested simultaneously) some sort of multiplicity adjustment is necessary to preserve control over either the familywise error rate (FWER) or the false discovery rate (FDR) or a derivative of these error rates. The FDR is typically the error rate of choice in exploratory studies where follow up confirmatory studies are then conducted [14]. On the other hand, the FWER is typically the suggested error rate for confirmatory studies [15]. We also suggest that the FWER is highly appropriate for exploratory gene set studies as, in our experience, it is seldom more results that are desired, but the most promising real significances that are sought. The FWER offers the best error rate control for such conclusions [15].
The Focus Level method is a powerful method of multiplicity adjustment for self contained gene set testing, which takes into account the structure of the GO graph while controlling (strongly) the FWER [10]. This approach is more powerful than standard FWER controlling methods such as the Bonferroni and uniformly more powerful Bonferroni-Holm [16] procedures for multiple testing with GO graphs [10]. However, it is important to note that this increase in power comes at the cost of requiring the logical structure of the GO graph to become part of the multiplicity adjustment.
The Focus Level method allows the researcher to select the level of the GO graph in which they are most interested. This is called the focus level. The procedure then applies a top-down and bottom-up approach from the specified focus level. First, the terms in the focus level are tested using the Bonferroni-Holm adjustment [16]. Then, in the bottom-up approach, any term above the focus level is declared significant when any of its offspring in the focus level have been declared significant. This inheritance of P-values is accomplished through the assumption that a parent term must be differentially expressed if any of its children terms are differentially expressed, a logical assumption for the GO graph structure [10]. In the top-down procedure, significance of the children of the focus level terms is decided through an application of the closed testing procedure of [17].
While the Focus Level method is a powerful approach to adjusting for multiplicity, it quickly becomes computationally infeasible when the selected focus level contains a large number of offspring in the GO graph [10]. This computational limitation makes it essentially impossible to perform the full top-down approach, a rather significant disadvantage [18]. Using the full top-down approach provides researchers the default focus level of the root node (GO:0008150 in the context of the BP GO ontology) whenever they have no a priori interest in a given focus level, a common scenario, see for example [18]. This also allows adjusted P-values to be considered apart from their context in the GO graph which is advantageous to reporting on single significant gene sets of interest. Discussions of the significant findings of the Focus Level method are currently restricted to their context within the GO graph [10].
This work proposes an improvement to the top-down portion of the Focus Level method [10] which we call the Short Focus Level as it performs a shortcut of the full Focus Level method. This is accomplished using a novel application of the general graphical Bonferroni adjustment for multiple testing as proposed by [19], which is a generalization of closed testing based on weighted Bonferroni tests [20]. The Short Focus Level procedure shows a significant improvement in computational speed (as much as ∼ 15,000 times faster) while maintaining similar power to that of the original Focus Level procedure and even showing a gain in power over the original Focus Level procedure for certain scenarios while experiencing a loss in power for others. Most importantly, the computational improvements are such that the full top-down method can now be performed on a standard operating system within just a few minutes. The R code [21] for the Short Focus Level procedure is included in the mvGSTmvGST package [22],[23]; see also Additional files 1 and 2.
Methods
- A1.
A non-differentially expressed parent gene set implies the children gene sets are also non-differentially expressed.
- A2.
If the children gene sets form a partition of the parent gene set and are all non-differentially expressed, then the parent gene set is also non-differentially expressed.
These assumptions ensure coherence of the resulting significant subgraph and facilitate interpretations [10]. Note that these assumptions are necessary if the objective is to control the FWER within the structure of the GO graph. If preserving the graph structure in the multiplicity correction is not of interest to the researcher, then the FWER (or even the false discovery rate) could be controlled by the standard Holm’s correction [16] (or any false discovery rate controlling method which allows for arbitrary dependence structures), however such an approach will result in a loss of coherence for the significant subgraph.
Assumptions A1 and A2 require that the null hypothesis for each gene set is that no genes in the gene set are differentially expressed. The alternative in each case being that at least one gene in the set is differentially expressed. Thus, only self contained gene set testing methods (which utilize this hypothesis framework) can be used to test the gene sets of the GO graph if the Focus Level method of multiplicity adjustment is used. This excludes gene set enrichment methods such as those proposed in [12] but supports very well the Global Test of [6], P-value combination methods such as Fisher’s and Stouffer’s methods [13],[24], as well as Global Ancova [25], PLAGE [26], and SAM-GS [27].
- FL1.
No offspring of a focus level term be contained in the focus level.
- FL2.
All remaining terms are either ancestors or offspring of the focus level terms.
The top-down portion of the Focus Level procedure of [10], which applies the closed testing approach of [17], requires closing the GO graph under all unions from the focus level down. This is done by treating each focus level term, along with all of its offspring terms, as separate graphs which are each closed under all possible unions. As these separate closed graphs will share common elements, the full closed graph $\stackrel{~}{G}$ is obtained by unioning each of the separately closed graphs into a single graph which is also unioned to all ancestor terms of the focus level.
To partially amend the computational difficulties of the Focus Level method, [10] implement a more efficient method of computing the closed graph using what they term “atom sets”. These atom sets form a core collection of gene sets which form a basis for all gene sets in the graph. All other gene sets in the graph (as well as its closure) can be created through unions of the atom sets. This ensures the size of the closed graph is 2^{ k }−1, where k is the number of atom sets, which is often smaller (and never larger) than the size of the original closed graph. Further, [10] recommend selecting the focus level so that no more than 9-12 atom sets are required to recreate the offspring of any single focus level term. They also suggest computing only the smallest few adjusted P-values to save computation time in place of computing all adjusted P-values.
This work offers an alternative solution to improve on the computational speed of the top-down portion of the Focus Level method through an application of the general graphical Bonferroni adjustment of [19]. This allows for a short-cut of length m in place of the currently applied full closed testing approach of [17]. In the following section, we summarize the general graphical Bonferroni adjustment approach and show how we tailor the method for a powerful application to the top-down portion of the Focus Level method.
The graphical Bonferroni adjustment
- R1.
The local thresholds α_{1},...,α_{ m } satisfy $\sum _{i=1}^{m}{\alpha}_{i}\le \alpha $.
- R2.
The edge weights satisfy 0≤g_{ ij }≤1, g_{ ii }=0, and $\sum _{k=1}^{m}{g}_{\mathit{\text{ik}}}\le 1$ for all i,j=1,...,m.
The updating algorithm defines a sequentially rejective test procedure and is given as follows [19]. Note that p_{ i } represents the observed P-value for the test of hypothesis H_{ i }.
The proof that Algorithm 1 defines a sequentially rejective closed testing procedure which strongly controls the FWER at level α is found in the Appendix of [19], and depends directly on Theorem 1 from [20]. Both [28] and [29] claim that Theorem 1 from [20] cannot be directly applied to the hypotheses of the GO graph as the hypotheses are nested, creating logical restrictions. In their own words, [28] claim that “the shortcut procedure of [20] cannot be applied to restricted hypotheses”. Similarly, [29] state, “these methods [19] cannot make use of logical relationships between hypotheses and, as such, do not incorporate graph-based methods which exploit such relationships, such as [the Focus Level procedure] of [10]”. However, in the following we present a restricted hypotheses example where the methods of [19] can be applied. The following section sets forward some important notation and vocabulary and then demonstrates that while these claims are technically true, the methods of [19] can be applied to the Focus Level method if one of the assumptions underlying Theorem 1 of [20] is slightly relaxed. We prove this with Theorem 1.
Restricted hypotheses example
Let H_{1},...,H_{ m } denote m hypotheses of interest and call these the elementary hypotheses. Let I denote a non-empty index set such that I⊆{1,...,m} and denote an intersection hypothesis by H_{ I } where H_{ I }=∩_{i∈I}H_{ i }. The closed test procedure [17] utilizes the intersection closed set of hypotheses $\mathcal{\mathscr{H}}=\{{H}_{I}:I\subseteq \{1,...,m\},I=\varnothing \}$. In the case that the hypotheses are unrestricted, $\left|\mathcal{\mathscr{H}}\right|={2}^{m}-1$ and Algorithm 1 of [19] is proven to hold. On the other hand, the hypotheses are restricted if for index sets I and J it is true that I≠J and H_{ I }=H_{ J } so that $\left|\mathcal{H}\right|<{2}^{m}-1$. In this case, Algorithm 1 cannot currently be applied [28],[29].
As the hypotheses corresponding to any GO graph are always restricted, the methods of [19] cannot be applied to the GO graph under the current framework. However, the following closed test example from [28] can be extended to demonstrate how Algorithm 1 can be applied to the case of restricted hypotheses. This example sets the stage for Theorem 1, where we relax the assumptions of [20] to formally establish how the methods of [19] can indeed be applied to restricted hypotheses, and hence, the GO graph.
A closed test approach to is given in [28] which begins with the raw p-values p_{1},p_{2},p_{3}, and p_{4} obtained from testing the original elementary hypotheses H_{1},H_{2},H_{3}, and H_{4}, each with α-level tests, respectively. To define the closed test approach, they compute the closed test p-values ${p}_{{H}_{i}}$ for each hypotheses H_{ i } in by the following rules. First, ${p}_{{H}_{1}}={p}_{1}$ and ${p}_{{H}_{3}}={p}_{3}$. Second, ${p}_{{H}_{2}}=max\{{p}_{1},{p}_{2}\}$ and ${p}_{{H}_{4}}=max\{{p}_{3},{p}_{4}\}$. Finally, ${p}_{{H}_{\mathit{\text{ij}}}}=min\{1,2{p}_{{H}_{i}},2{p}_{{H}_{j}}\}$, i=1,2 and j=3,4. The closed test procedure [17] is then applied to as depicted in panel (b) of Figure 3 using the closed test p-values ${p}_{{H}_{i}}$ as explained in the following paragraph.
Definitions and preliminaries to Theorem 1
A deeper inspection of Figure 5 will reveal the reason why the shortcut from [19] can be applied to the example of restricted hypotheses of the previous section. To explain how, we must first define two terms, consonance and natural consonance.
The traditional definition of consonance[30] relies on the idea of maximal hypotheses. It states that consonance is the property of certain closed tests where rejection of an intersection hypothesis ${H}_{i}\in \mathcal{H}$ implies rejection of a maximal hypothesis $H\in \mathcal{H}$. Here, a maximal hypothesis $H\in \mathcal{H}$ is such that there is no ${H}^{\text{'}}\in \mathcal{H}$ with H^{'}⊃H. (When the closed test corresponding to the hypotheses in is depicted graphically, as in panel (b) of Figure 3, in can be seen that maximal hypotheses correspond to the leaf nodes of the graph. Further, in context of the GO graph, maximal hypotheses correspond to the leaf nodes of the graph, while the minimal hypothesis corresponds to the root node of the graph). From the example of the previous section, it can be seen that only H_{2} and H_{4} are maximal. Thus, the closed test of the example is not consonant in the traditional sense as rejection of the intersection hypothesis H_{13} does not imply the rejection of either of the maximal hypotheses H_{2} or H_{4}.
Natural consonance is a similar, but slightly more relaxed property than consonance, and differs in that it implies the rejection of only an elementary hypothesis (not necessarily a leaf node in the closure graph) whenever any other hypothesis ${H}_{i}\in \mathcal{H}$ is first rejected. This relaxed definition is more recent and is due to [28]. Importantly, it is easier for a closed test to satisfy the property of natural consonance than that of consonance. The claims of both [28] and [29] that Algorithm 1 [19] is not applicable to restricted hypotheses rest on the subtle difficulty of how consonance is defined. Note (v) following Theorem 2 of [28] claims that “consonance with respect to the elementary hypotheses [natural consonance] always implies the existence of a nested shortcut of size m”, where m is the number of elementary hypotheses. The natural consonance of the closed test allows for the shortcut from [19] to be applied to the restricted hypothesis example of the previous section, as explained in the following paragraph.
Examining the flow chart of Figure 5 will reveal that the closed test procedure proposed by [28] has this property of consonance with respect to the elementary hypotheses H_{1}, H_{2}, H_{3}, and H_{4}, i.e., the closed test for this example is naturally consonant. This follows from the fact that rejection of the intersection hypothesis H_{13} implies rejection of either of the hypotheses H_{1} or H_{3} which are two of the original four elementary hypotheses. Note as before that rejection of H_{13} requires that either 2p_{1}<α or 2p_{3}<α by the definition of ${p}_{{H}_{13}}$. If say 2p_{1}<α, then H_{13} is rejected. Further, since 2p_{1}<α, H_{14} is also rejected as ${p}_{{H}_{14}}=min\{1,2{p}_{{H}_{1}},2{p}_{{H}_{4}}\}=min\{2{p}_{1},2{p}_{{H}_{4}}\}<\alpha $. Most importantly, 2p_{1}<α provides for H_{1} to be rejected, as the closed test p-value ${p}_{{H}_{1}}$ requires only p_{1}<α which is certainly satisfied if 2p_{1}<α. Hence, in this case, the rejection of the intersection hypothesis H_{13} implied rejection of the elementary hypothesis H_{1}. A similar scenario holds for the elementary hypothesis H_{3} if 2p_{3}<α instead of (or as well as) 2p_{1}<α. Finally, rejection of H_{24} similarly implies rejection of either H_{2} or H_{4}. Thus, the closed test procedure for these restricted hypotheses admits the shortcut of [19] because of the consonance of the closed test with respect to the elementary hypotheses, i.e. the closed test is naturally consonant.
We now extend Theorem 1 of [20] to restricted hypotheses, and thereby verify the appropriateness of the graphical shortcut of [19] for restricted hypotheses. To this end, let m elementary hypotheses H_{1},...,H_{ m } of interest be given and denote by their closure under intersection. For the purposes of Theorem 1, can be either restricted or unrestricted. Let α_{ i }(I) denote the local significance levels for an intersection hypothesis ${H}_{i}\in \mathcal{H}$ where $\underset{i\in I}{\Sigma}{\alpha}_{i}\le \alpha $ for all non-empty I⊆{1,...,m}.
Theorem1.
- 0.
Set M={1,...,m}.
- 1.
Set I equal to the smallest subset of M such that H _{ I }=H _{ M }.
- 2.
Reject H _{ j } if there exists j∈I such that p _{ j }≤α _{ j }(I). If no such j exists, then stop.
- 3.
Set M→M\j.
- 4.
If |M|≥1 return to Step 1. Otherwise, stop.
Proof.
First, note that in the case of unrestricted hypotheses, natural consonance and consonance are identical [28] so that the proof is already demonstrated in Theorem 1 of [20]. Consider then the case of restricted hypotheses in the sense that for ∅≠I,J⊆{1,...,m} with I≠J it is true that ∅≠H_{ I }=H_{ J } so that $\left|\mathcal{H}\right|<{2}^{m}-1$. Then, for I,J with ∅≠H_{ I }⊂H_{ J } it follows from α_{ i }(I)≤α_{ i }(J) that p_{ j }≤α_{ j }(I) implies p_{ j }≤α_{ j }(J). Thus, rejection of H_{ I } implies rejection of some elementary hypothesis H_{ j }, witnessing that the closed test for is indeed naturally consonant.
Discussion of Theorem 1
Some comments are in order regarding Theorem 1. First, while an intersection hypothesis H_{ I } may not be unique in , it must not be empty for the nested shortcut of length m to exist. Second, the only difference between the proof here and the proof for unrestricted hypotheses [20] is in the definition of consonance. Here we follow the suggestion in [28] and allow natural consonance, which can be seen as a loosening of the requirements of consonance to include all elementary hypotheses instead of just all maximal hypotheses. The important distinction is that for unrestricted hypotheses, all elementary hypotheses are maximal. The same is not necessarily true for restricted hypotheses. Third, as in the previous restricted hypotheses example, restricted hypotheses are often the result of nested elementary hypotheses. This is certainly the case for the hypotheses attached to the gene sets of the GO graphs. Fourth, the main importance of the extended Theorem 1 rests with its assurance that a naturally consonant closed test based on weighted Bonferroni tests exists so long as the monotonicity condition α_{ i }(I)≤α_{ i }(J) is satisfied for all ∅≠H_{ i }⊂H_{ J } in . Fifth, Theorem 1 does not specify that any graph with local thresholds of α=(α_{1},...,α_{ m }) and edge weights G={g}_{ ij }, denoted by (α,G), can combine with Algorithm 1 and lead to a consonant closed test. It simply specifies the conditions under which a consonant closed test based on local Bonferroni tests can be formed.
One important rule on the graph (α,G) when the hypotheses are restricted is that the local threshold α_{ i } for an elementary hypothesis H_{ i } must remain zero until all elementary hypotheses H_{ j } with H_{ j }⊂H_{ i } are first rejected. This property can be seen to hold for the graph of Figure 4. However, if the graph in Figure 4 allowed for any of H_{1}’s threshold to be passed to H_{4} or similarly, if H_{3} allowed for anything to be passed to H_{2}, this property would no longer hold. So, while Theorem 1 assures that a consonant closed test exists when local Bonferroni tests are used for the testing of each $H\in \mathcal{H}$, not just any graph (α,G) will result in that consonant closed test. In the following section we demonstrate how a graph (α,G) can be applied to the GO graph such that a consonant closed test based on weighted Bonferroni tests is achieved through the application of Algorithm 1.
That Algorithm 1, when applied to a graph (α,G), preserves the monotonic property that α_{ j }(I)α≤_{ j }(J) for I,J such that H_{ I }⊂H_{ J } can be seen by noting that Algorithm 1 only provides for the local thresholds α_{ i } to remain the same size or increase. Never does it allow for them to become smaller. Further, at any point in the iterative process, the local thresholds α_{ i } define the weighted Bonferroni test thresholds α_{ j }(I) for the intersection hypothesis I corresponding to the intersection of the elementary hypotheses with non-zero thresholds (see for example Figure 5). Hence, as H_{ J } will be tested only after H_{ I } is first rejected whenever H_{ I }⊂H_{ J }, it follows that Algorithm 1 will provide α_{ j }(I)≤α_{ j }(J).
The Shortfocus level procedure
We obtain the Short Focus Level procedure by modifying the top-down portion of the Focus Level method. This is done by tailoring the general graphical shortcut [19] to a GO graph as follows. Label the m hypotheses corresponding to the test of significance for each GO term (gene set) as H_{1},...,H_{ m } starting with the root node and proceeding in an organized manner through each level of the GO graph, ending with the terminal nodes. (The precise ordering is not important.) Let F⊂M={1,...,m} denote the index set of the nodes corresponding to the pre-selected focus level of the GO graph. For all m_{ F } nodes in the focus level, assign local significance levels of α_{ i }=∈/m_{ F } to each hypothesis H_{ i } with i∈F. Assign initial local significance levels of 0 to all children nodes of the focus level. Note that nodes above the focus level will still be tested using the bottom-up approach of the Focus Level method and are not considered when applying the top-down portion of the method.
Using the structure of the GO graph, assign to each edge from parent node i to child node j a weight of g_{ ij }=1/m_{ i }, where m_{ i } denotes the number of children nodes of node i. After all edge weights have been assigned for the edges defined by the GO graph, all terminal nodes are individually joined with m_{ F } new edges to each of the m_{ F } focus level nodes. These new edges are given weights of 1/m_{ F }. (In the case that a terminal node is also a focus level node, then edges are made only to all other focus level nodes with weight 1/(m_{ F }-1)).
An algorithm which implements the Short Focus Level procedure is detailed in Algorithm 2. Here, H denotes the index set of testable hypotheses (nodes) and w={w_{ i }}_{i∈H} the corresponding set of weights such that α/w_{ i } provides the local thresholds α_{ i } for each hypothesis H_{ i } indexed by i∈H. As described previously, F⊂{1,...,m} denotes the index set of all pre-selected focus level nodes. The notation C_{ i } denotes the index set of children nodes of the parent hypothesis H_{ i }. Similarly, the notations P_{ i } and A_{ i } denote the parents and all ancestors, respectively, of the node corresponding to the hypothesis H_{ i }. Finally, we use R and S to denote the index sets of the current and cumulative rejected hypotheses, respectively.
Results and discussion
A natural question at this point concerns the advantages and disadvantages of changing the top-down portion of the Focus Level procedure from the original closed test approach as in [10] to the graphical shortcut of [19] as proposed for the Short Focus Level. If the local tests for each intersection hypothesis were originally performed with weighted Bonferroni tests, then the difference between the methods would be that the first performed the full closure test requiring the testing of somewhere on the order of 2^{ m }-1 intersection hypotheses, while the second, which applies a shortcut, would test no more than m hypotheses with no reduction in the power of the tests. When using the Global Test for each intersection hypothesis as suggested by [10], the answer to the differences in computation time and power is not as clear. The following simulations demonstrate that neither method is uniformly more powerful than the other, with each having the advantage for certain scenarios. However, as these simulations demonstrate, the newly proposed Short Focus Level procedure is uniformly (and exponentially) computationally faster than the Focus Level method which will hopefully better enable its use by practitioners.
Simulation 1
The structure of gene assignments to the sets A, B, C, D, E, and F of Figure 4, as well as the total number of genes assigned, was allowed to vary in each simulation according to certain parameters. Genes were first assigned to the leaf node gene sets C, D, and E. This was accomplished by randomly selecting both the number of distinct sets in each of these sets (anywhere from 1 to a maximum specified size of either 10 or 40) as well as the number of genes shared by all possible combinations of the leaf node gene sets. Common genes between all or many gene sets was discouraged with small probabilities of occurrence, while common genes between a few gene sets was allowed to occur more frequently. Following the assignments of genes to leaf nodes, parent nodes were randomly assigned new genes (anywhere from 1 to the maximum specified size) as well as all genes contained by their children nodes. The result was a nested graph with at least some overlap common to many gene sets, as is the case within GO Graphs.
The data counts matrix X was simulated using an actual RNA-Seq data set as a sampling distribution for the per-gene means in the control group. Specifically, the counts k_{ ij } for all samples j assigned to the control group were generated from a NB(${\mu}_{i},{\mu}_{i}+{\mu}_{i}^{2}/d$) distribution, where the means μ_{ i } were randomly sampled from the per-gene means of the control group from the actual RNA-Seq data set. The scaling parameter d was set at 10 for all simulations. Leaf node gene sets (any of nodes C, D, or E in Figure 4) were then selected at random to be significant. Each gene mapping to the selected significant leaf nodes was assigned a treatment mean of where μ_{ i } denotes the mean sampled from the actual RNA-Seq data for gene i and β_{ i } was an effect size obtained from a Poisson(λ) distribution with the parameter λ set to one of 0, 1, 2, or 3. Thus, not all genes in the significant gene sets necessarily had non-zero effect sizes. The actual counts k_{ ij } for all samples j assigned to the treatment group were obtained from a NB( , ) distribution where, as with the control group, d=10 was constant across all simulations. (See [13] for a similar simulation approach where single gene sets were the object of interest as opposed to an entire GO graphs as in this simulation).
Summary of results for Simulation 1
Mean | ||||||||
---|---|---|---|---|---|---|---|---|
Node | computation | |||||||
n | Method | A | B | F | C | D | E | time (sec) |
5 | FL | 0.447 | 0.428 | 0.132 | 0.142 | 0.135 | 0.130 | 0.426134 |
SFL | 0.447 | 0.366 | 0.120 | 0.092 | 0.083 | 0.122 | 0.001778 | |
20 | FL | 0.574 | 0.567 | 0.180 | 0.186 | 0.192 | 0.179 | 0.102097 |
SFL | 0.574 | 0.552 | 0.178 | 0.184 | 0.188 | 0.179 | 0.001789 | |
100 | FL | 0.642 | 0.635 | 0.202 | 0.220 | 0.207 | 0.201 | 0.355848 |
SFL | 0.642 | 0.623 | 0.201 | 0.217 | 0.204 | 0.201 | 0.001793 |
;Simulation 2;
Allocation of simulated genes to the GO IDs of the GO graph in Figure 8
GO ID | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
---|---|---|---|---|---|---|---|
Genes | 1-100 | 1-40 | 21-60 | 61-100 | 1-10 | 11:20 | 21:40 |
GO ID | 8 | 9 | 10 | 11 | 12 | 13 | 14 |
Genes | 41-50 | 51-60 | 61-80 | 71-90 | 81-100 | 72-79 | 82-89 |
Results of the power simulation for the GO graph in Figure 8
GO:01 | GO:02 | GO:03 | GO:04 | GO:06 | GO:07 | GO:10 | GO:11 | GO:13 | Time | |
---|---|---|---|---|---|---|---|---|---|---|
FL | 0.995 | 0.968 | 0.890 | 0.462 | 0.512 | 0.872 | 0.380 | 0.399 | 0.344 | 3:42:938 |
SFL | 0.995 | 0.988 | 0.952 | 0.543 | 0.837 | 0.949 | 0.489 | 0.476 | 0.445 | 0:00:015 |
Key difference between focus level and short focus level methods
The Focus Level (FL) and Short Focus Level (SFL) methods are the same in the bottom-up approach. They differ in the top-down approach. Both are similar in the top-down approach in that they apply the closed testing strategy to the GO graph (from the focus level down to the leaf nodes) by closing the graph under all unions of gene IDs corresponding to the GO IDs (i.e., intersections of the hypotheses corresponding to the testing of each GO ID). The FL method uses the Global Test to test each intersection hypothesis of the closed GO graph[10]. The SFL method would be a direct shortcut of the FL method if the FL method instead used a weighted Bonferroni test to test each intersection hypothesis.
However, it is important to recognize that the FL method could only perform the full closure test, not the short-cut that the SFL method performs, even if the FL method was modified to use the weighted Bonferroni tests. The FL method is more consistent in applying the same Global Test to both original GO ID hypotheses as well as to all intersection hypotheses. However, it is computationally expensive because it performs the full closure test. The SFL method makes a slight shift in allowing any test (not just the Global Test) for the elementary hypotheses (the individual GO ID hypotheses) and then performing weighted Bonferroni tests for all intersection hypotheses. This simplification or generalization allows for the resulting short-cut. Hence, the power comparison between the FL and SFL methods is not obvious, and the simulations above show that neither method is uniformly more powerful than the other.
Real data application 1
A drawback to the otherwise powerful Focus Level method is the computational burden which prohibits the full top-down approach from being applied to real data sets [10]. When no a priori focus level exists, as is often the case [18], the root node of the GO graph is a logical default choice, but requires the full top-down approach. Under the newly proposed Short Focus Level method, this is now a computational possibility. The following application to RNA-Seq counts data from porcine oocytes demonstrates the performance of the full top-down approach of the Short Focus Level procedure to real data. The Biological Process (BP) root node was selected as the focus level for this study due to there being no other focus level of greater a priori interest.
Experimental design for the in vivo (IVV) and in vitro (IVM) oocyte maturation RNA-seq data
Oocyte no. | Embryo type | Pig (Mother) |
---|---|---|
1 | IVV | 1 |
2 | IVV | 2 |
3* | IVV | 3 |
4 | IVM | 3 |
5 | IVM | 1 |
6 | IVM | 2 |
7 | IVM | 4 |
8 | IVV | 3 |
Individual P-values testing the differential expression of 12,625 genes were calculated using the DESeq package of Bioconductor [33],[34] with pig mother, as identified in Table 4, included as a covariate. Specifically, these P-values were obtained under the null hypotheses that the per-gene expression strength of the in vivo maturated oocytes (IVV) is equal to that of the in vitro maturated oocytes (IVV) when accounting for any pig mother effect. This was done through the DESeq package [33] which compares a full model (regressing the RNA-Seq counts on both the oocyte type and pig mother by a generalized linear model) to a reduced model (regressing only on the pig mother) to determine significance for a given gene.
A gene set analysis using the GO BP ontology was then performed to characterize differentially expressed gene products between the two types of oocytes (IVV and IVM). P-values for each of 5,687 BP GO Terms containing at least 5 of the 12,625 Entrez IDs from the single gene (DESeq) analysis were calculated using Stouffer’s Method [24],[35]. The R code [21] for Stouffer’s Method is included in the mvGSTmvGST package [22],[23]; see also Additional file 1. Briefly, Stouffer’s method transforms each of the P-values (from the single gene analysis) corresponding to an individual gene in the gene set to a standard normal Z-score. A single P-value for the gene set is then obtained from the mean of the Z-scores by computing the appropriate tail probability (from a standard normal distribution) beyond the mean Z-score. Stouffer’s P-value combination method was applied here as it is more powerful for the consensus alternative than say Fisher’s P-value combination test [36] or Goeman’s globaltest [6], see discussions in [24]. Finally, multiplicity adjusted gene set P-values for each BP term were calculated using the Short Focus Level procedure, with the root BP GO term (GO:0008510) as the focus level. This adjustment (the full top-down approach) took just 3 minutes and 23 seconds of processing time on an Intel Pentium M 1.86 GHz processor with 1 GB of RAM. The current Focus Level method is computationally intractable for thesedata.
Real data application 2
As a second real data demonstration of the Short Focus Level method, and to further discuss differences between the Short Focus Level and Focus Level methods, we used a subset of the famous Golub data set [37], specifically the 38 training samples publicly available in the R package golubEsets [38]. Briefly, Affymetrix Hgu6800 chips were used to profile gene expression in leukemia patients, and we test for differential biological process activity between 27 patients with acute lymphoblastic leukemia (ALL) and 11 with acute myeloid leukemia (AML). Using the same Focus Level method demonstration as in the vignette of the globaltest package [39], we looked only at biological process “cell cycle” (GO:0007049) and its descendants in the GO graph, with 249 total nodes.
We also applied the Short Focus Level method to the same example (using familywise error rate 0.01), and note that while the Focus Level method (with its default focus level) takes nearly 4 minutes, the Short Focus Level method (using the same focus level as selected by the Focus Level method) takes 1 second. Finally, we applied the Short Focus Level method using the root node as the focus level, which also took 1 second. In stark contrast, the Focus Level method using the root node as the focus level was deemed computationally intractable, as even a run time of two weeks was not sufficient to complete it.
As discussed in the original Focus Level paper [10], different focus levels provide for different power at differing areas of the GO graph. For this reason, it is difficult to make definitive comparisons using results from different focus levels. While the stronger agreement seen between the FL method (with its default focus level) and SFL method (with the root node focus level) in Figure 10 may seem interesting, the important point is that the FL method is effectively computationally intractable using the root node focus level. The decision to use the FL or SFL method should not be based on power considerations, but rather on computational considerations, especially when no real reason exists to choose the focus level anywhere other than the root node (which will most often be the case), in which case the FL method is computationally intractable. However, the SFL method is computationally efficient and strongly controls the familywise error rate within the structure of the GO graph.
Conclusions
As pointed out in [10], the GO graphs are structured and “it is wasteful not to make use of that structure” in correcting for multiplicity. Further, they stress the importance of not making any assumptions on the joint distribution of the test statistics corresponding to each of the gene sets in the GO graph while correcting for multiplicity. The Focus Level procedure both avoids any such assumptions and capitalizes on the inherent structure of the GO graph to adjust for the multiple tests performed, resulting in a powerful approach. Another advantage of the Focus Level method is the possibility of incorporating biological knowledge into the adjustment approach through the selection of the focus level, where the method has the greatest power.
This work improves upon the Focus Level procedure of [10] by altering the top-down portion of the method to utilize the graphical shortcut of [19] in place of the full closed testing approach of [17] as originally suggested by [10]. This was made possible by extending the result from [20] to restricted hypotheses (Theorem 1) as the hypotheses corresponding to the GO graph are always restricted.
The main advantage of the Short Focus Level procedure proposed in this work is the exponential decrease in computational burden. This provides for the most logical default choice of the root node of the GO graph as the focus level when no other a priori choice can be specified. Another advantage of the improvement is in the ability to consider the adjusted P-values apart from their context within the significant subgraph of the full GO graph under the full top-down approach. When the focus level is selected to be anything other than the root node, individual hypotheses must be considered in context of their position within the significant subgraph. However, this is not altogether a disadvantage as “the interpretation of an individual adjusted P-value should depend on the location in the graph where it occurs” [10].
It is our hope that this shortcut for the Focus Level procedure, the Short Focus Level, will result in more wide-spread use of the method. Still, future work remains to be done. The simulations performed within this work demonstrate that each approach appears to be more powerful under different circumstances. Hence, further theoretical work is needed to determine the conditions under which each method is most powerful.
Additional files
Declarations
Acknowledgements
This research was supported by the Utah Agricultural Experiment Station (UAES), Utah State University, and approved as journal paper number 8676, with funding as UAES project number UTA01062, associated with the W2112 multi-state project “Reproductive Performance in Domestic Ruminants”.
Authors’ Affiliations
References
- Malone JH, Oliver B: Microarrays, deep sequencing and the true measure of the transcriptome. BMC Biol. 2011, 9: 34-10.1186/1741-7007-9-34.View ArticlePubMed CentralPubMedGoogle Scholar
- Lee CK, Klopp RG, Weindruch R, Prolla TA: Gene expression profile of aging and its retardation by caloric restriction. Science. 1999, 285: 1390-1393. 10.1126/science.285.5432.1390.View ArticlePubMedGoogle Scholar
- Alizadeh AA, Eisen MB, Davis RE, Ma C, Lossos IS, Rosenwald A, Boldrick JC, Sabet H, Tran T, Yu X, Powell JI, Yang L, Marti GE, Moore T, Lu L, Lewis DB, Tibshirani R, Sherlock G, Chan WC, Greiner TC, Weisenburger DD, Armitage JO, Warnke R, Levy R, Wilson W, Grever MR, Byrd JC, Botstein D, Brown PO, et al: Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature. 2000, 403 (6769): 503-511. 10.1038/35000501.View ArticlePubMedGoogle Scholar
- Isom SC, Stevens JR, Li R, Spollen WG, Cox L, Spate LD, Murphy CN, Prather RS: Transcriptional profiling by RNA-seq of peri-attachment porcine embryos generated by a variety of assisted reproductive technologies. Physiol Genomics. 2013, 45 (14): 577-589. 10.1152/physiolgenomics.00094.2012.View ArticlePubMed CentralPubMedGoogle Scholar
- Miller JA, Woltjer RL, Goodenbour JM, Horvath S, Geschwind GH: Genes and pathways underlying regional and cell type changes in alzheimer’s disease. Genome Med. 2013, 5 (5): 48-10.1186/gm452.View ArticlePubMed CentralPubMedGoogle Scholar
- Goeman JJ, van de Geer SA, de Kort F, van Houwelingen HC: A global test for groups of genes: testing association in a clinical outcome. Bioinformatics. 2004, 20: 93-99. 10.1093/bioinformatics/btg382.View ArticlePubMedGoogle Scholar
- Efron B, Tibshirani R: On testing the significance of sets of genes. Ann Appl Stat2007:107-129.Google Scholar
- Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, Harris MA, Hill DP, Issel-Tarver L, Kasarskis A, Lewis S, Matese JC, Richardson JE, Ringwald M, Rubin GM, Sherlock G: Gene ontology: tool for the unification of biology. Nat Genet. 2000, 25: 25-29. 10.1038/75556.View ArticlePubMed CentralPubMedGoogle Scholar
- Passtoors WM, Boer JM, Goeman JJ, Akker EB, Deelen J, Zwaan BJ, Scarborough A, van der Breggen R, Vossen RH, Houwing-Duistermaat JJ, Ommen GJ, Westendorp RG, van Heemst D, de Craen AJ, White AJ, Gunn DA, Beekman M, Slagboom PE: Transcriptional profiling of human familial longevity indicates a role for ASF1A and IL7R. PLOS ONE. 2012, 7: [E27759]-10.1371/journal.pone.0027759.View ArticleGoogle Scholar
- Goeman JJ, Mansmann U: Multiple testing on the directed acyclic graph of gene ontology. Bioinformatics. 2008, 24 (4): 537-544. 10.1093/bioinformatics/btm628.View ArticlePubMedGoogle Scholar
- Goeman JJ, Buhlmann P: Analyzing gene expression data in terms of gene sets: methodological issues. Bioinformatics. 2007, 23 (8): 980-987. 10.1093/bioinformatics/btm051.View ArticlePubMedGoogle Scholar
- Khatri P, Drăghici S: Ontological analysis of gene expression data current tools, limitations, and open problems. Bioinformatics. 2005, 18: 3587-3595. 10.1093/bioinformatics/bti565.View ArticleGoogle Scholar
- Fridley BL, Jenkins GD, Biernacka JM: Self-contained gene-set analysis of expression data: an evaluation of existing and novel methods. PLoS One. 2010, 5 (9): [E12693]-10.1371/journal.pone.0012693.View ArticleGoogle Scholar
- Nettleton D, Recknor J, Reecy JM: Identification of differentially expressed gene categories in microarray studies using nonparametric multivariate analysis. Bioinformatics. 2008, 24 (2): 192-201. 10.1093/bioinformatics/btm583.View ArticlePubMedGoogle Scholar
- Hochberg Y, Tamhane AC: Multiple Comparison Procedures, 1st edition. 1987, Wiley, New YorkView ArticleGoogle Scholar
- Holm S: A simple sequentially rejective multiple test procedure. Scand J Stat. 1979, 6: 65-70.Google Scholar
- Marcus R, Peritz E, Gabriel KR: On closed testing procedures with special reference to ordered analysis of variance. Biometrika. 1976, 63 (3): 655-660. 10.1093/biomet/63.3.655.View ArticleGoogle Scholar
- Liang K, Nettleton D: A hidden Markov model approach to testing multiple hypotheses on a tree-transformed gene ontology graph. J Am Stat Assoc. 2010, 105 (492): 1444-1454. 10.1198/jasa.2010.tm10195.View ArticleGoogle Scholar
- Bretz F, Maurer W, Brannath W, Posch M: A graphical approach to sequentially rejective multiple test procedures. Stat Med. 2009, 28 (4): 586-604. 10.1002/sim.3495.View ArticlePubMedGoogle Scholar
- Hommel G, Bretz F, Maurer W: Powerful short-cuts for multiple testing procedures with special reference to gatekeeping strategies. Stat Med. 2007, 26 (22): 4063-4073. 10.1002/sim.2873.View ArticlePubMedGoogle Scholar
- R: A Language and Environment for Statistical Computing. 2013, R Foundation for Statistical Computing, ViennaGoogle Scholar
- Mecham DS: mvGST: Tools for Multivariate and Directional Gene Set Testing. 2014, Department of Mathematics and Statistics, [MS Report]. Logan: Utah State UniversityGoogle Scholar
- Stevens JR, Mecham DS: mvGST: Multivariate and directional gene set testing2014. [R package version 1.0.0]. [], [http://www.bioconductor.org/packages/release/bioc/html/mvGST.html]
- Stevens JR, Isom SC: Gene set testing to characterize multivariately differentially expressed genes. Proc Conf Appl Stat Agric. 2012, 24: 125-137.Google Scholar
- Hummel M, Meister R, Mansmann U: Global ANCOVA: exploration and assessment of gene group effects. Bioinformatics. 2008, 24: 78-85. 10.1093/bioinformatics/btm531.View ArticlePubMedGoogle Scholar
- Tomfohr J, Lu J, Kepler TB: Pathway level analysis of gene expression using singular value decomposition. BMC Bioinformatics. 2005, 6: 225-10.1186/1471-2105-6-225.View ArticlePubMed CentralPubMedGoogle Scholar
- Dinu I, Potter JD, Mueller T, Liu Q, Adewale AJ, Jhangri GS, Einecke G, Famulski KS, Halloran P, Yasui Y: Improving gene set analysis of microarray data by SAM-GS. BMC Bioinformatics. 2007, 8: 242-10.1186/1471-2105-8-242.View ArticlePubMed CentralPubMedGoogle Scholar
- Brannath W, Bretz F: Shortcuts for locally consonant closed test procedures. J Am Stat Assoc. 2010, 105 (490): 660-669. 10.1198/jasa.2010.tm08127.View ArticleGoogle Scholar
- Goeman JJ, Solari A: The sequential rejection principle of familywise error control. Ann Stat. 2010, 38 (6): 3782-3810. 10.1214/10-AOS829.View ArticleGoogle Scholar
- Gabriel KR: Simultaneous test procedures’some theory of multiple comparisons. Ann Math Stat. 1969, 40: 224-250. 10.1214/aoms/1177697819.View ArticleGoogle Scholar
- Cox L, Ward A, Saunders G, Stevens JR, Isom SC: Gene expression analysis ofin vivo-andin vitro-matured porcine metaphase II oocytes. Reprod Fertil Dev. 2013, 26 (1): 117-117. 10.1071/RDv26n1Ab6.View ArticleGoogle Scholar
- Wang Z, Gerstein M, Snyder M: RNA-Seq: a revolutionary tool for transcriptomics. Nat Rev Genet. 2009, 10: 57-63. 10.1038/nrg2484.View ArticlePubMed CentralPubMedGoogle Scholar
- Anders S, Huber W: Differential expression analysis for sequence count data. Genome Biol. 2010, 11 (10): R106-10.1186/gb-2010-11-10-r106.View ArticlePubMed CentralPubMedGoogle Scholar
- Gentleman RC, Carey VJ, Bates DM, Bolstad B, Dettling M, Dudoit S, Ellis B, Gautier L, Ge Y, Gentry J, Hornik K, Hothorn T, Huber W, Iacus S, Irizarry R, Leisch F, Li C, Maechler M, Rossini AJ, Sawitzki G, Smith C, Smyth G, Tierney L, Yang JY, Zhang J: Bioconductor: open software development for computational biology and bioinformatics. Genome Biol. 2004, 10: R80-10.1186/gb-2004-5-10-r80.View ArticleGoogle Scholar
- Stouffer S, Suchman E, DeVinney L, Star S, Williams RJ: The American Soldier. 1949, Princeton University Press, PrincetonGoogle Scholar
- Fisher RA: Statistical Methods for Research Workers. 1973, Hafner Publishing Company; (1st ed. in 1925) 14th edition, New YorkGoogle Scholar
- Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, Coller H, Loh ML, Downing JR, Caligiuri MA, Bloomfield CD, Lander ES: Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science. 1999, 286: 531-537. 10.1126/science.286.5439.531.View ArticlePubMedGoogle Scholar
- Golub T: golubEsets: exprSets for Golub leukemia data2014. [R package version 1.6.0]., [http://www.bioconductor.org/packages/release/data/experiment/html/golubEsets.html]
- Goeman JJ, Oosting J: globaltest2014. [R package version 5.18.0]. [], [http://www.bioconductor.org/packages/release/bioc/html/globaltest.html]
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.