### Overview

FUNC is a set of four command line tools that allow the analysis of a set of genes with respect to their annotation (see Figure 1 for a schematic overview). It is particular useful when analyzing ontological annotations such as provided by the Gene Ontology consortium [1] or eVOC [26], but can be easily adapted to any other annotation. Each of the command line tools performs a specific statistical test on a certain type of input data. The user can select the ontology or subtree of an ontology on which the test should be performed, and restrict the tested categories to those containing some minimum number of genes. For each category in the ontology the statistical test of the used tool is performed resulting in the "raw p-value" for that category. Since many categories are tested and the tests are inter-dependent in a complex manner, FUNC compares the test results to results obtained from random datasets in which the gene-associated variables are permuted. This specifies the null hypotheses, namely that there is independency between the associated variable and the annotation of the genes. These random sets are used to calculate for each category, i.e. each raw p-value cut-off, two corrected p-values: a resampling-based false discovery rate (FDR) [27] and the family-wise error rate (FWER) [28]. The FDR is an estimate of the proportion of categories that are false positives among all the categories with a raw p-value equal or lower than the given category. The FWER is the estimated probability that at least one false positive category exists among all the categories with a raw p-value equal or lower than the given category. In addition, FUNC compares the distribution of raw p-values of the random sets with the distribution of raw p-values of the data set in order to obtain an overall significance p-value – a Kolmogorov-Smirnov-type test against a single null hypothesis stating that the gene associated variables are randomly distributed across all the categories. This global test statistic is useful since it is expected to be sensitive even if a weak signal is distributed among many categories. The overall significance value can be used to decide whether the data set is at all differently distributed from random. If this is the case, one can then pick a FDR to decide which of the categories deviate significantly. This procedure might be preferable to a-priori selection of an FDR or FWER significance level or changing them post-hoc.

The output of FUNC is a summary of the above-mentioned statistics as well as a table listing the analysed categories and the associated raw and corrected p-values. After picking a desirable p-value cut-off, the user can run the refinement algorithm to identify those significant categories that provide the most concise information, i.e. to identify those categories whose significance does not depend solely on significant descendant categories.

### Category tests

Each of the four FUNC tools is designed for one of the possible category tests: hypergeometric test, Wilcoxon rank test, binomial test and 2 × 2 test (Table 1). For each test, two p-values for both sides of the test statistic are calculated, which allows the detection of an enrichment or a depletion of gene-associated variables among the categories. A detailed description of the algorithms used can be found in the Supplement. Briefly, the hypergeometric test takes a binary variable (e.g. "0" and "1") that is associated with each gene (e.g. "1" for differentially expressed and "0" for equally expressed) and uses the hypergeometric distribution to calculate for each category the probability to "draw" this many or more (respectively this many or less) differently expressed genes from the top category of the subtree selected by the user.

The Wilcoxon rank test differs from this scheme only in that it takes a floating point variable instead of a binary variable and compares the ranks of the genes in the tested category with the ranks of the remaining genes in the top category. This test is useful when it is not possible to clearly classify genes to two distinct classes – as is often the case in microarray experiments. This kind of test has also been used previously in the comparison between the human and the chimpanzee genome to identify GO categories, which contain an excess of fast or slowly evolving genes [25].

Whereas the hypergeometric and Wilcoxon rank tests compare the distribution of one gene-associated variable among categories, the "binomial test" compares the ratio of two gene-associated counts among categories. Each gene is associated with two counts, and the test determines whether the ratio of these counts in a category is significantly different from the ratio in the top category. The binomial test has been used to identify categories containing more amino acid changes on the human lineage compared to the number of changes on the chimpanzee lineage, and to identify categories that have a higher than expected number of amino acid changes between human and chimpanzee compared to changes between mouse and rat [25](see also the example below). This test might also be useful when comparing counts of expressed sequence tags e.g. from two SAGE (Serial Analysis of Gene Expression) libraries [29].

The fourth test takes a 2 × 2 table as the associated gene variables, sums them over each category, and uses a Fisher's exact test or a chi-square test (if all values in the tested category are larger then ten) to test whether the two properties (in rows and columns, each in two states) are independent of each other. Note that in contrast to the other three tests, the calculated p-value is not dependent on an expectation taken from the top category. This test can be useful to conduct a McDonald-Kreitman type test [24] on GO categories. A McDonald-Kreitman type of test compares the number of fixed substitutions and the number of polymorphisms at two classes of sites, such as synonymous and non-synonymous sites. An excess of fixed non-synonymous substitutions can indicate the action of positive selection, whereas an excess of non-synonymous polymorphisms can indicate the presence of slightly deleterious amino acid variants (for a review see [30]). The 2 × 2 contingency table test implemented in FUNC calculates two separate p-values to test for an excess of non-synonymous substitutions and an excess of non-synonymous polymorphisms, respectively [see Additional file 1]. The availability of a large (and largely unbiased) genome-wide measurement of polymorphisms in humans and other species together with the already available data on substitutions should make this test very useful in the near future.

It is important to keep in mind that for all the four tests the power to reject the null hypothesis differs among different categories since categories differ in their amount of genes and/or their amount of gene associated counts. Hence, the category with the biggest size effect is not necessarily the most significant category and vice versa (see also [31] and reply). Also note, for the binomial test and the 2 × 2 contingency table test, that the null hypothesis that FUNC tests is that the genes and not the gene associated counts are a random sample in a category. As a consequence the raw p-values of these two tests should be considered more like an arbitrary test statistic which is compared to the distribution of p-values obtained by permuting genes and not single gene counts (see also example below).

### Correction for multiple testing

When many hypotheses are tested at the same time it is expected that a number will appear significant even if all the null hypotheses hold. Therefore, in order to confidently reject some of the null hypotheses, it is necessary to correct for multiple testing. The types of large-scale genomic experiments that have become possible during recent years, in particular microarrays, have revived interest in different statistical methods that deal with the issue of multiple testing (e.g. [13, 15]). The issue is somewhat complex, since (1) the tests are interdependent in a complex manner, (2) the power of each single test is often low, (3) more than one of the tested hypotheses are usually truly not null and (4) rejected hypotheses can be regarded as candidates for additional tests, so that less conservative significance levels can be tolerated for an increase in power. All of these issues are also relevant in the context described here, in particular the complex interdependency of the tests. To overcome the interdependency, we chose to use permutations, i.e. the randomization of gene-associated variables, in order to model the distribution under the null hypothesis that gene associated variables are independent of the gene annotation. This permuted data, the random sets, can be used to estimate the family wise error rate, i.e. the probability that among the tests that are declared significant one or more are false positives [28]. This approach is more powerful than the conservative Bonferroni correction and has, in the context of GO analyses programs, been implemented for example in FuncAssociate [10] and is also implemented in FUNC. For any of the four tests described above one can calculate a raw p-value cut-off for which the FWER is e.g. 5%. This approach is, however, very strict, and a certain (known) fraction of false positives among the significantly labelled tests can often be tolerated for an increase of power. This is the reason why the false discovery rate (FDR) has gained popularity within the genomic community. The FDR is (loosely) defined as the expected proportion of falsely rejected hypothesis among all rejected hypotheses. Different methods exist to estimate the FDR, differing in how they treat the case that no hypothesis is rejected and in how they estimate the number of falsely rejected hypotheses (see e.g. [14, 15, 32] for a discussion). Several programs that analyse functional annotations have implemented FDR methods (e.g. [5, 9, 16–21]), most often using the procedure of Benjamini and Hochberg [12]. In FUNC we implemented a similar method by Yekueteli and Benjamini that is well-suited for resampling methods [27]. Although it has been shown that the method also works well under positive dependency among the tests [33] and an easy (conservative) correction method can be used under different kinds of dependencies [33], it is not entirely clear whether it is well-suited for the kind of strong dependencies that exist among the functional annotations. Further, it is noteworthy that the current methods for computing the FDR are strictly valid only under the assumption of subset pivotality, i.e. under the assumption that the significance of one test does not depend on the significance of another test. However, this assumption is violated for the hypergeometric test, the Wilcoxon rank sum test and the binomial test, since the expectation for each category is derived from the top category, which includes the other categories tested. Hence, if one or more categories truly deviate from the null hypothesis, this influences the null expectation of other categories. However, there is often no reasonable alternative for estimating the expectations independently. In practice, all these issues will not be relevant for data sets which deviate considerably from a random functional annotation. But for data sets with a weak signal, the FDR might not qualify as a well-suited measurement to determine whether there is any (or no) indication for a non-random distribution of the gene-associated variables among functional categories. Therefore, in addition to estimating the FDR and FWER rate of the data set, we developed a method to test directly the "global null hypothesis" that gene-associated variables are randomly distributed among categories.

### Testing the global null hypothesis

As was pointed out above, when the strength of the deviation may be weak, it is useful to test whether the data shows any deviation from the random distribution before embarking on finding out which categories deviate from others. For this we calculate a p-value to test the null hypothesis that the gene-associated variables are randomly distributed among all categories. This is done by looking over all possible cut-offs of the raw p-values between 0 and 0.05, and for each of them calculating the proportion of random sets that have as many or more categories showing this many or fewer raw p-values. Then, in a similar fashion to the Kolmogorov-Smirnov test, we find the cut-off value for which the deviation from the random sets is maximal. We then do the same test (finding the cut-off with maximal distance) to every one of the random sets. The overall p-value is then determined by calculating the proportion of random sets that have the same or a larger maximal distance (see Additional file 1 and Figure 2). If this p-value is low (e.g. smaller than 0.05), one can reject the null hypothesis that the gene-associated variable(s) are distributed independently of their functional annotations.

This measure should be especially helpful, when the signal is weak and/or is distributed among many categories and should be more sensitive than an FDR estimate (see example below). The FDR can be interpreted using an analogy of how much money one would be willing to waste. Testing the global null hypothesis can determine whether it is worth spending any money in the first place, and the FDR can subsequently be used to estimate how much money one is willing to waste.

### Refinement

Once one is confident that categories showing enrichment or depletion of the gene-associated variable(s) exist, and after choosing a suitable raw p-value as a cut-off, based on a certain FDR or FWER, it is useful to specify the deviation as precisely as possible. This means one wants to exclude categories that are significant solely because they contain significant descendant categories. The refinement algorithm starts from the leaves (i.e. the most specific annotation), recursively removes the genes annotated in significant descendant categories, and tests the remaining genes in a significant parent category again (Figure 3). In this way the list of all significant categories can be limited to the most specific categories, which make the results more interpretable and manageable. This algorithm is similar to the *elim* algorithm proposed recently [34]. However, in contrast to the *elim* or the related *weight* algorithm [34] we interpret the results of the refinement like a *post hoc* test. Consider a hypothetical example where the gene associated variables are significantly overrepresented in the category carbohydrate metabolism, which is due to an overrepresentation in the two descendant categories glycolysis and tricarboxylic acid cycle. It is true regardless of the refinement that genes annotated in carbohydrate metabolism are overrepresented in the data set. That genes annotated in glycolysis and tricarboxylic acid cycle are overrepresented is just the more specific statement. We find it helpful and transparent to distinguish between significant categories and the most specific significant categories and hence find it useful to separate these two analyses.