Comparison of lists of genes based on functional profiles

Background How to compare studies on the basis of their biological significance is a problem of central importance in high-throughput genomics. Many methods for performing such comparisons are based on the information in databases of functional annotation, such as those that form the Gene Ontology (GO). Typically, they consist of analyzing gene annotation frequencies in some pre-specified GO classes, in a class-by-class way, followed by p-value adjustment for multiple testing. Enrichment analysis, where a list of genes is compared against a wider universe of genes, is the most common example. Results A new global testing procedure and a method incorporating it are presented. Instead of testing separately for each GO class, a single global test for all classes under consideration is performed. The test is based on the distance between the functional profiles, defined as the joint frequencies of annotation in a given set of GO classes. These classes may be chosen at one or more GO levels. The new global test is more powerful and accurate with respect to type I errors than the usual class-by-class approach. When applied to some real datasets, the results suggest that the method may also provide useful information that complements the tests performed using a class-by-class approach if gene counts are sparse in some classes. An R library, goProfiles, implements these methods and is available from Bioconductor, http://bioconductor.org/packages/release/bioc/html/goProfiles.html. Conclusions The method provides an inferential basis for deciding whether two lists are functionally different. For global comparisons it is preferable to the global chi-square test of homogeneity. Furthermore, it may provide additional information if used in conjunction with class-by-class methods.


Background
With the advent of genomic technologies it has become possible to perform, in a routine manner, experiments which simultaneously analyze the behavior of thousands of genes or proteins in different conditions. A common feature of these studies is that they generate huge quantities of data, which has led to the term "high-throughput" to describe them. There are different types of high-throughput experiments, but here we will refer specifically to the most well known: microarray experiments [1][2][3]. A typical differential expression study aims to identify genes that are differentially expressed under two or more conditions; for instance, healthy (or untreated or wild-type) cells compared to tumor (or treated or mutant) cells. Such experiments often result in long lists of genes which have been selected using a given criterion (for instance a moderated t-test followed by a p-value adjustment) to assign them statistical significance.
Obtaining one or more lists of genes is only the first step; they must be interpreted in a way that makes biological sense. One common approach is to relate the genes they contain with one or more functional annotation databases, such as the Gene Ontology (GO), or Kyoto Encyclopedia of Genes and Genomes (KEGG). For simplicity we will speak only of GO classes (or categories, or nodes) but many concepts are also applicable to other annotation systems. There are many methods and models for performing this re-processing [4][5][6]. Two of the most commonly used are Gene Enrichment Analysis [7] (EA) and Gene Set Enrichment Analysis [8,9] (GSEA). This paper is mainly concerned with the EA approach.
To some extent, EA methods may be considered onesample procedures in the sense that they try to elucidate the association between a "sample" gene list (e.g. differentially expressed genes in the presence of a tumor type) taken from a given "universe" (e.g. the whole set of genes in the microarray) and a characteristic such as being annotated in a given GO class. In contrast, microarray data may also be used in a context where the goal is mainly the comparison of two (or more) "sample" gene lists ( [10][11][12][13]), such as differentially expressed genes in a sample of induced apoptotic cells vs differentially expressed genes in a sample of senescent cells. These lists may share part of their genes, but possibly not all of them. Again, the comparison is made in terms of their annotations in a functional database. A clear example of this approach is the comparison of whole experiments performed by different laboratories, possibly using different microarray technologies, whose resulting gene lists do not always coincide, see for example [14]. Similar or complementary studies that are available may be compared or even combined; thus, the goal of the analysis may be to prove difference or, conversely, to prove similarity.
The statistical model underlying EA and comparison methods based on GO class counts or hits is usually the hypergeometric-multihypergeometric distribution, together with the assumption that the gene "samples" under consideration are taken from a finite universe, e.g. the entire microarray, [15]. This leads to inferential methods mainly based on Fisher's exact test. Sometimes, the underlying model is the less realistic, but simpler to handle, binomial-multinomial distribution, under the assumption that the "samples" are taken from an infinite population. This is the basis of the chi-square approach, e.g. in [16]. In general, the binomial model provides an adequate approximation to the hypergeometric model for sufficiently large marginal frequencies.
Comparison methods typically focus on only one GO class at a time. If multiple classes are considered, the analysis is performed in a class-by-class fashion followed by a correction for multiplicity. An obvious advantage of this class-by-class approach is that classes associated with difference are readily identified. The main drawback of this approach is a loss of power. Controls for multiplicity based on strict criteria such as the family-wise error rate (FWER) unavoidably impose a loss of power, while more permissive criteria such as the false discovery rate (FDR) may be associated to an incomplete type I error control. In other words, the FDR corresponds to the expected proportion of erroneous null hypothesis rejections (false positives) among the total number of observed positives; a good FDR controlling procedure may maintain FDR below a given level, but not maintain the probability of any false positive below a given (significance) level, see for example [17][18][19]. An alternative approach is testing for difference jointly for all classes under consideration. The basis for such an approach in EA is outlined in [20] and a general approach and method is established in [21]. The obvious advantage of the global approach (only one significance test is performed) is a more strict control of type I and II errors. The main drawback is that association or difference is established with respect to a collection of classes, with no identification of those that have a greater influence. Here we present a family of hypothesis tests, and a method based on them, which perform global comparisons but also provide the possibility of combining them with a class-by-class approach, in order to identify the most significant classes.
If s denotes the number of GO classes under consideration, note that the common procedures for 2 × s frequency tables, such as the usual homogeneity chi-square test, are not appropriate as the GO classes are not mutually exclusive-in the sense that a single gene may be annotated in more than one class. Previous work, [21], established a probabilistic model for the joint distribution of gene hits in GO classes and provided a method for testing the fit of observed annotation frequencies to a given, fully-specified model, in a similar way to the classic goodness-of-fit chisquare test. Here we present an evolution of this method which, under a quite general setting, accounts for global testing of the differences between two gene samples, e.g. in an enrichment or experiment comparison context. The analysis may be performed with the objective of either "demonstrating" differences, or conversely demonstrating (near) equivalence, e.g. as an argument in support of the combination of results from two experiments. In this paper we focus in the first approach, i.e. demonstrating difference. This global analysis may be of interest by itself, or may be followed by class-by-class post-hoc comparisons, to determine which classes are more responsible for determining the associations or differences. Under this setting, the global test may provide useful information when sample sizes for specific single classes are small (while global sample size may be adequate). Its type I error level is closer to the nominal level and its power is in general greater than that of the class-by-class approach. For example, at a deep GO level (such as level 10 in the examples below) the global test may detect difference while classby-class comparisons may be unable to detect any such difference. This may suggest exploring a less specific GO level or even (as the global test provides evidence of the significance of at least one class) to choose as significant the class with the smallest p-value.

Methods
In this section we introduce our method, some notation and the global test procedure, and give a brief description of the associated R software. We conclude this section with the proof of the validity of the global test.

Decision criteria and algorithm
To complement the information provided by the global test with that from the class-by-class approach, we suggest the method illustrated in Figure 1 and described as follows: 1. At a given GO level, or for a given target set of GO classes (in one or more GO levels), perform a global comparison test. If the global test gives a non-significant result, stop the process: there is no evidence of differences for these GO classes. Otherwise, proceed to the next step. 2. Perform class-by-class testing (e.g. by means of Fisher's exact test 1 , with p-value adjustment) to identify the significant classes. If any of these tests produce significant results, stop the process: significant GO classes have been identified. Otherwise, proceed to the next step. 3. If no significant classes were found in the preceding step (but remember, the global test for differences gave a significant result), either: (a) Declare as significant the class associated with the lowest unadjusted p-value or, alternatively, (b) go back to step 2 and test for less specific GO classes, if these classes are still biologically or clinically interesting.
Step  of the full procedure is dominated by the type I error of the new global test. Thanks to the safeguard provided in step 1, step 3 may provide an extra possibility to identify truly significant classes. Admittedly, sometimes the class-by-class approach will detect one or more truly significant classes while the global test will give a negative result. But our simulation results indicate that this is a comparatively rare event, and the better power properties of the global test compared to multiple classby-class testing, particularly in presence of low annotation frequencies, in general largely compensate this small loss of sensitivity.

Notation and statistical approach
Given a set of GO classes which cut a GO graph at a given (but not necessarily unique) level-or simply a set of "interesting" classes-our approach consists of expanding the original distribution (where one gene can appear in several classes) into a new expanded distribution in which each gene belongs to one, and only one, disjoint set. This expanded distribution can be modeled as multinomial or as multihypergeometric, and standard statistical methods can be used to derive the asymptotic distribution of the counts.
We define a functional profile as the full vector of counts of the n genes in the sample in the A 1 , A 2 , ..., A s classes of a given level of an ontology-or, more generally, s classes defining a cross section of an ontology, possibly at more than one level. Since single genes may be annotated in more than one class, these counts may sum more than the total number of genes under consideration (if taken as absolute frequencies) or more than one (if taken as relative frequencies). To overcome this problem [21] introduced the concept of an expanded profile, defined as the joint frequencies of counts in the set of all possible combined GO classes, which are mutually exclusive. In other words, we distinguish between genes that are annotated exclusively in node A 1 , genes that are annotated exclusively and simultaneously in node A 1 and node A 2 , genes that are annotated exclusively and simultaneously in nodes A 1 , A 2 and A 3 , and so on. Expanded profiles should not be confused with plain ("contracted") functional profiles. Figure 2 shows the contracted and expanded profiles associated with 4 genes in 3 GO classes.
With these ideas in mind, we establish notation as follows.
The column vector of relative frequencies evaluated over a set of n genes is represented byP = (p 1· ,p 2· , . . . ,p s· ) (or P n to emphasize that it comes from a "sample" of n genes). The "dot" notationp i· is used to highlight the fact that all the genes annotated in class i (but not exclusively in it) have been counted. The term "profile" will indistinctly be used to designate the absolute frequencies, nP n or the relative frequenciesP n given n.  The symbolP (orP n ) designates an expanded profile, that is, the column vector of relative frequencieŝ P = (p 1 ,p 2 , . . . ,p s ,p 12 ,p 13 , . . . ,p (s−1)s ,p 123 , . . .) . (1) Herep i corresponds to the frequency of genes exclusively annotated in node A i ,p ij to the frequency of genes exclusively annotated in nodes A i and A j , and so on.
All these profiles are taken as sampling realizations of theoretical or population profiles, say P and Q-or P and Q for expanded profiles.
The dissimilarity between two profiles is measured in terms of their squared Euclidean distance: (2)

A new global comparison test
Suppose that we wish to compare the GO profiles of two lists of genes, A and B, of size n and m, respectively. Following [22], we note that the lists may share k genes, with three possibilities available (see Figure 3): Now, letP be the sample profile for the first list in a given ontology, andQ the corresponding profile for the second list. We havê and whereP 0 is the profile of the k common genes, andP 1 andQ 1 are the profiles of the non-common genes. Similarly,P 0 ,P 1 andQ 1 are the corresponding expanded sample profiles.
To test a null hypothesis of profile equality versus an alternative of difference, that is: we can use the fact that, if H 0 is true, approximately follows the distribution of a linear combination of s independent central chi-square-distributed random variables, each one with one degree of freedom: The details of the calculation of the b i coefficients and in general the proof of the above result are delayed to the end of this section. The result ensures the validity of the following decision criterion: "reject H 0 if V n,m >v (a, s)", where v (a, s) stands for the 1 -a quantile of the distribution of (7). Likewise, a p-value for (6) can be calculated from (7).
When the population profiles are not equal, the statistic approximately follows a normal distribution N (0, s 2 ). As a consequence, defines an approximate 1 -2a confidence interval for d (P, Q), where z a stands for the 1 -a quantile of a standard normal distribution andσ is a suitable estimate of s. Additionally, expression (8) provides a way to approximately compute the power of the above test. Again, details such as the expression of the variance s 2 are considered at the end of this section ("Mathematical details").

Software
The functionalities described in this paper, together with those in [21], have been implemented in the R package goProfiles, available from Bioconductor http://bioconductor.org/packages/release/bioc/html/goProfiles.html.
Package goProfiles uses the CRAN package Comp-QuadForm [23] to compute the distribution associated with (7). As an illustration of its use, the R commands associated with the example in the next section are available at http://estbioinfo.stat.ub.es/pubs/goProfiles1_-BIF/goProfiles1.htm.

Mathematical details
From considerations similar to those in [21] we conclude that the asymptotic distribution of (P,Q) is approximately multivariate normal. More exactly, if then where the covariance matrix Σ PQ may be estimated by: and P 0 , P 1 and Q 1 correspond to the covariance matrices associated with the respective profilesP 0 ,P 1 andQ 1 , that have the general form [21]: s ii = p i· (1 -p i· ) for i = 1, ..., s and s ij = p ij· -p i· p j· , when i ≠ j, for i, j = 1, ..., s.
From the algebraic relation: where 2s is the 2s × 2s matrix defined from the identity matrices of dimension s, I s : we have: where D n, m has been defined in (10). Consider first the case P ≠ Q. The second summand on the right-hand side of the above expression tends to zero, while, as a direct consequence of (11), the first summand in (14): is asymptotically normal: with Consider now that P = Q. Then we have From general results on the asymptotic distribution of quadratic forms with a normal basis [24], it can be deduced that is approximately distributed as a mixture of independent chi-square random variables: where the b i correspond to the eigenvalues of the matrix 2s PQ and the χ 2 1,i are independent chi-square random variables with one degree of freedom.
From (16) it follows that under the null hypothesis in (5) (i.e., when P = Q) U n,m P − → 0 . Thus, the decision criterion for (5) may be based on (20).

Results
In this section we describe two illustrative case-studies and some simulation results on the performance of the statistical methods introduced above.

Case-studies
We selected two datasets to illustrate our method. The first one is inspired by the work presented in [25], using data kindly provided by those authors. They studied the relationships between phenotypic attributes of a disease and the features of the associated genes, including their ascribed annotated functional classes and expression patterns. The sample gene lists were obtained from the ENSEMBL and OMIM databases and manually curated by the authors. They compared the functional pattern of different groups of genes: (1) genes associated with dominant diseases vs genes associated with recessive diseases, (2) genes associated with diseases vs all the genes in the human genome. The authors performed their global comparisons using chi-square tests, although they fairly point out that GO classes do not define a true partition of the gene lists or, in other words, that a gene may be annotated in more than one class. Although their conclusions and ours will be similar, we believe our method provides a more appropriate framework for such comparisons. Here we illustrate our method by comparing dominant disease-inducing genes and recessive disease-inducing genes. Table 1 shows the results of applying the global difference test to a list of 985 dominant and 818 recessive genes from the NCBI Entrez database 2 projected at the second level of the GO. Figure 4 shows plots of the profiles corresponding to the second level of the molecular function (MF) ontology for dominant and recessive genes. The results of the analysis, which are consistent with those obtained by [25], show that the two sets of genes can be considered functionally distinct with respect to the molecular function (MF) and biological process (BP) ontologies, that is to say, the related dominant and recessive diseases can be associated with different concept categories in both ontologies. With respect to the cellular component (CC) ontology, there are also statistically significant differences although they may be less biologically significant because the profiles are very similar (0.0248 distance).
According to step 1 in our general algorithm, analysis may be continued in order to identify the GO classes associated with the observed global differences. Tables 2  and 3 show the significant GO classes at level 2 for the MF and BP ontologies, respectively. The only significant class for the CC ontology is GO:0032991 with a 4.258815 × 10 -6 p-value. The p-values are based on class-by-class analyses by means of Fisher's test, followed by correction for testing multiplicity using the Holm method, [26].
These differences are also observed in deeper levels of the GO, that is, for more specific categories of molecular functions or biological processes. For illustrative purposes we briefly discuss some results at level 10. For the MF ontology, the global p-value is 0.0001307 but no significant classes are detected when class-by-class analyses are performed in the same conditions as before. However, according to step 3, the ontology class GO:0008094 (DNA-dependent ATPase activity) may be significant. For the BP ontology, a significant global result is also obtained, with a p-value of 8.639 × 10 -8 . The subsequent   Analysis at level 2. Significant MF level 2 GO classes after a class-by-class analysis based on Fisher's test and correction for multiple testing class-by-class analyses indicate the GO classes in Table 4 as significant.
In the second example we compare two microarray experiments described in [27] and [28] to study prostate tumors on the basis of gene expression data. Although the studies were performed independently, the type of tumor they analyzed, the microarray platforms (both studies were based on Affymetrix technology U95AV2) and the sample sizes were all similar; see Table 1 in [29]. Even a substantial proportion of the genes were common to both lists, which makes global comparison methods such as the chi-square test for homogeneity highly inadequate for determining the extent to which these genes represent different functional GO profiles or not. Obviously, the answer to the preceding question may depend on the level of specificity of the GO classes under consideration. The results for very general classes, at level 2 in the GO, are summarized in Table 5. There is only evidence of statistically significant differences (though possibly with little biological relevance, given the very small distances) for the CC ontology, with a 0.004 p-value. When class-by-class Fisher's tests are performed for the CC ontology (this testing step should be avoided for the globally non-significant ontologies, MF and BP) two classes (Table 6) emerge as significant. Significance for the CC ontology is maintained for more specific GO classes. When the analysis is performed at level 10, the global comparison test only produces significant results for the CC ontology, with a p-value of 0.01511. Class-by-class analyses (Table 7) reveal differences in some classes, all related to the ribosome.

Simulations
Simulations were performed in order to assess the validity of the above tests. Their true probability of rejecting the null hypothesis was estimated in different circumstances. Each simulation consisted of the generation of series of 10,000 sample profiles from hypothetical populations whose configurations were suggested (number of GO classes, sample sizes, etc.) by the observed profiles in some selected datasets and studies. The simulated Table 3   Analysis at level 10. Significant BP level 10 GO classes after a class-by-class analysis based on Fisher's test and correction for multiple testing profiles were always generated, in a first step, as "expanded profiles" according to a multinomial distribution, and subsequently converted to "contracted" profiles in order to compute the test statistics. The simulation programs were written in R [30] and executed in 64-bit R 2.12.1 under 64-bit Windows 7 Enterprise edition. An exhaustive simulation study is in process and is the subject of a forthcoming paper. Some early results of this study are available at the above address; they are fully concordant with the preliminary study described here. Table 8 shows the main results for two simulation scenarios based on [25] and [27,28] and simulating level 10 in the GO. The results in the "true H 0 " column correspond to "sample" profiles that were generated from equal "population" profiles, based on pooled data. The results in the "false H 0 " column correspond to population profiles directly taken as the observed profiles in the preceding examples (and that to a greater or lesser extent are truly different). For each simulation scenario, the following estimated quantities are displayed: • Probability of detecting at least one significant class when a class-by-class analysis with Holm's correction for multiplicity is performed, • Probability of rejecting the null hypothesis for the standard chi-square test of homogeneity • Probability of rejecting the null hypothesis for the global test based on (6) and (7), and • Additional detections of significant classes, according to step 3 of the proposed algorithm, i.e. when no significant classes are detected in a class-by-class analysis but the global test gave a significant result.
All the tests are simulated under a nominal significance level of 0.05.
The classical global chi-square test is clearly incorrect, as may be expected from the arguments in the background section. Its true significance level is very erratic, with very low but also very high values that may largely exceed the nominal level, with an observed maximum of 0.152 for the simulation based on the [25] scenario and the BP ontology.
Also as expected, the new method is at least as powerful (and in general more powerful) than a standard class-by-class analysis. The proportion of true positives that are detected by the class-by-class approach and not by the global test is very low. In the simulated scenarios it ranges from 0 to a maximum of 0.0038 in the simulation scenario inspired by [27,28] and the MF ontology. So, the possible loss in power associated to step 1 in our method is largely compensated by the greater power of the global test.

Discussion
In this work we present a method for performing global comparisons between groups of genes based on their functional profiles that is itself based on their projections at fixed GO levels, or their projections on a set of "interesting" GO classes which could even be at different levels in the ontology.
The method has been shown to perform well in the real case situations analyzed as well as in the simulation studies performed, even for very sparse frequency tables.
We noticed that it has become common practice to perform global tests (that is comparing two lists of genes) based on class-by-class analysis (declaring global differences if there is at least one significant class). Our work suggests that this approach is not appropriate because the dependence between the individual tests for each class and the objective of controlling the FDR or FWER error rates may result in a loss of power. Indeed our simulation results show that this approach yields a less powerful test than the method we present. This is not surprising since making global comparisons is not the main objective of these tests.
Another alternative to performing global tests, the classical homogeneity chi-square test, has also proven not to be valid. On the basis of aprioristic validity reasons explained above in the Background section, and specially in view of its lack of adequate type I error control, its use should be avoided as a tool for making global comparisons between profiles.
Although making a global comparison may often be the main objective of a study, especially if interest is focused on the GO classes that make the difference, the global test may work in conjunction with the usual methods to provide some extra insight. This is particularly clear when many GO classes are considered, for example for deep levels of the GO. Table 8 reflects the  Even for these cases where thousands of possible GO classes are considered, the global test still has a test size that is near the nominal significance level while at the same time it is more powerful than (or as powerful as) the class-by-class approach.
For those cases where highlighting GO classes causing the difference is of interest, we suggest the following strategy: if class-by-class analysis fails to detect any significant classes but the global test provides a significant result, then highlight as significant the class with the smallest uncorrected p-value. This strategy allows the detection of additional significant classes without inflating the type I error rate.

Conclusion
In conclusion, the method presented here provides a suitable approach for making global comparisons between lists of genes and should be considered to be complementary to some of the existing ways of comparing lists of genes derived from microarray studies.
In those cases where the user is interested in focusing on a few genes or specific classes, other methods may be more suitable. However, when a global comparison based on the biological meaning of the list of selected genes is required, our method may be the option of choice. It is statistically reliable (Table 8 and http://estbioinfo.stat.ub. es/pubs/goProfiles1_BIF/goProfiles1.htm) and an adequate alternative to the chi-square homogeneity test, which is incorrect to compare GO profiles. Additionally, it may provide some extra insight into GO classes that Probability of rejecting the null hypothesis of equality of profiles at a nominal 5% significance level in different scenarios associated with real case studies at level 10 in the GO. In the column "testing procedure", "Class-by-class" stands for declaring global differences (i.e. rejecting the null hypothesis of profile equality) if at least one significant class is detected in a class-by-class analysis with correction for testing multiplicity; "Chi-square" stands for the classical chi-square test of homogeneity; "New global" stands for the global test presented in this paper and, finally, "Additional signif. classes" stands for step 3 in the algorithm proposed in the methods section, i.e. proportion of simulation replicates where additional significant classes were detected when a class-by-class analysis was unable of any detection.
prove to be interesting but which would not be detected otherwise. This is more apparent at deep GO levels for sparse frequency tables (i.e. profiles), where correcting for a great degree of testing multiplicity imposes a heavy load on the class-by-class approach. Finally, it is worth mentioning that the applicability of our global comparison method largely surpasses the scope of our conducting examples. As mentioned in the second example (prostate cancer studies), comparison of functional profiles associated with distinct datasets can be used to decide if they can be merged or used combinedly for further studies. Another interesting application appears if one is interested in comparing gene signatures, that is groups of genes whose combined expression pattern is uniquely characteristic of a given condition or disease state and which are usually used to characterize or to predict this condition. One problem with signatures is that in many cases there are many signatures for similar situations. A comparison of their associated functional profiles may be used to help deciding if two given signatures are functionally equivalent. Also other useful applications may arise -as kindly reported by a referee -when one is interested in comparing the effect of applying different filtering methods. If two lists of genes obtained by applying different filters, or different cutoffs, do not differ in their functional profiles they might be considered functionally equivalent. Last, although outside the scope of this journal, the method may also have potential applications, like to compare lists of words (e.g. from two literary styles) in terms of their annotation profiles in semantic databases.

1
Fisher's exact test constitutes nearly a standard in this context and does not require new software development; clearly, a 2 × 2 version of our own test will be a more canonical possibility, but make the method less comparable to the mainstream approach without avoiding the need for a multiple testing correction., 2 Given the dynamic nature of the content of biological databases, these lists may have experienced some changes. In order to have a "frozen" version they have not been modified since they were included in the goProfiles package (first version Bioconductor 2.3) so that, in spite of possibly being out of date, they allow the examples in the package to be reproduced.