Comparing functional annotation analyses with Catmap

Background Ranked gene lists from microarray experiments are usually analysed by assigning significance to predefined gene categories, e.g., based on functional annotations. Tools performing such analyses are often restricted to a category score based on a cutoff in the ranked list and a significance calculation based on random gene permutations as null hypothesis. Results We analysed three publicly available data sets, in each of which samples were divided in two classes and genes ranked according to their correlation to class labels. We developed a program, Catmap (available for download at ), to compare different scores and null hypotheses in gene category analysis, using Gene Ontology annotations for category definition. When a cutoff-based score was used, results depended strongly on the choice of cutoff, introducing an arbitrariness in the analysis. Comparing results using random gene permutations and random sample permutations, respectively, we found that the assigned significance of a category depended strongly on the choice of null hypothesis. Compared to sample label permutations, gene permutations gave much smaller p-values for large categories with many coexpressed genes. Conclusions In gene category analyses of ranked gene lists, a cutoff independent score is preferable. The choice of null hypothesis is very important; random gene permutations does not work well as an approximation to sample label permutations.


Rank(i)
The rank sum is located in the interval from n i=1 i = n(n+1) 2 to n i=1 (N − n + i) = n(2N −n+1) 2 . Another measure is the ROC area (Receiver Operating Characteristic) defined as follows: Take all pairs of genes (g 1 , g 2 ), where g 1 is one of the n category genes and g 2 is one of the N − n genes from the list but not from the category. There are n(N − n) such pairs. Find all pairs (g 1 , g 2 ) where the rank of g 1 is lower than g 2 . Call the number of such pairs num. The ROC area is then the fraction of all pairs with g 1 having lower rank than g 2 .
We see that ROC is a number between 0 and 1. Higher ROC means that the category genes are located more towards the top of the ranked list. There is a relation between the rank sum, R, and the ROC area, ROC, which comes about as follows: For category gene i with Rank(i) there are N − Rank(i) genes with higher rank. Some of these however could be category genes. The sum n i=1 (N − Rank(i)) counts the total number of correctly ordered pairs except that pairs of category genes are included as well. To get num we should subtract the number of pairs of category genes, of which there are n(n−1) Since ROC area and rank sum are equivalent it makes no difference which one we work with.
We have chosen to give the ROC area as the output of the program since it has the advantage that it is always normalized in the interval from 0 to 1. For the calculation of p-values, discussed below, it is easier to use the rank sum.

p-value
Now we turn to the calculation of p-values. For a given ranked list and category with a rank sum R the p-value is the probability that a random ranked list and the same category produces a smaller rank sum than R. The p-value depends on the distribution (null hypothesis) from which ranked lists are drawn. For most null hypotheses it would be necessary to draw a large set of ranked lists, calculate the rank sums and count the number of times it was smaller than R. However for the null hypothesis of randomly permuted genes it can be done analytically. Randomly permuting the ranked list and mapping the category genes to the list is the same as selecting the ranks of the category genes randomly. In other words the ranks, Rank(i), are n distinct numbers in the interval from 1 to N , randomly drawn. The total number of such configurations is N n . Let A(N, n, R) be the number of such configurations where the rank sum is less than or equal to R. The p-value is then We use three different methods for calculating the p-value analytically.

Iterative method
The iterative method is exact. It employs the fact that which follows from the observation that either rank N is occupied by a category gene or not. If rank N is not occupied by a category gene, the n category genes are sitting in a list of length N − 1 giving rise to the first term. If rank N is occupied by a category gene, the n − 1 others must be located in a list of length N − 1 with a rank sum less than or equal to R − N . This relation can be repeatedly used to reduce the parameters to trivial cases. The A(N, n, R) found in this way is exact but the method is slow for large parameters.

Gaussian
For large parameter values the distribution of the rank sum is approximated by a Gaussian distribution ( Troyanskaya et al., 2002, Walpole et al., 1998. Only the mean and variance of the rank sum needs to be known in order to use the Gaussian approximation. The mean is easily found: The variance can be found by a similar but more tedious calculation to be V ar(rank sum) = n(N − n)(N + 1) 12 The p-value is approximated by the probability to draw a smaller number than the real rank sum from a Gaussian distribution with this mean and variance. The Gaussian approximation is very fast but can deviate from the exact result. In our experience, it deviates most in the tail of the distribution, which is where the most significant p-values are found. However, it is usually not important whether the p-value is 10 −10 or 10 −12 . Qualitatively the Gaussian approximation gives the correct result.

Volume approximation
The volume approximation is used for cases where the exact method is too slow and the Gaussian result needs to be improved. It works as follows. Take a coordinate system with n coordinates. Each coordinate corresponds to a category gene. Consider points in the coordinate system with integer coordinates, each of which is between 1 and N , and no two coordinates of the point are equal. Such allowed points are in exact correspondence with a set of ranks of a category. Draw a hyperplane in the coordinate system with the equation The number A(N, n, R) is then given by the number of allowed points below the hyperplane. The volume approximation replaces the number of allowed points below the hyperplane with the volume below the hyperplane and the total number of allowed points with the volume of the n dimensional hypercube. This approximation introduces two errors. Firstly it ignores that no two ranks must be equal, secondly there is a boundary error at the hyperplane because of the integrality of the points. The first error is small if n is small compared to N . If n is close to N one can consider the complement of the category genes instead. The second error is small unless R is very close to its minimum.
After rescaling with a factor of N the approximate p-value is equal to the volume of the intersection of the n dimensional hypercube and the points below the hyperplane with the equation n i=1 x i = r Here r = R N is a number between 0 and n. Call this volume V n (r). For small n it is readily calculated: It is, of course, generally true that V n (r) = 0 if r ≤ 0 and V n (r) = 1 if r ≥ n. The volume can be written as the value of an integral where Θ(x) is the indicator function, Using this formula and induction in n and the form of V 1 or V 2 it is seen that V n (r) is a piecewise polynomial function. The polynomials are pieced together at integer values of r. Let V n,k be the polynomial that gives V n in the interval [k; k + 1] where k is an integer between 0 and n − 1. Again using induction in n and the above equation it is furthermore seen that V n,k is a polynomial of degree n. We can thus write V n,k as We have expanded the polynomial in r−k instead of r and introduced n! for convenience, as will be clear below. In order to calculate the volume it is necessary to know the coefficients c n,k,j . One way to find them is to start with n = 1 and repeatedly use eq.(1). However it can be done faster by noting that V n is n − 1 times continuously differentiable. This can again be proven by induction using eq.(1). V n,k can now almost be derived from V n,k−1 using continuity of the function and the first n − 1 derivatives in the point r = k where these two polynomials glue together. Actually c n,k,j is the j derivative in r = k which explains the form of the expansion of V n,k . This determines the coefficients c n,k,j for j = 0, .., n − 1. The n'th degree coefficient c n,k,n is not determined by continuity. However it can be derived exactly. By matching (n − 1)'th degree coefficients in eq.(1) one gets c n,k,n = c n−1,k,n−1 − c n−1,k−1,n−1 Using the result for n = 1 and induction this equation has the unique solution c n,k,n = (−1) k n − 1 k Here the famous relation from Pascal's triangle was used. The polynomial V n,0 is easily found either by matching to the region r ≤ 0 or by induction or by direct calculation of the integral.
V n,0 (r) = 1 n! r n The procedure to find the volume for parameters n, r is now the following. Let k be the integer part of r. Start with V n,0 as above and use it to find V n,1 by continuity of the function and the n − 1 derivatives in the point r = 1. Then find V n,2 and so on until V n,k is found. The volume is then given by V n,k (r).

Choice of method to calculate the p-value
We need to choose between the 3 methods given the values of the 3 parameters: the length of the ranked list, N , the number of genes in the category n and the rank sum R. The choice is based on a balance between accuracy and speed. The iterative method is exact and is used whenever it is fast enough. By counting the number of variables, that needs to be calculated, the time complexity of the 3 methods can be estimated. For the iterative method the time is proportional to N · n · R. For the volume approximation it is proportional to n 2 · R N . The Gaussian method is for all practical purposes fast. The choice of method is as follows: If N · n · R ≤ 10 6 the iterative method is used since it is exact. In the special case R < N the rank sum effectively sets a cutoff on the number of relevant genes and the iterative method is still used provided n · R 2 ≤ 10 6 . If not the Gaussian p-value, p gauss is calculated. If p gauss > 0.1 it is used. This is because the Gaussian is reliable away from the extreme tail of the distribution. If p gauss ≤ 0.1 and n 2 · R N ≤ 10 5 the volume method is used. In the case that none of these conditions are satisfied the Gaussian result is given, even though it might be imprecise. The imprecision is not really serious however. Typically it could be the question of whether a pvalue is 10 −6 or 10 −8 . For any reasonable parameters it would never make a very significant category insignificant or vice versa. This finishes our description of the score functions and the algorithm for finding p-values.