 Methodology article
 Open Access
 Published:
Exact pvalues for pairwise comparison of Friedman rank sums, with application to comparing classifiers
BMC Bioinformatics volume 18, Article number: 68 (2017)
Abstract
Background
The Friedman rank sum test is a widelyused nonparametric method in computational biology. In addition to examining the overall null hypothesis of no significant difference among any of the rank sums, it is typically of interest to conduct pairwise comparison tests. Current approaches to such tests rely on largesample approximations, due to the numerical complexity of computing the exact distribution. These approximate methods lead to inaccurate estimates in the tail of the distribution, which is most relevant for pvalue calculation.
Results
We propose an efficient, combinatorial exact approach for calculating the probability mass distribution of the rank sum difference statistic for pairwise comparison of Friedman rank sums, and compare exact results with recommended asymptotic approximations. Whereas the chisquared approximation performs inferiorly to exact computation overall, others, particularly the normal, perform well, except for the extreme tail. Hence exact calculation offers an improvement when small pvalues occur following multiple testing correction. Exact inference also enhances the identification of significant differences whenever the observed values are close to the approximate critical value. We illustrate the proposed method in the context of biological machine learning, were Friedman rank sum difference tests are commonly used for the comparison of classifiers over multiple datasets.
Conclusions
We provide a computationally fast method to determine the exact pvalue of the absolute rank sum difference of a pair of Friedman rank sums, making asymptotic tests obsolete. Calculation of exact pvalues is easy to implement in statistical software and the implementation in R is provided in one of the Additional files and is also available at http://www.ru.nl/publish/pages/726696/friedmanrsd.zip.
Background
The Friedman [1] rank sum test is a widelyused nonparametric method for the analysis of several related samples in computational biology and other fields. It is used, for example, to compare the performance results of a set of (expressionbased) classifiers over multiple datasets, covering case problems, benchmark functions, or performance indicators [2,3,4]. Some recent examples of the numerous applications of the Friedman test in bioinformatics include [5,6,7,8,9,10,11,12,13,14,15,16,17]. The test is supported by many statistical software packages and it is routinely discussed in textbooks on nonparametric statistics [18,19,20,21,22,23].
The Friedman test procedure is an analysis of variance by ranks, i.e., observed rank scores or rank scores obtained by ordering ordinal or numerical outcomes, that is used when one is not willing to make strong distributional assumptions. A common approach is to present the test as a method for identifying treatment effects of k different treatments in a socalled randomized complete block design. This design uses n sets, called blocks, of k homogeneous units matched on some relevant characteristic, for example patients matched on SNP genotype. The k treatments are assigned randomly to the k units within each block, with each treatment condition being administered once within a block. The Friedman test is also conducted if the samples concern a repeated measures design. In such design each experimental unit constitutes a block that serves in all treatment conditions. Examples are provided by experiments in which k different treatments (classifiers) are compared on a single experimental unit (dataset), or if k units (e.g., genes, products, candidates) are ranked in order by each of n ‘judges’ (algorithms, panelists). In all these settings the objective is to determine if the k populations from which the observations were made are identically distributed.
Applied to classifier comparison, the null hypothesis for the Friedman test is that the performance results of the k classifiers over n datasets are samples that have been drawn from the same population or, equivalently, from different populations with the same distribution [18]. To examine this hypothesis, the test determines whether the rank sums of the k classifiers included in the comparison are significantly different. After applying the omnibus Friedman test and observing that the rank sums are different, the next step is to compare all classifiers against each other or against a baseline classifier (e.g., newly proposed method or best performing algorithm). In doing so, a series of comparisons of rank sums (i.e., rank sum difference tests) is performed, adjusting the significance level using a Bonferroni correction or more powerful approaches to control the familywise TypeI error rate [3, 4].
Preferably one should use the exact null distribution of the rank sum difference statistic in these subsequent analyses. Only if the decision on the null hypothesis is based on the exact distribution is the probability of committing a TypeI error known exactly. However, the exact distribution and the associated true tail probabilities are not yet adequately known. To be sure, tables of exact critical values at standard significance levels (e.g., [18, 21, 22]) and of exact pvalues [24] are available for small values of k and n, for which complete enumeration is possible. Also, the lower order moments of the exact sampling distribution have been documented in detail [25], and Stuart [26] proved the conjecture of Whitfield [24] that, on the null hypothesis, the difference between rank sum values is asymptotically normally distributed as n tends to infinity. Further, in a recent study Koziol [27] used symbolic computation for finding the distribution of absolute values of differences in rank sums. Apart from these important contributions there is, to the best of our knowledge, no publication available in the probability theory, rank statistics or other literature that addresses the exact distribution of the rank sum difference statistic.
As the null distribution in the general case is unknown and exact computation seemingly intractable, it is generally recommended to apply a largesample approximation method to test the significance of the pairwise difference in rank sums. It is well known, however, that calculating probabilities using an asymptotic variant of an exact test may lead to inaccurate pvalues when the sample size n is small, as in most applications of the Friedman test, and thereby to a false acceptance or false rejection of the null hypothesis. Furthermore, there are several largesample tests available with different limiting distributions, and these tests may vary substantially in their results. Consequently, there is little agreement in the nonparametric literature over which approximate method is most appropriate to employ for the comparison of Friedman rank sums [22]. This statement refers both to approximate tests of significance for the comparison of all ( ^{k}_{2} ) = k(k − 1)/2 pairs of treatments, and to tests for the comparison of k − 1 treatments with a single control. Obviously, the utility of the asymptotic tests depends on their accuracy to approximate the exact sampling distribution of the discrete rank sum difference statistic.
The purpose of this note is twofold. The foremost aim is to provide an expression for calculating the exact probability mass function of the pairwise differences in Friedman rank sums. This expression may be employed to quickly calculate the exact pvalue and associated statistics such as the critical difference value. The calculation does not require a complicated algorithm and as it is easily incorporated into a computer program, there is no longer need to resort to asymptotic pvalues. We illustrate the exact method in the context of two recently published analyses of the performance of classifiers and data projection methods. The second aim is to examine under what circumstances the exact distribution and the associated exact statistics offer an improvement over traditional approximate methods for Friedman rank sum comparison.
It is important to note at the outset that this article is not concerned with the Friedman test itself. The Friedman test is an overall test that evaluates the joint distribution of rank sums to examine equality in treatment distributions. Computation of the exact joint distribution under the null is discussed by van de Wiel [28], and an efficient algorithm for computing the exact permutation distribution of the Friedman test statistic is available in StatXact [29]. The present paper offers an overall exact test for pairwise comparison of Friedman rank sums. The reason is essentially that researchers are usually not only interested in knowing whether any difference exists among treatments, but also in discovering which treatments are different from each other, and the Friedman test is not designed for this purpose. Although the type of test dealt with here is not the same as the Friedman test, we will briefly discuss the latter as the procedures have important elements in common, such as the global null hypothesis. Also, we assume in our discussion that the available data are such that it is appropriate to perform simultaneous rank sum tests. Hence, we ignore empirical issues such as insufficient power (too few datasets), selection bias (nonrandom selection of datasets), and like complications that, as noted by Boulesteix et al. ([30]; see also [31]), tend to invalidate statistical inference in comparative benchmarking studies of machine learning methods solving realworld problems. In ANOVA, the term ‘treatment’ is used as a common term for the grouping variable for which a response is measured. To accommodate the wide variety of applications of the Friedman test, the more general term ‘group’ is used instead of ‘treatment’ in the remainder of this paper. The term subject is used hereafter to include both objects and individuals.
Methods
Friedman data
To perform the Friedman test the observed data are arranged in the form of a complete twoway layout, as in Table 1A, where the k rows represent the groups (classifiers) and the n columns represent the blocks (datasets).
The data consist of n blocks with k observations within each block. Observations in different blocks are assumed to be independent. This assumption does not apply to the k observations within a block. The test procedure remains valid despite withinblock dependencies [32]. The Friedman test statistic is defined on ranked data so unless the original raw data are integervalued rank scores, the raw data are ranktransformed. The rank entries in Table 1B are obtained by first ordering the raw data {x _{ ij }; i = 1, …, n, j = 1, …, k} in Table 1A columnwise from least to greatest, within each of the n blocks separately and independently, and then to assign the integers 1,…,k as the rank scores of the k observations within a block. The row sum of the ranks for any group j is the rank sum defined as R _{ j } = ∑ ^{n}_{ i = 1} r _{ ij }.
Null hypothesis
The general null hypothesis of the Friedman test is that all the k blocked samples, each of size n, come from identical but unspecified population distributions. To specify this null hypothesis in more detail, let X _{ ij } denote a random variable with unknown cumulative distribution function F _{ ij }, and let x _{ ij } denote the realization of X _{ ij }.
The null hypothesis can be defined in two ways, depending on whether blocks are fixed or random [33]. If blocks are fixed, then all the k × n measurement values are independent. If there are k groups randomly assigned to hold k unrelated X _{ ij } within each block, as in a randomized complete block design, then the null hypothesis that the k groups have identical distributions may be formulated as
H _{0} : F _{ i1}(x) = … = F _{ ik }(x) = F _{ i }(x) for each i = 1, …, n,
where F _{ i }(x) is the distribution of the observations in the ith block [28, 33]. The same hypothesis, but more specific, is obtained if the usual additive model is assumed to have generated the x _{ ij } in the twoway layout [23]. The additive model decomposes the total effect on the measurement value into an overall effect μ, block i effect β _{ i }, and group j effect τ _{ j }. If the distribution function is denoted F _{ ij }(x) = F(x − μ − β _{ i } − τ _{ j }), the null hypothesis of no differences among the k groups may be stated as
and the general alternative hypothesis as
\( {H}_1:\kern0.5em {\tau}_{j_1}\ne {\tau}_{j_2} \) for at least one (j _{1}, j _{2}) pair.
Note that this representation also asserts that the underlying distribution functions F _{ i1}(x), …, F _{ ik }(x) within block i are the same, i.e., that F _{ i1}(x) = … = F _{ ik }(x) = F _{ i }(x), for each fixed i = 1, …, n.
If blocks are random, measurements from the same random block will be positively correlated. For example, if a single subject forms a block and k observations are made on the subject, possibly in randomized order, the withinblock observations are dependent. Such dependency occurs in a repeated measures design where n subjects are observed and each subject is tested under k conditions. Denote the joint distribution function of observations within block i by F _{ i }(x _{1}, …, x _{ k }). Then the null hypothesis of no differences among the k groups is the hypothesis of exchangeability of the random variables X _{ i1}, …, X _{ ik } [28, 34], formulated as
H _{0} : F _{ i }(x _{1}, …, x _{ k }) = F _{ i }(x _{ σ(1)}, …, x _{ σ(k)}) for i = 1, …, n,
where σ(1), …, σ(k) denotes any permutation of 1, …, k. The model underlying this hypothesis is that the random variables X _{ ij } have an exchangeable distribution. This is a suitable model for repeated measures, where it is not appropriate to assume independence within a block [32, 33]. We also note that this formulation of the null hypothesis and the one for fixed blocks are consistent against the same alternative, namely the negation of H _{0}. For a detailed discussion of this matter, see [35].
Whether blocks are fixed or random, if the null hypothesis is true, then all the permutations of 1, …, k are equally likely. There are k ! possible ways to assign k rank scores to the k groups within each block and all these intrablock permutations are equiprobable under H _{0}. As the same permutation argument applies to each of the n independent blocks, there are (k !)^{n} equally likely rank configurations of the rank scores r _{ ij } in the twoway layout [23]. Each of these permutations has a probability of (k !)^{− n} of being realized. This feature is used to evaluate the null distribution of the rank sums R _{ j }, by enumerating all the permutations of the twoway layout of ranks.
Friedman test statistic
Under the Friedman null hypothesis, the expected row sum of ranks for each group equals n(k + 1)/2. The Friedman test statistic
sums the squared deviations of the observed rank sums for each group, R _{ j }, from the common expected value for each group, n(k + 1)/2, under the assumption that the k group distributions are identical. For small values of k and n, the exact distribution of X ^{2}_{ r } has been tabled, for example, by Friedman [1]. An algorithm for computing the exact joint distribution of the Friedman rank sums under the null is discussed in [28]. For the special case of two paired samples, see [36].
Calculating the test statistic using the null distribution of the (k !)^{n} possible permutations is time consuming if k is large. However, Friedman [1] showed that as n tends to infinity, X ^{2}_{ r } converges in distribution to χ ^{2}_{ df = k − 1} , a chisquared random variable with k − 1 degrees of freedom. This result is used in the asymptotic Friedman test. The Friedman test rejects H _{0} at a prespecified significance level α when the test statistic X ^{2}_{ r } exceeds the 100(1 − α)th percentile of the limiting chisquared distribution of X ^{2}_{ r } with k − 1 degrees of freedom [1]. The test statistic needs to be adjusted if there are tied ranks within blocks [22, 23]. Also, various modifications of the Friedman test have been proposed, for example the F distribution as an alternative to the chisquared distribution [37], as well as generalizations, such as the SkillingsMack [38] test statistic for use in the presence of missing data. These and various other adjustments and nonparametric competitors to the Friedman test (e.g., KruskalWallis, Quade, Friedman aligned ranks test) are not discussed here (see [4, 22, 23]).
Pairwise comparison tests and approximate critical difference
Frequently, researchers are not only interested in testing the global hypothesis of the equality of groups but also, or even more so, in inference on the equality of equality of pairs of groups. Further, even if one is mainly interested in H _{0} and the hypothesis is rejected, a followup analysis may be conducted to determine possible reasons for the rejection. Such analysis may disclose group differences, but it might also reveal that none of the pairs is significantly different, despite a globally significant test result.
To address these issues it is expedient to test hypotheses of equality for pairs of groups using simultaneous comparison tests. These multiple comparison procedures may involve, in 1 × N (or manyone) comparisons, testing k − 1 hypotheses of equality of all noncontrol groups against the study control or, in N × N (allpairs) comparisons, considering k(k − 1)/2 hypotheses of equality between all pairs of groups. For both types of comparisons, largesample approximate tests have been designed. They are derived for the situation where n, the number of blocks (i.e., ‘sample size’), is large.
Table 2 displays the critical difference (CD) approximate tests for 1 × N and N × N comparisons of Friedman rank sums, as recommended in highlycited monographs and papers and popular textbooks on nonparametric statistics. The critical difference is the minimum required difference in rank sums for a pair of groups to differ at the prespecified alpha level of significance. It is to note that in many publications the CD statistic is calculated using the difference in rank sum averages, i.e., R _{ j }/n, rather than rank sums. The results are identical, since each group has n observations, if the test statistic formulas are modified appropriately.
When the null hypothesis of equidistribution of ranks in n independent rankings is true, and the condition of a large sample size is met, the differences in rank sums are approximately normally distributed [26]. Let d = R _{ i } − R _{ j }, with i ≠ j, be the rank sum difference among a pair of groups i and j. The support of rank sum difference d is the closure [−n(k − 1), n(k − 1)]. Under the null hypothesis, the expected value E(d) = 0 and the variance Var(d) = nk(k + 1)/6 [18, 23, 25]. As the distribution of d is symmetric around E(d) = 0, the skewness is zero, as are all odd order moments. The kurtosis coefficient, derived by Whitfield [24] as
is less than 3 (i.e., negative excess kurtosis), implying that the discrete rank sum difference distribution has thinner tails than the normal. Notice, however, that the kurtosis tends to 3 with increasing n, thus a normal approximation is reasonable. This implies that d has an asymptotic N(0, Var(d)) distribution and that the normal deviate \( d/\sqrt{\mathrm{Var}(d)} \) is asymptotically N(0, 1).
As can be seen in Table 2, the normal approximate test is recommended by various authors when all groups are to be compared against each other pairwise. It is also discussed by Demšar [2] as a test statistic to be employed when all groups are compared with a single control. Note that the normal test procedures control the familywise TypeI error rate by dividing the overall level of significance α by the number of comparisons performed (i.e., c _{1} in 1 × N, and c _{2} in N × N comparisons). There are more powerful competitors to this Bonferronitype correction available, such as the Holm, Hochberg, and Hommel procedures. These methods to control the overall false positive error rate are not elaborated in this paper. For a tutorial in the realm of classifier comparison, see Derrac et al. [4].
In addition to the ordinary normal approximation, simultaneous tests have been proposed that exploit the covariance structure of the distribution of the values of differences in rank sums. Whereas the n rankings are mutually independent under H _{0}, the rank sums and the rank sum differences are dependent and correlated as well. The correlation among the rank sum differences depends on the rank sums involved. Specifically, as reported by Miller [25], when the null hypothesis is true
Hence the correlation is zero for pairs of rank sum differences with no group in common, and 0.5 for pairs of differences with one group in common to both differences. The number of correlated pairs decreases as k increases. For a study involving k groups, the proportion of correlated pairs equals 4/(k + 1) [43]. Hence when k = 7, for example, 50% of the pairs are correlated, but when k = 79 only 5% are correlated.
As noted in various studies (e.g., [23, 25, 39]), for 1 × N comparisons this correlation structure implies that, when H _{0} is true and n tends to infinity, the distribution of the differences between the k − 1 group rank sums and the control rank sum coincides with an asymptotic (k − 1) variate normal distribution with zero means. The critical difference value can therefore be approximated by the test statistic labeled CD _{ M } in Table 2, where the constant \( {m}_{\alpha, df= k1,\rho ={\scriptscriptstyle \frac{1}{2}}} \) is the upper αth percentile point for the distribution of the maximum value of (k − 1) equally correlated N(0,1) random variables with common correlation \( \rho ={\scriptscriptstyle \frac{1}{2}}. \) The procedure has an asymptotic familywise error rate equal to α [23, 25].
For N × N comparisons, it means that the covariance of the rank sum differences equals the covariance of the differences between k independent random variables with zero means and variances nk(k + 1)/12. Thus, the asymptotic distribution of \( max\left\{\left{R}_i{R}_j\right\right\}/\sqrt{nk\left( k+1\right)/12} \) coincides with the distribution of the range (Q _{ k,∞}) of k independent N(0, 1) random variables. The associated test statistic is CD _{ Q }, where the constant q _{ α,df = k,∞} is the upper αth percentile point of the Studentized range (q) distribution with (k, ∞) degrees of freedom [23, 25, 39]. Again, as the test considers the absolute difference of all k groups simultaneously, the asymptotic familywise error rate equals α [23, 25].
The Friedman statistic test itself gives rise to the simultaneous test mentioned in the bottom row of Table 2. The null hypothesis is accepted if the difference in rank sums fails to exceed the critical value \( C{D}_{\chi^2}. \) This asymptotic chisquared approximation is recommended in some popular textbooks, although Miller [25] has argued that the probability statement is not the sharpest of tests.
Statistical power and alternative tests
Note that the CD test statistics presented in Table 2 do not require information about the withinblock ranks as determined in the experiment. Rather, the simultaneous rank tests all assume that within each block each observation is equally likely to have any available rank. When this is true, the quantity (k + 1)(k − 1)/12 is the variance of the withinblock rankings and nk(k + 1)/6 the variance of the difference between any two rank sums [25]. Hence the null distribution of d in the population has zero mean and known standard deviation. This is the precise reason why the normal approximate tests use the zscore as test statistic. However, it is important to emphasize in this context that the square root of nk(k + 1)/6 is the standard deviation of d when the overall null hypothesis is true, but not when it is false. It holds, similar to pvalues, only in a particular model, i.e. H _{0}; a model that may or may not be true. If the null hypothesis is false, the quantity nk(k + 1)/6 is typically an overestimate of the variance, and this causes simultaneous tests, approximate and exact, to lose power.
There are pairwise comparison tests for Friedman rank sums available that are computed on the observed rank scores rather than the rank sums. These tests, such as the RosenthalFerguson test [44] and the popular Conover test [45, 46], use the tscore as test statistic. The pairwise ttests are often more powerful than the simultaneous tests discussed above, however, there are also drawbacks. In brief, the RosenthalFerguson test uses the observed variances and covariance of the rank scores of each individual pair of groups, to obtain a standard error of d for the test of significance of the pairwise rank sum difference. This standard error is valid whether the null hypothesis of no pairwise difference is true or not. However, next to the formal constraint of the test that n should be larger than k + 1, the variance of d may be estimated poorly, as there are typically few degrees of freedom available for (co)variance estimation in smallsample Friedman test applications. Moreover, the observed (co)variances are different for each pair of groups. Consequently, it does not follow from the significance of a difference of a given rank sum A from another rank sum B, that a third rank sum C, more different from A than B is, would also be significantly different. This is an unpleasant feature of the test.
The Conover test estimates the standard deviation of d by computing a pooled standard error from the (co)variances of the observed rank scores of all groups, thus increasing statistical power. The method is similar to Fisher’s protected Least Significant Difference (LSD) test, applied to rank scores. In this methodology, no adjustment for multiple testing is made to the pvalues to preserve the familywise error rate at the nominal level of significance. Rather, the test is protected in the sense that no pairwise comparisons are performed unless the overall test statistic is significant. As in the Fisher protected LSD procedure, the Conover test has the property of incorporating the observed Fvalue of the overall test into the inferential decision process. However, in contrast to the Fisher protected LSD, which uses the observed Fvalue only in a 0–1 (‘go/no go’) manner, the Conover test uses the Fvalue in a smooth manner when computing the LSD. That is, it has the unusual characteristic that the larger the overall test statistic, the smaller the least significant difference threshold is for declaring a rank sum difference to be significant. The DuncanWaller test [47] has this same characteristic, but this test advocates a Bayesian approach to multiple comparisons with Bayes LSD. As the comparison tests in the second stage are conditional on the result of the first stage, the nominal alpha level used in the pairwise Conover test has no real probabilistic meaning in the frequentist sense. As noted by Conover and Iman ([48]: 2), “Since the α level of the secondstage test is usually not known, it is no longer a hypothesis test in the usual sense but rather merely a convenient yardstick for separating some treatments from others.”
Exact distribution and fast pvalue calculation
We present an exact test for simultaneous pairwise comparison of Friedman rank sums. The exact null distribution is determined using the probability generating function method. Generating functions provide an elegant way to obtain probability or frequency distributions of distributionfree test statistics [27, 28]. Application of the generating function method gives rise to the following theorem, the proof of which is in Additional file 1.
Theorem 1 For n mutually independent integervalued rankings, each with equally likely rank scores ranging from 1 to k, the exact probability to obtain pairwise difference d for any two rank sums equals
where
is the number of distinct ways a rank sum difference of d can arise, with d having support on d = [−n(k − 1), n(k − 1)].
Additional file 1 also offers a closedform expression for the exact pvalue of d. [49−51] The pvalue is defined as the probability of obtaining a result at least as extreme as the one observed, given that the null hypothesis is true. It is obtained as the sum of the probabilities of all possible d, for the same k and n, that are as likely or less likely than the observed value of d under the null. The exact pvalue is denoted P(D ≥ d; k, n), and it is computed using the expression
Calculating the exact pvalue with this triple summation expression provides a speedup of orders of magnitude over complete enumeration of all possible outcomes and their probabilities by a bruteforce permutation approach. For larger values of n, however, exact calculation is somewhat timeconsuming and to extend the practical range for performing exact tests, it is desirable to compute the pvalue more efficiently.
Also, because in practice multiple comparison tests are concerned with absolute differences, it is expedient to compute the cumulative probability of the absolute value of differences in rank sums. As the number of mass points of the symmetric distribution of d is an integer of the form 2n(k − 1) + 1, the distribution has an odd number of probabilities. This implies that, as the probability mass function of d is symmetric around zero, the probability mass to the left of d = 0 may be folded over, resulting in a folded distribution of nonnegative d. Consequently, the onesided pvalue of nonnegative d in the range d = 1, …, n(k − 1) may be obtained as the sum of the two onesided pvalues of the symmetric distribution with support d = [−n(k − 1), n(k − 1)]. As doubling the onesided pvalue leads to a pvalue for d = 0 that exceeds unity, the pvalue for d = 0 (only) is computed as P(D ≥ 0; k, n) = P(D = 0) + P(D ≥ 1), and this is exactly equal to 1.
To accelerate computation, we transform the double summation over the indices i and j in the expression for P(D ≥ d; k, n) to a summation over a single index, s say, using Theorem 2. The proof is given in Additional file 2.
Theorem 2 For nonnegative integers d and k
This reduction to a singlysum function implies that the pvalue can alternatively be calculated from the much simpler expression
and, as we will show, even for larger values of n in a computationally fast manner.
Software implementation
Although the two expressions for the exact pvalue are mathematically correct, straightforward computation may produce calculation errors. Even for moderate values of n (20 or so), the binomial coefficient that has d in the indices may become extremely large and storing these numbers for subsequent multiplication creates numerical overflow due to the precision limitation of fixedprecision arithmetic. One way to address this failure is to use a recurrence relation that satisfies the generating function [53, 54]. The recursions we examined were all computationally expensive to run, however, except for small values of n and/or k. A faster way to compute the exact pvalue correctly is to use arbitraryprecision arithmetic computation to deal with numbers that can be of arbitrary large size, limited only by the available computer memory.
The calculation of the pvalue of the absolute rank sum difference d given k and n is implemented in R [55]. The R code, which requires the package Rmpfr [56] for high precision arithmetic to be installed, is in Additional file 3. The script labeled pexactfrsd computes the exact pvalue P(D ≥ d), and additionally affords the possibility to compute the probability P(D = d), and the (cumulative) number of compositions of d (i.e., W(D = d) and W(D ≥ d)). The R code and potential future updates are also available at http://www.ru.nl/publish/pages/726696/friedmanrsd.zip.
To illustrate the derivations, Additional file 4 offers a smallsized numerical example (k = 3, n = 2), and Additional file 5 tabulates the number of compositions of d for combinations of k = n = 2,…,6, for inclusion in the OEIS [52]. As can be seen in Additional file 5, for small values of n the unfolded, symmetric distribution of d is bimodal, with modes at + 1 and − 1 [24]. This feature rapidly disappears as n increases, specifically, for k > 2 at n ≥ 6.
Hereafter, unless otherwise stated, we will consider the value of rank sum difference d to be either zero or positive, ranging from 0 to n(k − 1), and thus drop the absolute value symbol around d.
Incomplete rankings
Because the n rankings {1,2,…,k} are mutually independent, we may divide them into two (or more), equal or unequal sized parts, labeled (D _{1}; k, n _{1}) and (D _{2}; k, n _{2}), with ∑ ^{2}_{ t = 1} D _{ t } = D, and D _{ t } denoting the differences in rank sums of the two parts. The exact pvalue can be obtained using
where – as indicated by the summation’s lower bound – calculation is performed using the pvalue expression that allows for negative d. A unique and useful property of the exact method, which is not shared by the approximate methods discussed, is that it is easy to calculate pvalue probabilities for designs with unequal block sizes k; e.g., designs in which n _{1} has ranks {1, 2, …, k _{1}}, and n _{2} ranks {1, 2, …, k _{2}}, with k _{1} ≠ k _{2}. A general expression for calculating the exact pvalue in incomplete designs with j unequal sized parts is
where ∑ ^{j}_{ t = 1} D _{ t } = D, and an example in which n is subdivided into three parts, each with a unique value of k (k _{1}, k _{2}, k _{3}), is
Although the sum functions slow down calculation, this unique feature of exact pvalue computation enables one to conduct valid simultaneous significance tests whenever some withinblock ranks are missing by design. Such tests would be hard to accomplish using one of the largesample approximation methods. An empirical example will be given in the Application section.
Exact and mid pvalues
As pairwise differences with support on d = [−n(k − 1), n(k − 1)] are symmetrically distributed around zero under H _{0}, doubling onesided pvalue is the most natural and popular choice for an ordinary exact test. A test using exact pvalue guarantees that the probability of committing a TypeI error does not exceed the nominal level of significance. However, as the TypeI error rate is always below the nominal level, a significance test with exact pvalue is a conservative approach to testing, especially if the test involves a highly discrete distribution [57]. The mid pvalue, commonly defined as half the probability of an observed statistic plus the probability of more extreme values, i.e.,
ameliorates this problem. The mid pvalue is always closer to the nominal level than the exact pvalue, at the expense of occasionally exceeding the nominal size.
Tied rankings
The mid pvalue may also be used to handle tied rankings. When ties occur within blocks, the midrank (i.e., average of the ranks) is commonly assigned to each tied value. If, as a result of tied ranks, the observed rank sum difference is an integer value d plus 0.5, the pvalue may be obtained as the average of the exact pvalues of the adjacent integers d and d + 1, i.e., \( {\scriptscriptstyle \frac{1}{2}}\left[ P\left( D\ge d\right)+ P\left( D\ge\ d+1\right)\right], \) and this is equivalent to the mid pvalue. It is to note that the resulting probability is not exactly valid. Exact pvalues represent exact frequency probabilities of certain events, and mid pvalues have no such frequency interpretation. It may be argued, however, that this interpretational disadvantage is of little practical concern and that using mid pvalues is an almost exact frequency approach. For a discussion of other treatments of ties in rank tests, see [21].
Results and discussion
Time performance
The R program computes the exact pvalue P(D ≥ d; k, n) at a fast speed. It takes about half a second, for example, to calculate the exact pvalue for the rather demanding problem d = k = n = 100, on a HP desktop computer using the interpreted R language running under Windows 7 with an Intel Core i7 processor at 2.9GHz. To examine the effects of d, k and n on the algorithm’s runtime, we measured the time it takes to calculate the exact pvalue of d = 1 and d = n(k − 1) − 1, for n = 2, …, 100, and k = 10 and k = 100. The two support values next to the endpoints of the distribution were taken as the pvalues of the lower and upper support boundaries can be trivially obtained as 1 and 2{k(k − 1)}^{− n}, respectively. The computation time (in seconds) is shown in Fig. 1.
The figure indicates that running time is no limitation when it comes to calculating the exact pvalue, even for larger problems. Computation time is moderately affected by the magnitude of the computed pvalue. The smaller the pvalue is, the faster the computation speed. For rank sum difference d = 1 running time increases polynomially (of maximum order 3) with increasing n, and for d = n(k − 1) − 1 it increases virtually linearly. Also, for d = 1, the minor runtime difference between k = 10 and k = 100 increases slowly with increase in value of n. For d = n(k − 1) − 1 the time to do the calculation is essentially the same for k = 10 as for k = 100. In sum, these timing results show that the exact method admits an algorithm that is fast for all k and n values typically encountered in empirical studies testing differences in Friedman rank sums, such as those comparing classifiers. This quality makes the algorithm for exact calculation appealing, compared to alternative asymptotic approximations. Indeed, the algorithm is (considerably) faster than the one used here for evaluating the multivariate normalapproximate critical difference (CD _{ M }).
Exact distribution examples
We present some examples to illustrate the frequency probability distribution of rank sum difference d. The left panel of Fig. 2a displays the mass point probabilities P(D = d; k, n) for k = 5 and n = 5, over the entire support interval d = [0, 20]. The right panel shows the exact pvalues P(D ≥ d; k, n) for k = n = 5, i.e., the tailprobability at and beyond the value of d. The steps in the (cumulative) probability distributions are due to the discreteness of d, implying that events are concentrated at a few mass points. To adjust the pvalues for discreteness, one might opt to obtain mid pvalues. The mid pvalue is less than the exact pvalue by half the mass point probability of the observed result, and it behaves more like the pvalue for a test statistic with a continuous distribution.
The jumps at the steps decrease with increase in value of k and/or n. To exemplify this point, the left panel of Fig. 2b displays the less discrete pvalue distribution for k = n = 10. The powerful benefit of exact calculation is shown in the right panel of the same figure. The graph displays the log_{10}transformed pvalues obtained by exact calculation, with the cumulative normal density superimposed. As can be seen, the continuous normal is imperfect for estimating probabilities in the long right tail, where d values are large and pvalues are small. Note that the error increases as the pvalues decline. Compared to exact calculation, the cumulative normal is overly conservative in that it tends to overpredict the true pvalue and thus understate the evidence against the null.
For continuous test statistics, pvalues are known to be uniformly distributed over the interval [0,1] when the null hypothesis is true [58]. Also, uniformly distributed pvalues, with a mean of 0.5 and a variance of 1/12 ≈ 0.0833, produce a linear cumulative distribution function corresponding to the true overall null hypothesis, implying that points in the cumulative pvalue plot exhibit a straight line. We generated n = 5 Monte Carlo permutations of k = 50 integers from 1 to k inclusive, and calculated the rank sums and the exact pvalue of the rank sum differences. For this particular set of permutations, the mean of the ( ^{k}_{2} ) = 1, 225 pvalues was 0.512 and the variance 0.0824. The left panel of Fig. 2c confirms the intuitive notion that the discrete pvalues are approximately uniformly distributed under H _{0}. The right panel plots the 1 − pvalue against the number of pvalues (i.e., number of hypothesis tests), expressed in terms of proportions. As can be seen, the ensemble of pvalues in the cumulative plot is close to the diagonal line, as is to be expected when null hypotheses are all true.
Exact versus approximate critical differences
Table 3 presents the unadjusted and the Bonferroniadjusted exact and approximate critical differences for 1 × N and N × N comparisons of Friedman rank sums, for n = k = 5,10,25,50,100, at the familywise error rate of α=.05. The values for CD _{ M } were obtained using the R package mvtnorm [59], and the other approximate values using standard distributions available in the R stats package [55].
The first point to note from Table 3 is that, at the .05 level, the unadjusted normalapproximate critical differences (CD _{ N }) are identical to the exact CD for almost all k and n. In the event one chooses not to control the familywise error rate, the underestimation by CD _{ N } amounts to 1 at most, at least for the values of k and n considered here.
The close correspondence of normalapproximate and exact CD deteriorates once the pvalue threshold for significance is corrected for multiple testing. In 1 × N comparisons, the agreement is quite satisfactory as long as k is small relative to n, but the normal method overestimates the exact critical value if k is larger than n. The same goes for N × N comparisons, but worse. As can be seen, the normal approximation generally improves as n gets larger, for constant value of k, supporting largesample normal theory. However, the normal method overestimates the exact critical value considerably if k is larger than n. The disparity is most pronounced if k is large and n is small. For example, for k = 25 and n = 5, the exact CD is 83, whereas the (rounded) normal approximate critical difference value equals 88. The normal approximation produces larger than exact pvalues at the tails and larger than exact critical difference values.
The second point to note is that the ordinary normal method – while understating the evidence against the null hypothesis – is, by and large, the most accurate approximate test of the asymptotic variants studied here. The CD _{ M } for k − 1 comparisons with a control tends to underestimate the exact CD, even if n is large, which may lead one to incorrectly reject the null hypothesis. The same goes, but somewhat less so, for allpairs comparisons with CD _{ Q }. The Studentized range critical value is seen to be too liberal in the sense that it underestimates the critical difference value, even for larger values of n, and especially if n outnumbers k. The asymptotic procedure that draws on the chisquared distribution is seen to perform inadequately overall. As the inferences are suspect, this test statistic is not advocated as a criterion for judging whether differences in Friedman rank sums are significant.
Hence, in general, the normal approximation is overly conservative if n is smaller than k and the other approximations are too liberal if n is larger than k, and this holds even for relatively large values of n. For many parameter settings the biases are considerable. In any case, they are large enough to imply that if the observed rank sum difference is near to the critical value, the choice between exact and approximate methods can mean the difference between pairs of groups being considered significantly different or not. It is equally important to note that the above results apply to a familywise error rate of α=.05. The disparity between exact and asymptotic critical values increases, if the error rate is set to a lower value (e.g., .01). This issue is well visualized in the right panel of the earlier discussed Fig. 2b.
TypeI error and mid pvalues
The critical difference values denoted CD in Table 3 were obtained by setting the bound on TypeI error at 5%. For the asymptotic approximate methods, with a continuous reference distribution, the maximum probability of rejecting the null when it is in fact true is equal to α=.05. An exact test, however, keeps the actual probability of a TypeI error below 5%, as there are only certain pvalues possible when working with discrete data. Table 4 reports the actual probability of a TypeI error (i.e., exact pvalue) and the mid pvalue, for the unadjusted exact CD values presented in Table 3 (column 4).
Note that, whereas the alpha level was set at 5%, the actual probability of a TypeI error for the smallest n = k = 5 is a little above 3%. For larger values of k and n the ordinary exact test appears only slightly more conservative than the nominal level. Note further that the mid pvalue minimizes the discrepancy between the exact pvalue and the significance level. The mid pvalue occasionally exceeds the nominal level, and still tends to somewhat underrate the nominal in other instances, although necessarily less so than using the exact pvalue. As can be seen, the difference between exact and mid pvalue diminishes as k and/or n increases and the discreteness of the sample distribution diminishes.
We emphasize in this context that the inferential conservativeness associated with exact pvalues is introduced by testing at a prespecified alpha level of significance. In practice, it might be preferable to report observed levels of significance rather than testing at a particular cutoff value.
Normal error and continuity correction
Because the discrete rank sum difference distribution is approximated by a continuous distribution, a correction for continuity is advocated by some (e.g., [24]), to bring the asymptotic probabilities into closer agreement with the exact discrete probabilities. We restrict the discussion to the normal approximation and calculate the percentage relative error of the normal pvalues to the true pvalues using
where c is equal to 0.5 or 0 for the normal method with or without continuity correction, respectively. Figure 3 displays the percentage relative error R(d) versus exact pvalues, for n = k = 10,100.
The graphics indicate that the relative error decreases with increasing n, both for k = 10 and k = 100. They also show that, for k = 10 and n = 10,100, the normal approximation without continuity correction underestimates the true pvalue if the exact probabilities are large. However, small true pvalues are overestimated by the normal and this overestimation increases as the probabilities become smaller. Continuity correction brings large normal pvalues into closer correspondence with the exact pvalues, but for small pvalues (i.e., significant results) it may worsen agreement and increase overestimation by the normal. For k = 100, the rank sum difference distribution is less discrete and therefore correction for continuity has little effect. This suggests that the neglect of the continuity correction is not a serious matter, and may, indeed, occasionally be an advantage.
Finally, as indicated, the largesample approximations are derived for the situation where n is large. Frequently, however, the number of groups may be quite large whereas the number of replications per group is limited [60]. Such ‘large k, small n’ situation is fairly common in agricultural screening trials [61] for example, and it also occurs quite often in comparisons of classifiers using ranked data. Published examples in bioinformatics include classifier studies with dimensions k = 9 and n = 3 [62], k = 10 and n = 6 [63], and k = 13 and n = 4 [64]. A similar issue arises in the identification of k genes by ranking using n different algorithms, for example, k = 13 and n = 5 as in [65], and k = 88 and n = 12 as in [66]. Such ‘large k, small n’ data are common in geneexpression profiling studies [67, 68]. Particularly for these data conditions, the choice of an appropriate test statistic is vitally important to the validity of research inferences.
Application
We present two data examples to illustrate potential nonequivalence of exact and approximate inference, and the benefit of exact calculation. Recall that we assume that the data are such that it is appropriate to perform the Friedman test. We pass no judgement on this, as that would require expertise in the substantive fields and detailed ‘inhouse’ knowledge of selection and measurement procedures. For a proper statistical framework for comparison studies see Boulesteix et al. [30]. This review study also shows that realworld applications comparing classifiers are often underpowered. That is, in smallsample settings the differences between the performances of pairs of algorithms are sometimes so variable that one is unable to draw statistically meaningful conclusions.
To illustrate the benefit of exact calculation, Friedman rank data on the comparison of qPCR curve analysis methods were obtained from Ruijter et al. [69]. The aim of the comparison of the k = 11 methods was to test their performance in terms of the following (n = 4) indicators: bias, linearity, precision, and resolution in transcriptional biomarker identification. The null hypothesis is that there is no preferred ranking of the method results per gene for the performance parameters analyzed. The rank scores were obtained by averaging results across a large set of 69 genes in a biomarker data file.
Table 5 displays the Friedman rank sums of the methods and, in the upper top triangle, the absolute values of the differences in rank sums. We obtained the Bonferroniadjusted normalapproximate pvalue, Bonferroniadjusted exact pvalue, and Studentized range approximate pvalue for the 55 rank sum differences. The results are presented in the upper bottom, lower bottom, and lower top triangles of the table, respectively.
Straightforward comparison shows that the approximations are conservative estimates of the true probabilities. That is, the smallest exact pvalues are considerably smaller than both the normal and the Studentized range approximate pvalues. According to the normal approximate test there is, at a familywise error rate of .05, no evidence that the methods perform differently, except for Cy0 and FPFPCR, the pair of methods with the largest difference in rank sums. When applying the Studentized range distribution we find a rank sum difference of d = 31 or larger to be significant. The true pvalues are smaller however, and exact calculation provides evidence that the critical difference value at α=.05 is d = 30, implying that four pairs of methods perform significantly different. This example illustrates the practical implication of using exact pvalues in the sense that exact calculation uncovers more significantly different pairs of methods than the asymptotic approximations, and may thus lead to different conclusions.
We were reminded by the reviewers of this paper that the Friedman test assumes that the n blocks are independent, so that the measurement in one block has no influence on the measurements in any other block. This leads to questioning the appropriateness of the Friedman test in this application. We do not wish to make any firm judgement about this, other than making the observation that the rank scores presented in the source paper ([69]: Table 2) are strongly related. The same goes for the results of a similar analysis of much the same data by other researchers ([64]: Table 1).
The second illustration concerns exact calculation in incomplete designs. Zagar et al. [70] investigated the utility of k = 12 data transformation approaches and their predictive accuracy in a systematic evaluation on n = 10 cell differentiation datasets from different species (mouse, rat, and human) retrieved from the Gene Expression Omnibus. To compare the predictive accuracy performance of the k = 12 methods on the n = 10 datasets, they used the Friedman test. Table 6 presents the Friedman ranks obtained by ranking the raw scores presented in Table 1 of Zagar et al. [70].
Note that the ranks of Pathrecon and PCAMarkers for dataset GDS2688 are missing. Zagar et al. [70] therefore decided to exclude all ranks within GDS2688 from the computation of the rank sums and restricted their analysis to n = 9 datasets. The rank sums excluding GDS2688 are displayed in the rightmost column of Table 6.
Instead of deleting GDS2688, the missing data for Pathrecon and PCAMarkers could be dealt with by substitution, for example by imputing the mean of the observed raw scores, followed by reranking the 12 methods according to their scores on GDS2688. However, as noted by the authors, the score of PCAMarkers for GDS2688 is not given because “stem cell differentiation markers are not relevant for the process studied in this dataset” ([70]: 2549). Hence the rank score is missing by design, and thus imputation is inappropriate at least for the PCAMarkers method.
An alternative procedure is to divide the n = 10 independent ranking into two different parts, one consisting of k = 12 methods and n = 9 datasets and the other one having k = 10 methods and n = 1 dataset. The computation of exact pvalues in such incomplete design is readily accomplished, since the probabilities are easily obtained by the method outlined above. These pvalues afford the possibility to conduct valid significance tests using all available rank data.
The bottom part of Table 6 presents the exact pvalues obtained for the comparison of the MCEeuclidFC and the PLSAREAtime methods. Additional file 6 has the R code to reproduce the results. The nexttolast row displays the exact pvalues for the difference d = (73–36=)37 in rank sums, if the ranks for GDS2688 are not included in the sums. The bottom row shows the exact pvalues for the rank sums difference d = ([73 + 10][36 + 1]=)46 if the two rank sums include the available ranks of the methods for GDS2688. Note that for this particular comparison at least, the latter pvalues, whether adjusted or not, are considerable smaller than the pvalues obtained after listwise deletion of missing rank data.
The pvalue probabilities pertaining to difference of sums of all available rank data can also be estimated using permutation testing and most likely also with methodology such as Laplace approximation or the saddlepoint method. However, these stochastic and deterministic approximations tend to become rather complicated and more cumbersome to work with than the exact computation method described here.
Conclusions
We provide a combinatorial exact expression for obtaining the probability distribution of the discrete rank sum difference statistic for pairwise comparison of Friedman rank sums. The exact null distribution contributes to the improvement of tests of significance in the comparison of Friedman rank sums, and constitutes a framework for validating theoretical approximations to the true distribution. The numerical evaluations show that, in multiple comparison testing, determining the exact critical difference and the true pvalue offers a considerable improvement over largesample approximations in obtaining significance thresholds and achieved levels of significance. The empirical applications discussed exemplify the benefit, in practice, of using exact rather than asymptotic pvalues.
Of the largesample approximation methods considered in this study, the simple normal approximation corresponds most closely to the exact results, both for manyone and allpairs comparisons. However, the difference between exact and normal approximate pvalues can be large for significant events further in the tail of the distribution. Such events occur, in particular, whenever the number of groups k is large and the number of blocks n is small. In a multiple testing context with ‘large k and small n’, application of the normal approximation increases the probability of a TypeII error, hence false acceptance of the null hypothesis of ‘no difference’. The exact pvalues also greatly improve the ability to detect significant differences if the observed rank sum differences are close to the approximate critical value. In such situation, the choice between exact and approximate methods can mean the difference between pairs (classifiers) being considered significantly different or not. Further, we typically prefer tests that are as accurate as possible while still being fast to compute. As the exact pvalues can be computed swiftly by the method outlined in this note, there is no longer need to resort to occasionally flawed approximations.
Finally, the rank sum and rank product statistics are widely used in molecular profiling to identify differentially expressed molecules (i.e., genes, transcripts, proteins, metabolites) [67, 68, 71]. Molecule selection by ranking is important because only a limited number of candidate molecules can usually be followed up in the biological downstream analysis for subsequent study. The nonparametric statistic discussed here is potentially an additional new tool in the toolbox of methods for making justified, reproducible decisions about which molecules to consider as significantly differentially expressed.
Abbreviations
 CD:

Critical difference
 LSD:

Least significant difference
References
 1.
Friedman M. The use of ranks to avoid the assumption of normality implicit in the analysis of variance. J Am Stat Assoc. 1937;32:675–701.
 2.
Demšar J. Statistical comparisons of classifiers over multiple data sets. J Mach Learn Res. 2006;7:1–30.
 3.
García S, Herrera F. An extension on “Statistical comparisons of classifiers over multiple data sets” for all pairwise comparisons. J Mach Learn Res. 2008;9:2677–94.
 4.
Derrac J, García S, Molina D, Herrera F. A practical tutorial on the use of nonparametric statistical tests as a methodology for comparing evolutionary and swarm intelligence algorithms. Swarm Evol Comput. 2011;1:3–18.
 5.
Perrodou E, Chica C, Poch O, Gibson TJ, Thompson JD. A new protein linear motif benchmark for multiple sequence alignment software. BMC Bioinformatics. 2008;9:213.
 6.
Jones ME, Mayne GC, Wang T, Watson DI, Hussey DJ. A fixedpoint algorithm for estimating amplification efficiency from a polymerase chain reaction dilution series. BMC Bioinformatics. 2014;15:372.
 7.
de Souto MCP, Jaskowiak PA, Costa IG. Impact of missing data imputation methods on gene expression clustering and classification. BMC Bioinformatics. 2015;16:64.
 8.
Carvalho SG, GuerraSá R, de C Merschmann LH. The impact of sequence length and number of sequences on promoter prediction performance. BMC Bioinformatics. 2015;16 Suppl 19:S5.
 9.
Frades I, Resjö S, Andreasson E. Comparison of phosphorylation patterns across eukaryotes by discriminative Ngram analysis. BMC Bioinformatics. 2015;16:239.
 10.
Staržar M, Žitnik M, Zupan B, Ule J, Curk T. Orthogonal matrix factorization enables integrative analysis of multiple RNA binding proteins. Bioinformatics. 2016;32:1527–35.
 11.
Bacardit J, Widera P, MárquezChamorro A, Divina F, AguilarRuiz JS, Krasnogor N. Contact map prediction using a largescale ensemble of rule sets and the fusion of multiple predicted structural features. Bioinformatics. 2012;28:2441–8.
 12.
Allhoff M, Seré K, Chauvistré H, Lin Q, Zenke M, Costa IG. Detecting differential peaks in ChIPseq signals with ODIN. Bioinformatics. 2014;30:3467–75.
 13.
Gusmao EG, Dieterich C, Zenke M, Costa IG. Detection of active transcription factor binding sites with the combination of DNase hypersensitivity and histone modifications. Bioinformatics. 2014;30:3143–51.
 14.
Gong H, Liu H, Wu J, He H. Data construction for phosphorylation site prediction. Brief Bioinform. 2014;15:839–55.
 15.
Xue LC, Rodrigues JPGLM, Dobbs D, Honavar V, Bonvin AMJJ. Templatebased protein–protein docking exploiting pairwise interfacial residue restraints. Brief Bioinform. 2016. doi:10.1093/bib/bbw027.
 16.
Iranzo J, Gómez MJ, López de Saro FJ, Manrubia S. Largescale genomic analysis suggests a neutral punctuated dynamics of transposable elements in bacterial genomes. PLoS Comput Biol. 2014;10, e1003680.
 17.
Pontes B, Giráldez R, AquilarRuiz JS. Configurable patternbased evolutionary biclustering of gene expression data. Algorithm Mol Biol. 2013;8:4.
 18.
Siegel S, Castellan Jr NJ. Nonparametric Statistics for the Behavioral Sciences. 2nd ed. New York: McGrawHill; 1988.
 19.
Daniel WW. Applied Nonparametric Statistics. 2nd ed. Boston: Houghton Mifflin; 1990.
 20.
Zarr JH. Biostatistical analysis. 4th ed. Upper Saddle River: PrenticeHall; 1999.
 21.
Gibbons JD, Chakraborti S. Nonparametric Statistical Inference. 4th ed. New York: Marcel Dekker; 2003.
 22.
Sheskin DJ. Handbook of parametric and nonparametric statistical procedures. 5th ed. Boca Raton: Chapman and Hall/CRC; 2011.
 23.
Hollander M, Wolfe DA, Chicken E. Nonparametric statistical methods. 3rd ed. New York: Wiley; 2014.
 24.
Whitfield JW. The distribution of the difference in total rank value for two particular objects in m rankings of n objects. Brit J Statist Psych. 1954;7:45–9.
 25.
Miller Jr RG. Simultaneous statistical inference. New York: McGrawHill; 1966.
 26.
Stuart A. Limit distributions for total rank values. Brit J Statist Psych. 1954;7:31–5.
 27.
Koziol JA. A note on multiple comparison procedures for analysis of ranked data. Universal Journal of Food and Nutrition Science. 2013;1:11–5.
 28.
van de Wiel MA. Exact null distributions of quadratic distributionfree statistics for twoway classification. J Stat Plan Infer. 2004;120:29–40.
 29.
Cytel. StatXact: Statistical Software for Exact Nonparametric Inference. Cambridge: Cytel Software Corporation; 2016.
 30.
Boulesteix AL, Hable R, Lauer S, Eugster MJA. A statistical framework for hypothesis testing in real data comparison studies. Am Stat. 2015;69:201–12.
 31.
Boulesteix AL. On representative and illustrative comparisons with real data in bioinformatics: response to the letter to the editor by Smith et al. Bioinformatics. 2013;20:2664–6.
 32.
Jensen DR. Invariance under dependence by mixing. In: Block HW, Sampson AR, Savits TH, editors. Topics in Statistical Dependence. Lectures Notes  Monograph Series Volume 16. Hayward: Institute of Mathematical Statistics; 1990. p. 283–94.
 33.
Hettmansperger TP. Statistical inference based on ranks. New York: Wiley; 1984.
 34.
Puri ML, Sen PK. Nonparametric methods in multivariate analysis. New York: Wiley; 1971.
 35.
Laurent RS, Turk P. The effects of misconceptions on the properties of Friedman’s test. Commun Stat Simulat. 2013;42:1586–615.
 36.
Munzel U, Brunner E. An exact paired rank test. Biometrical J. 2002;44:584–93.
 37.
Iman RL, Davenport JM. Approximations of the critical region of the Friedman statistic. Comm Stat A Theor Meth. 1980;9:571–95.
 38.
Skillings JH, Mack GA. On the use of a Friedmantype statistic in balanced and unbalanced block designs. Technometrics. 1981;23:171–7.
 39.
Nemenyi PB. Distributionfree multiple comparisons, PhD thesis. Princeton: Princeton University; 1963.
 40.
Desu MM, Raghavarao D. Nonparametric statistical methods for complete and censored data. Boca Raton: Chapman and Hall/CRC; 2004.
 41.
Bortz J, Lienert GA, Boehnke K. Verteilungsfreie Methoden in der Biostatistik. Berlin: Springer; 1990.
 42.
Wike EL. Data analysis. A statistical primer for psychology students. New Brunswick: Aldine Transaction; 2006.
 43.
Saville DJ. Multiple comparison procedures: the practical solution. Am Stat. 1990;44:174–80. doi:10.2307/2684163.
 44.
Rosenthal I, Ferguson TS. An asymptotically distributionfree multiple comparison method with application to the problem of n rankings of m objects. Brit J Math Stat Psych. 1965;18:243–54.
 45.
Conover WJ. Practical x. 3rd ed. New York: Wiley; 1990.
 46.
Sprent P, Smeeton NC. Applied nonparametric statistical methods. 3rd ed. Boca Raton FL: Chapman and Hall/CRC; 2001.
 47.
Waller RA, Duncan DB. A Bayes rule for symmetric multiple comparisons problem. J Am Stat Assoc. 1969;64:1484–503. doi:10.2307/2286085.
 48.
Conover WJ, Iman RL. On multiplecomparisons procedures. Technical report LA7677MS. Los Alamos: Los Alamos Scientific Laboratory. 1979.
 49.
Feller W. An introduction to probability theory and its applications, volume I. New York: Wiley; 1968.
 50.
Koziol JA, Feng AC. A note on the genome scan metaanalysis statistic. Ann Hum Genet. 2004;68:376–80.
 51.
Szapudi I, Szalay A. Higher order statistics of the galaxy distribution using generating functions. Astrophys J. 1993;408:43–56.
 52.
OEIS Foundation Inc. The OnLine Encyclopedia of Integer Sequences, http://oeis.org; 2011.
 53.
Tsao CK. Distribution of the sum in random samples from a discrete population. Ann Math Stat. 1956;27:703–12.
 54.
Dobrushkin VA. Methods in algorithmic analysis. Boca Raton: Chapman and Hall/CRC; 2009.
 55.
R Core Team. R: A Language and Environment for Statistical Computing. Vienna: R Foundation for Statistical Computing; 2012.
 56.
Maechler M. Rmpfr: R MPFR – Multiple Precision FloatingPoint Reliable, Version 0.60, December 4 2015, https://cran.rproject.org/web/packages/Rmpfr/index.html
 57.
Agresti A. Categorical data analysis. 2nd ed. New York: Wiley; 2002.
 58.
Schweder T, Spjøtvoll E. Plots of Pvalues to evaluate many tests simultaneously. Biometrika. 1982;69:493–502.
 59.
Genz A, Bretz F, Miwa T, Mi X, Leisch F, Scheipl F, Bornkamp B, Maechler M, Hothorn T. Mvtnorm: multivariate normal and t distribution. Version. 2016;1. https://cran.rproject.org/web/packages/mvtnorm/.
 60.
Bathke A, Lankowski D. Rank procedures for a large number of treatments. J Stat Plan Infer. 2005;133:223–38.
 61.
Brownie C, Boos DD. Type I error robustness of ANOVA and ANOVA on ranks when the number of treatments is large. Biometrics. 1994;50:542–9.
 62.
Walia RR, Caragea C, Lewis BA, Towfic F, Terribilini M, ElManzalawy Y, Dobbs D, Honavar V. ProteinRNA interface residue prediction using machine learning: an assessment of the state of the art. BMC Bioinformatics. 2012;13:89.
 63.
Wilm A, Mainz I, Steger G. An enhanced RNA alignment benchmark for sequence alignment programs. Algorithms Mol Biol. 2006;1:19.
 64.
Bultmann CA, Weiskirschen R. MAKERGAUL: an innovative MAK2based model and software for realtime PCR quantification. Clin Biochem. 2014;47:117–22.
 65.
Nascimento CS, Barbosa LT, Brito C, Fernandes RPM, Mann RS, Pinto APG, Oliviera HC, Dodson MV, Guimarães SEF, Duarte MS. Identification of suitable reference genes for real time quantitative polymerase chain reaction assays on Pectoralis major muscle in chicken (Gallus gallus). PLoS One. 2015;10, e0127935.
 66.
Hosseini I, Gama L, Mac Gabhann F. Multiplexed component analysis to identify genes contributing to the immune response during acute SIV infection. PLoS One. 2015;10, e0126843.
 67.
Eisinga R, Breitling R, Heskes T. The exact probability distribution of the rank product statistics for replicated experiments. FEBS Lett. 2013;587:677–82.
 68.
Heskes T, Eisinga R, Breitling R. A fast algorithm for determining bounds and accurate approximate pvalues of the rank product statistic for replicate experiments. BMC Bioinformatics. 2014;15:367. doi:10.1186/s1285901403671.
 69.
Ruijter JM, Pfaffl MW, Zhao S, Spiess AN, Boggy G, Blom J, Rutledge RG, Sisti D, Lievens A, De Preter K, Derveaux S, Hellemans J, Vandesompele J. Evaluation of qPCR curve analysis methods for reliable biomarker discovery: bias, resolution, precision, and implications. Methods. 2013;59:32–46.
 70.
Zagar L, Mulas F, Garagna S, Zuccotti M, Bellazzi R, Zupan B. Stage prediction of embryonic stem cell differentiation from genomewide expression data. Bioinformatics. 2011;27:2546–53. doi:10.1093/bioinformatics/btr422.
 71.
Breitling R, Armengaud P, Amtmann A, Herzyk P. Rank products: a simple, yet powerful, new method to detect differentially regulated genes in replicated microarray experiments. FEBS Lett. 2004;573:83–92.
Acknowledgements
The authors greatly appreciate comments by three reviewers leading to substantial improvements of the manuscript.
Funding
Not applicable.
Availability of data and materials
The rank data discussed in the main text were obtained from Table 2 in Ruijter et al. [69], and from Table 1 in Zagar et al. [70]. The R code in Additional file 3 and potential future updates are also available at http://www.ru.nl/publish/pages/726696/friedmanrsd.zip.
Author’s contributions
RE designed the exact method, implemented the algorithm, and drafted the manuscript. BP assisted in the implementation in R and drafted the manuscript. TH and MTG supervised the study and drafted the manuscript. All authors read and approved the final manuscript.
Competing interests
The authors declare that they have no competing interests.
Consent to publish
Not applicable.
Ethics approval and consent to participate
Not applicable.
Author information
Affiliations
Corresponding author
Additional files
Additional file 1:
Proof of Theorem 1. (PDF 59 kb)
Additional file 2:
Proof of Theorem 2. (PDF 51 kb)
Additional file 3:
Friedmanrsd. A.zip file providing the script of the algorithm implemented in R. (ZIP 2 kb)
Additional file 4:
Numerical example for k = 3, n = 2. (PDF 67 kb)
Additional file 5:
Number of compositions of d for k,n = 2,…,6. (PDF 65 kb)
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Eisinga, R., Heskes, T., Pelzer, B. et al. Exact pvalues for pairwise comparison of Friedman rank sums, with application to comparing classifiers. BMC Bioinformatics 18, 68 (2017). https://doi.org/10.1186/s1285901714862
Received:
Accepted:
Published:
Keywords
 Friedman test
 Exact pvalue
 Rank sum difference
 Multiple comparison
 Nonparametric statistics
 Classifier comparison
 Machine learning