Mathematical method
We briefly sketch the derivation of the basic formula. Assume we distribute M balls over N urns according to an equidistribution. The probability $p(k_i, i)$ of finding $k_i$ urns each filled with exactly i balls is given by Eq. (2), where $\lfloor x \rfloor$ denotes the integer part of x.
Note that the probability of finding $k_i$ urns that contain exactly i balls differs from the probability of finding a given number of urns that contain at least i balls. The latter is a simple textbook problem, whereas the derivation of Eq. (2) requires quite involved algebra. The two probabilities are related by the inclusion-exclusion principle [2, 3]. For our purpose we need the average number of urns filled with exactly i balls, i.e., the first moments of the probabilities in Eq. (2). These values can be found in closed form by applying the method of generating functions for the descending factorial moments. The averages have been derived earlier in a different context; the details of the derivation can be found in [4, 1]:

$$\langle k_i \rangle = N \binom{M}{i} \left(\frac{1}{N}\right)^{i} \left(1 - \frac{1}{N}\right)^{M-i} \qquad (3)$$
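As an interesting detail of the solution, the average number of filled urns is given by the total number of urns minus the average number of empty ones, $N^* = N - \langle k_0 \rangle$, i.e. [1],

$$N^* = N \left[ 1 - \left(1 - \frac{1}{N}\right)^{M} \right] \qquad (4)$$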
Obviously, for small M (number of balls) a significant number of urns stay empty on average. Translating back to the language of biology, we arrive at a surprising result. Consider a population of N = 1000 species. If we investigate M = 5000 individuals, Eq. (4) yields N* ≈ 993.3, i.e., on average about 7 species are not found at all, although naive reasoning suggests that each species occurs about 5 times.
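The numerical example can be checked with a few lines of Python. This is a minimal sketch, assuming the standard occupancy result for M balls thrown independently into N equiprobable urns; the function names are illustrative and are not part of the described program.

```python
from math import exp, lgamma, log, log1p

def expected_occupancy(N, M, i):
    """Expected number of urns holding exactly i of M balls (N equiprobable urns)."""
    if i < 0 or i > M:
        return 0.0
    if N == 1:
        # A single urn necessarily receives all M balls.
        return 1.0 if i == M else 0.0
    # Work with logarithms so that large binomial coefficients do not overflow.
    log_binom = lgamma(M + 1) - lgamma(i + 1) - lgamma(M - i + 1)
    return exp(log(N) + log_binom - i * log(N) + (M - i) * log1p(-1.0 / N))

def expected_filled(N, M):
    """Average number of non-empty urns: N minus the expected number of empty ones."""
    return N - expected_occupancy(N, M, 0)

n_star = expected_filled(1000, 5000)
print(round(n_star, 1))  # prints 993.3
```

The logarithmic form is a design choice for numerical robustness only; for small M a direct evaluation of the binomial coefficient would give the same values.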
The moments $\langle k_i \rangle$ given in Eq. (3) allow us to reconstruct the rank ordered frequency distribution, since they describe how many events, on average, do not occur (zero times), how many occur once, twice, etc. Hence, the desired rank ordered frequency distribution is finally given by Eq. (5).
We apply Eq. (5) to predict the frequency distribution that arises from an equidistribution for different sample sizes M and compare it with direct numerical simulations (see Figs. 3 and 4). The predictions of Eq. (5) agree well with the numerical experiment.
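A comparison of this kind can be sketched in Python. Since Eq. (5) is not reproduced in this section, the construction of the theoretical curve below is an assumption: rank r is assigned the largest frequency i for which the rounded expected number of species occurring at least i times reaches r. All function names are hypothetical.

```python
import random
from collections import Counter
from math import exp, lgamma, log, log1p

def expected_occupancy(N, M, i):
    """Expected number of species found exactly i times among M individuals."""
    if i < 0 or i > M:
        return 0.0
    if N == 1:
        return 1.0 if i == M else 0.0
    log_binom = lgamma(M + 1) - lgamma(i + 1) - lgamma(M - i + 1)
    return exp(log(N) + log_binom - i * log(N) + (M - i) * log1p(-1.0 / N))

def rank_ordered_expected(N, M):
    """Assumed construction: fill ranks from the most frequent class downwards."""
    freqs = []
    cumulative = 0.0
    for i in range(M, -1, -1):
        # cumulative = expected number of species occurring at least i times
        cumulative += expected_occupancy(N, M, i)
        target = min(N, round(cumulative))
        freqs.extend([i] * (target - len(freqs)))
    freqs.extend([0] * (N - len(freqs)))  # guard against rounding shortfalls
    return freqs

def rank_ordered_simulated(N, M, rng):
    """Assign each of M individuals a species uniformly at random, then rank."""
    counts = Counter(rng.randrange(N) for _ in range(M))
    freqs = sorted(counts.values(), reverse=True)
    return freqs + [0] * (N - len(freqs))

N, M = 50, 200
theo = rank_ordered_expected(N, M)
sim = rank_ordered_simulated(N, M, random.Random(0))
for rank in (1, 10, 25, 50):
    print(rank, theo[rank - 1], sim[rank - 1])
```

A single simulation run fluctuates around the theoretical curve; averaging several runs gives a smoother comparison.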
Exploration of experimental data
The theoretical distribution of frequencies due to Eq. (5) can be compared with experimentally obtained frequencies. From the distance between the two (rank ordered) frequency distributions we can conclude whether the experimental data obey an equidistribution. To this end we have developed a web-based tool, available at http://bioinf.charite.de/equifreq/. The user interface offers four alternative input masks, which differ in the way the input file is generated:

(1) The measured frequencies $M_i$ of each species are given directly.

(2) The number of species N and the total number of individuals M are specified. Each individual is assigned a species by chance.

(3) As for (2), the rank ordered frequencies are computed, but with the generalization that each species is assigned an individual probability. The theoretical basis for this computation is not given here but will be published elsewhere [5].

(4) The last input mask is intended for the investigation of the spatial distribution of point mutations in genes, which is presently the most specialized application of the described program.
The program computes the expected frequency distribution according to Eq. (5) under the assumption that the species obey an equiprobability distribution. Three output files are generated: freq, ktheo and kexp. The file freq contains the rank ordered frequencies as generated from the input data set (cases (1) and (4)), or randomly according to an equiprobability distribution (case (2)) or a general distribution (case (3)). The file ktheo contains the moments for each i, i.e., the expected number of species occurring i times, according to Eq. (3) for given numbers N of species and M of individuals. For cases (1) and (4) the values of M and N are extracted from the input data; for (2) and (3) they are provided by the user. (Note that these expectation values are real numbers in general.) The third column of line i contains the value . The last file, kexp, contains the same data as ktheo, but based on the input data (cases (1) and (4)) or on the randomly generated data (cases (2) and (3)), respectively.

Besides the output files, the program generates a number of visualizations (see section Example: Distribution of point mutations in genes). In order to compare the experimental data with the mathematical prediction, both the experimental and the theoretical data are plotted in the same chart. Agreement of the two curves indicates that the experimental data obey an equidistribution (case (2)) or the specified distribution (case (3)), respectively.
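The relation between the theoretical and the experimental occupation numbers can be illustrated as follows. This is only a sketch: the actual layout of the ktheo and kexp files is not specified here, so the three-column rows are an assumption, the input frequencies are hypothetical toy data, and the function names are invented for illustration.

```python
from collections import Counter
from math import exp, lgamma, log, log1p

def expected_occupancy(N, M, i):
    """Expected number of species occurring exactly i times (equiprobable species)."""
    if i < 0 or i > M:
        return 0.0
    if N == 1:
        return 1.0 if i == M else 0.0
    log_binom = lgamma(M + 1) - lgamma(i + 1) - lgamma(M - i + 1)
    return exp(log(N) + log_binom - i * log(N) + (M - i) * log1p(-1.0 / N))

def occupancy_table(frequencies):
    """Rows (i, expected count, observed count), with N and M taken from the data."""
    N = len(frequencies)   # number of species
    M = sum(frequencies)   # number of individuals
    observed = Counter(frequencies)
    return [(i, expected_occupancy(N, M, i), observed.get(i, 0))
            for i in range(max(frequencies) + 1)]

toy_frequencies = [3, 0, 1, 2, 2, 0, 1, 1]  # hypothetical input, not real data
table = occupancy_table(toy_frequencies)
for i, k_theo, k_exp in table:
    print(f"{i}\t{k_theo:.3f}\t{k_exp}")
```

The expected counts are real numbers in general, while the observed counts are integers.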
It may happen that the curve of the rank ordered experimental data decays significantly more slowly than the corresponding theoretical curve due to Eq. (5). Since there is no distribution more homogeneous than the equidistribution, this situation can arise as a rare fluctuation (recall that the theoretical curve was generated from the averaged occupation numbers, Eq. (3)). In such cases there is no probability distribution $\{p_i\}$ that reproduces the experiment on average. This case can be evoked artificially when the species in the input file occur with almost identical frequencies.
The difference between the experimental rank ordered frequency distribution and the corresponding theoretical distribution (Eq. (5)) measures the degree of agreement of the input data with an equidistribution (case (1)) or with a specified distribution (case (3)). We define the score by
The significance of a particular difference score can be assessed by relating it to the distribution of difference scores. This distribution depends on M and N.
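One way to obtain such a distribution of scores for given M and N is Monte Carlo sampling. The sketch below rests on two loudly stated assumptions: the score formula is not reproduced here, so the sum of absolute rank-wise differences is purely a placeholder, and the reference curve is estimated by averaging simulated rank ordered frequencies instead of evaluating Eq. (5). All names are hypothetical.

```python
import random
from collections import Counter

def simulate_rank_ordered(N, M, rng):
    """Rank ordered frequencies of M individuals drawn uniformly from N species."""
    counts = Counter(rng.randrange(N) for _ in range(M))
    freqs = sorted(counts.values(), reverse=True)
    return freqs + [0] * (N - len(freqs))

def mean_rank_ordered(N, M, n_trials=1000, seed=1):
    """Stand-in reference curve: rank-wise average over simulated samples."""
    rng = random.Random(seed)
    totals = [0] * N
    for _ in range(n_trials):
        for rank, f in enumerate(simulate_rank_ordered(N, M, rng)):
            totals[rank] += f
    return [t / n_trials for t in totals]

def placeholder_score(ranked, reference):
    """Placeholder only: sum of absolute rank-wise differences."""
    return sum(abs(o - r) for o, r in zip(ranked, reference))

def empirical_p_value(observed, reference, n_trials=1000, seed=0):
    """Fraction of equidistributed samples scoring at least as high as the data."""
    N, M = len(observed), sum(observed)
    rng = random.Random(seed)
    s_obs = placeholder_score(sorted(observed, reverse=True), reference)
    exceed = sum(
        placeholder_score(simulate_rank_ordered(N, M, rng), reference) >= s_obs
        for _ in range(n_trials)
    )
    return s_obs, (exceed + 1) / (n_trials + 1)

reference = mean_rank_ordered(20, 40, n_trials=500)
clustered = [40] + [0] * 19  # hypothetical, strongly non-uniform data
score, p = empirical_p_value(clustered, reference, n_trials=500)
print(score, p)
```

Because the score distribution depends on M and N, the sampling has to be repeated for every combination of sample size and number of species.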
Example: Distribution of point mutations in genes
The increasing number of known point mutations and polymorphisms in many genes coding for pathogenetically important proteins offers the opportunity to apply statistical tests to correlate their type and location to evolutionary, biological and clinical features.
In each generation, replication introduces mutations into the genome, but frequently they remain unnoticed since they do not cause disease. These so-called polymorphisms or variants may occur either in regions of the genome that code for amino acid sequences or in noncoding segments. Changes of the DNA sequence that alter the amino acid sequence are frequently associated with diseases because the respective proteins cannot operate properly. Screenings for mutations using DNA of patients have been performed for many human diseases, and the identified mutations are accessible in mutation databases [6].
The detection of so-called mutation hot spots, i.e., sequence regions with many mutation positions, is important for the identification of the functional and genetic properties of the genetic code [7]. These hot spots must be distinguished from statistical fluctuations that occur even when the probabilities for mutations are identical for each residue position. Moreover, the spatial distribution of point mutations in genes is important for the localization of coding and noncoding parts in the genome.
We wish to apply the described method to the investigation of the amino acid sequence of the cystic fibrosis transmembrane conductance regulator. The unperturbed gene (wild type) is given as a sequence of 1480 letters, MQRSPLEKASVVSKLFFSWTRPILRKGYRQRLELSDIYQIPSVDSADNLSEKLER..., each standing for one amino acid [8]. A large number of mutations, i.e., deviations from this sequence, have been observed experimentally. Such mutations are available from databases, e.g. [6].
The codes above the underbraces denote the observed mutations; e.g., P5L means that at position 5 the amino acid proline (P) was found to be replaced by leucine (L).
We subdivided the sequence into 74 parts of equal length 20 and counted the number of point mutations in each part. In this way we obtain the measured frequencies $\{M_i\} = \{2, 2, 4, 5, 4, 5, 2, 2, 2, 3, 2, 1, 0, 0, 1, 4, \dots\}$, which serve as input data. (The subdivision into parts may be repeated with a different starting point, which yields similar results.) Measured frequencies as small as these certainly do not allow the application of the $\chi^2$ test. The measured frequencies are shown in Fig. 5. Obviously, based on these data it is not possible to decide a priori whether the frequencies are equidistributed.
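The counting step can be sketched as follows, assuming the mutations are given as 1-based residue positions. The function name and the example positions are hypothetical (only position 5 is taken from the P5L example); they are not the database entries used in the analysis.

```python
def count_per_interval(positions, sequence_length, interval_length=20):
    """Count mutation positions (1-based) in consecutive intervals of fixed length."""
    # Ceiling division, so that a trailing partial interval is kept.
    n_intervals = -(-sequence_length // interval_length)
    counts = [0] * n_intervals
    for pos in positions:
        if not 1 <= pos <= sequence_length:
            raise ValueError(f"position {pos} lies outside the sequence")
        counts[(pos - 1) // interval_length] += 1
    return counts

example_positions = [5, 20, 21, 1480]  # hypothetical positions for illustration
counts = count_per_interval(example_positions, sequence_length=1480)
print(len(counts), counts[0], counts[1], counts[-1])  # prints 74 2 1 1
```

Shifting the starting point of the subdivision corresponds to adding an offset to each position before the integer division.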
After processing the data as described above we obtain the rank ordered measured distribution (bars in Fig. 6). The full line shows the expected (theoretical) frequency distribution due to Eq. (5), which has been generated under the hypothesis that the positions of the point mutations are equidistributed. The two curves deviate significantly from each other; we therefore conclude that the mutations are not equidistributed. This conclusion agrees with the hypothesis in ref. [9].
Since the investigation of point mutations is an interesting field of application of the program, we developed a separate input mask for this purpose (case (4) of the list in the previous section). The input syntax for this mode is described in detail in the online help file of the program.
Recently, it has been shown for point mutations in the human androgen receptor (AR) that the severity of the disease correlates with the local sequence conservation [11]. Germline mutations in the AR gene lead to the androgen insensitivity syndrome (AIS). In addition, it was found that somatic point mutations associated with prostate cancer occur more frequently at locations with higher sequence variation than germline mutations leading to complete AIS. The related prediction method SIFT [10] has also been proposed recently. Both methods, SIFT and the method used in [11], are based on the alignment of a large number of related proteins. Inspired by these observations, we asked whether mutations in the androgen receptor are distributed randomly over the sequence, depending on their association with AIS or prostate cancer.

The disease-associated mutations in the AR were obtained from the AR gene mutation database [12]. Multiple mutations at identical positions were counted only once. Only mutations resulting in single amino acid substitutions were included in the analysis. The test was performed for 61 mutations associated with prostate cancer and 86 mutations found in patients with complete AIS. To perform the analysis we divided the sequence of 919 amino acids into 46 intervals of length 20 and counted the number of mutations in each interval. As expected, the results for the two datasets differ: cancer-associated mutations are more widely dispersed than congenital mutations found in patients with AIS. For mutations associated with prostate cancer the bar chart of the rank ordered frequencies nearly follows the theoretical curve for equal probabilities (Figs. 7, 8), whereas for AIS-associated mutations the bar chart deviates markedly from the theoretical curve.
Based on this finding we hypothesize that mutagenesis in the germline is followed by a selection process so that only a portion of the mutations are found in patients while others lead to early embryonal or fetal death. Conversely, mutations associated with prostate cancer may persist and are recorded.