Online tool for the discrimination of equidistributions
 Thorsten Pöschel^{1},
 Cornelius Frömmel^{1} and
 Christoph Gille^{1}
DOI: 10.1186/1471-2105-4-58
© Pöschel et al; licensee BioMed Central Ltd. 2003
Received: 19 June 2003
Accepted: 21 November 2003
Published: 21 November 2003
Abstract
Background
For many applications one wishes to decide whether a certain set of numbers originates from an equiprobability distribution or whether they are unequally distributed. Distributions of relative frequencies may deviate significantly from the corresponding probability distributions due to finite sample effects. Hence, it is not trivial to discriminate between an equiprobability distribution and nonequally distributed probabilities when knowing only frequencies.
Results
Based on analytical results we provide a software tool which allows one to decide whether data correspond to an equiprobability distribution. The tool is available at http://bioinf.charite.de/equifreq/.
Conclusions
Its application is demonstrated for the distribution of point mutations in coding genes.
Background
Assume a set of certain events occurs with frequencies M_{i}, i = 1...N, with Σ_{i} M_{i} = M, e.g., M_{i} = {4, 5, 2, 3, 2, 9, 3, 3, 5, 12, 4, 6, 4, ...}. We ask whether the events obey an equiprobability distribution p_{i} ≡ 1/N. According to the general definition of probabilities,
$$p_{i} = \lim_{M \to \infty} \frac{M_{i}}{M} \tag{1}$$
If we naïvely infer the probabilities from the observed frequencies, i.e., if we assume p_{i}/p_{j} = M_{i}/M_{j}, then in the extreme case M_{100} = 3 we end up with a relative error of 70%. In other words, from the frequencies measured in an experiment as shown in Figs. 1 and 2, it might be erroneously concluded that the events are strongly nonequally distributed.
Using the methods of statistics we can generate (predict) the rank ordered frequency distribution for given N and M under the precondition that the events are equidistributed [1]. The predicted frequency distribution can then be compared with the distribution as measured in an experiment with the same values of M and N. From the comparison it can be judged whether the events in the experiment obey an equidistribution.
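The comparison described above can be illustrated with a small simulation. The sketch below does not use the analytical prediction of [1]; as a stand-in it estimates the average rank ordered frequencies by Monte Carlo sampling. The function names, the number of trials and the seed are assumptions made for this illustration only.

```python
import random


def rank_ordered_frequencies(n_events, n_draws, rng):
    """Draw n_draws events uniformly from n_events types and
    return the observed counts sorted in descending order."""
    counts = [0] * n_events
    for _ in range(n_draws):
        counts[rng.randrange(n_events)] += 1
    return sorted(counts, reverse=True)


def mean_rank_profile(n_events, n_draws, trials=200, seed=0):
    """Average the rank ordered counts over many simulated experiments
    (a Monte Carlo estimate, not the closed-form result of [1])."""
    rng = random.Random(seed)
    totals = [0] * n_events
    for _ in range(trials):
        ranked = rank_ordered_frequencies(n_events, n_draws, rng)
        for rank, count in enumerate(ranked):
            totals[rank] += count
    return [total / trials for total in totals]


if __name__ == "__main__":
    # N = 20 event types, M = 100 observations: every type has p = 1/20,
    # yet the rank ordered counts are far from a flat line at 5.
    profile = mean_rank_profile(n_events=20, n_draws=100)
    print("average count at rank 1: ", profile[0])
    print("average count at rank 20:", profile[-1])
```

Even though every event type is equally probable, the most frequent type is observed clearly more often than M/N times and the least frequent one clearly less often, which is the finite sample effect the text warns about.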
Following this procedure we describe a tool which helps to decide whether a given set of frequencies complies with an equidistribution. For demonstration the tool is applied to the distribution of point mutations in human genes.
Implementation
The numerical tool is available via the web address http://bioinf.charite.de/equifreq/. The underlying kernel program which computes the most probable frequency distribution is implemented in C++ and the user interface is written in PHP. The program source is available at this address.
Results and Discussion
Mathematical method
We want to sketch briefly the derivation of the basic formula: Assume we distribute M balls over N urns according to an equidistribution. The probability p(k_{ i }, i) to find k_{ i }urns filled each with exactly i balls is given by
Note that the probability to find k_{i} urns which contain exactly i balls is different from the probability to find a given number of urns which contain at least i balls. The latter is a simple textbook problem, whereas the derivation of Eq. (2) requires quite involved algebra. The relation between both probabilities is provided by the exclusion-inclusion principle [2, 3]. For our purpose we need the number of urns filled with i balls which are found on average, i.e., we need the first moments of the probabilities Eq. (2). These values can be found in closed form by applying the method of generating functions for the descending factorial moments. The averages were derived earlier in a different context; the details of the derivation can be found in [4, 1]:
$$\langle k_{i} \rangle = N \binom{M}{i} \left(\frac{1}{N}\right)^{i} \left(1 - \frac{1}{N}\right)^{M-i} \tag{3}$$
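The average occupation numbers can also be obtained from an elementary argument, which the following sketch uses: a single urn receives exactly i of the M balls with binomial probability C(M, i) (1/N)^i (1 − 1/N)^(M−i), and by linearity of expectation the average number of such urns is N times this value. The function name is an assumption, and the logarithmic evaluation is only a numerical convenience for large M; this is not the kernel program of the tool.

```python
import math


def expected_urns_with_i_balls(n_urns, n_balls, i):
    """Average number of urns holding exactly i balls when n_balls balls
    are thrown independently and uniformly into n_urns urns."""
    if not 0 <= i <= n_balls:
        return 0.0
    if n_urns == 1:
        # A single urn always holds all balls.
        return 1.0 if i == n_balls else 0.0
    # log of the binomial coefficient C(n_balls, i), via log-gamma
    log_binom = (math.lgamma(n_balls + 1)
                 - math.lgamma(i + 1)
                 - math.lgamma(n_balls - i + 1))
    # log of (1/N)^i * (1 - 1/N)^(M - i)
    log_prob = (-i * math.log(n_urns)
                + (n_balls - i) * math.log1p(-1.0 / n_urns))
    return n_urns * math.exp(log_binom + log_prob)


if __name__ == "__main__":
    for i in range(6):
        print(i, expected_urns_with_i_balls(10, 20, i))
```

Two quick consistency checks: summing the averages over all i gives the number of urns N, and summing i times the averages gives the number of balls M.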
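As an interesting detail of the solution, the average number of filled urns is given by the total number of urns minus the average number of empty ones, N* = N − ⟨k_{0}⟩, where ⟨k_{0}⟩ denotes the average number of empty urns, i.e. [1],
$$N^{*} = N \left[ 1 - \left( 1 - \frac{1}{N} \right)^{M} \right] \tag{4}$$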
Obviously, for small M (number of balls) a significant number of urns stay empty on average. Translated back into the language of biology, this yields a surprising result. Consider a population of N = 1000 species. If we investigate M = 5000 individuals, Eq. (4) gives N* ≈ 993.3, i.e., about 7 species are never found, although from naïve reasoning one expects each species to occur about 5 times.
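The numbers in this example can be checked directly. A given species is missed by all M individuals with probability (1 − 1/N)^M, so the average number of species found is N[1 − (1 − 1/N)^M]. The sketch below evaluates this expression and compares it with a small simulation; the function names, the number of trials and the seed are assumptions chosen for illustration.

```python
import random


def expected_filled_urns(n_urns, n_balls):
    """Average number of non-empty urns: N * (1 - (1 - 1/N)^M)."""
    return n_urns * (1.0 - (1.0 - 1.0 / n_urns) ** n_balls)


def simulated_filled_urns(n_urns, n_balls, trials=200, seed=1):
    """Monte Carlo estimate of the same quantity."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        # The set keeps each urn (species) that was hit at least once.
        total += len({rng.randrange(n_urns) for _ in range(n_balls)})
    return total / trials


if __name__ == "__main__":
    # N = 1000 species, M = 5000 individuals, as in the text.
    print(f"{expected_filled_urns(1000, 5000):.1f}")  # 993.3
    print(f"{simulated_filled_urns(1000, 5000):.1f}")
```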
Exploration of experimental data
The tool offers four input masks:
(1) The measured frequencies M_{i} of each species are given directly.
(2) The number of species N and the total number of individuals M are specified. Each individual is assigned a species by chance.
(3) As for (2), the rank ordered frequencies are computed, but with the generalization that each species is assigned an individual probability. The theoretical basis for this computation is not given here but will be published elsewhere [5].
(4) The last input mask is intended for the investigation of the spatial distribution of point mutations in genes, which is presently the most specialized application of the described program.
The program computes the expected frequency distribution according to Eq. (5) under the assumption that the species obey an equiprobability distribution. Three output files are generated: freq, ktheo and kexp. The file freq contains the rank ordered frequencies, either as generated from the input data set (cases (1) and (4)) or as drawn randomly from an equiprobability distribution (case (2)) or from a general distribution (case (3)). The file ktheo contains the moments for each i, i.e. the expected number of species occurring exactly i times, according to Eq. (3) for the given numbers N of species and M of individuals. For cases (1) and (4) the values of M and N are extracted from the input data; for (2) and (3) they are provided by the user. (Note that these expectation values are real numbers in general.) The third column of line i contains the value . The last file, kexp, contains the same quantities as ktheo, but based on the input data (cases (1) and (4)) or on the randomly generated data (cases (2) and (3)), respectively. Besides these output files the program generates a number of visualizations (see section Example: Distribution of point mutations in genes). To compare the experimental data with the mathematical prediction, both are plotted in the same chart. Congruence of the two curves indicates that the experimental data obey an equidistribution (case (2)) or the specified distribution (case (3)), respectively.
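To make the two data-derived quantities concrete, the following sketch computes them in memory for input of type (1): the rank ordered frequencies and the observed occupation numbers (how many species occur exactly i times). It illustrates the quantities only; the actual layout of the files freq and kexp is not reproduced here, and the function names are assumptions. The sample input is the listed beginning of the example frequencies from the Background section.

```python
from collections import Counter


def rank_ordered(frequencies):
    """Frequencies sorted from the most to the least frequent species."""
    return sorted(frequencies, reverse=True)


def occupation_numbers(frequencies):
    """Map i -> number of species that were observed exactly i times."""
    return dict(sorted(Counter(frequencies).items()))


if __name__ == "__main__":
    # Listed part of the example frequencies M_i (the original list continues).
    observed = [4, 5, 2, 3, 2, 9, 3, 3, 5, 12, 4, 6, 4]
    print("N =", len(observed), " M =", sum(observed))
    print("rank ordered:", rank_ordered(observed))
    print("occupation numbers:", occupation_numbers(observed))
```

Species that were never observed have to be included with frequency 0 if N is to be read off from the length of the list.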
It may occur that the curve of the rank ordered experimental data decays significantly more slowly than the corresponding theoretical curve according to Eq. (5). Since there is no distribution more homogeneous than the equidistribution, this situation may occur as a rare fluctuation (recall that the theoretical curve was generated from the averaged occupation numbers, Eq. (3)). In such cases there is no probability distribution {p_{i}} which reproduces the experiment on average. This case can be evoked artificially when the species in the input file occur with almost identical frequencies.
The difference between the experimental rank ordered frequency distribution and the corresponding theoretical distribution (Eq. (5)) quantifies the degree of agreement of the input data with an equidistribution (case (1)) or with a specified distribution (case (3)). We define the score by
The significance of a particular difference score can be assessed by relating it to the distribution of difference scores. This distribution depends on M and N.
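One way to relate a score to the distribution of scores for given M and N is a Monte Carlo estimate, sketched below. Two assumptions are made loudly here: the score is a hypothetical stand-in (the summed absolute difference between rank ordered counts and a reference profile), not the score defined by the tool, and the reference profile is estimated by simulation rather than taken from the analytical formula. All names and parameters are illustrative.

```python
import random


def ranked_sample(n_species, n_individuals, rng):
    """One simulated experiment under equidistribution, rank ordered."""
    counts = [0] * n_species
    for _ in range(n_individuals):
        counts[rng.randrange(n_species)] += 1
    return sorted(counts, reverse=True)


def difference_score(ranked, reference):
    """Hypothetical stand-in score: summed absolute difference per rank."""
    return sum(abs(a - b) for a, b in zip(ranked, reference))


def empirical_p_value(observed, trials=500, seed=0):
    """Return (score, p): the score of the observed frequencies and the
    fraction of simulated equidistributed samples scoring at least as high."""
    rng = random.Random(seed)
    n_species, n_individuals = len(observed), sum(observed)
    samples = [ranked_sample(n_species, n_individuals, rng)
               for _ in range(trials)]
    # Reference profile: mean simulated count at each rank.
    reference = [sum(column) / trials for column in zip(*samples)]
    score = difference_score(sorted(observed, reverse=True), reference)
    exceed = sum(1 for s in samples
                 if difference_score(s, reference) >= score)
    # The +1 terms keep the estimate strictly positive.
    return score, (exceed + 1) / (trials + 1)


if __name__ == "__main__":
    # Strongly non-uniform toy data: N = 10 species, M = 100 individuals.
    skewed = [91] + [1] * 9
    print(empirical_p_value(skewed))
```

The list of observed frequencies must contain an entry (possibly 0) for every species, since both N and M are derived from it. A small p value suggests that the data are unlikely to stem from an equidistribution.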
Example: Distribution of point mutations in genes
The increasing number of known point mutations and polymorphisms in many genes coding for pathogenetically important proteins offers the opportunity to apply statistical tests to correlate their type and location to evolutionary, biological and clinical features.
Mutations of the genome occur in each replication generation, but they frequently remain unnoticed since they do not cause diseases. These so-called polymorphisms or variants may occur either in regions of the genome which code for amino acid sequences or in noncoding segments. Those changes of the DNA sequence that alter the amino acid sequence are frequently associated with diseases because the respective proteins cannot operate properly. Screenings for mutations using DNA of patients have been performed for many human diseases and the identified mutations are accessible in mutation databases [6].
The detection of so-called mutation hot spots, i.e. sequence regions with many mutation positions, is important for the identification of the functional and genetic properties of the genetic code [7]. These hot spots must be distinguished from statistical fluctuations that occur even when the probabilities for mutations are identical for each residue position. Moreover, the spatial distribution of point mutations in genes is of importance for the localization of coding and noncoding parts in the genome.
We wish to apply the described method to the investigation of the amino acid sequence of the cystic fibrosis transmembrane conductance regulator. The unperturbed gene (wild type) is given as a sequence of 1480 letters: MQRSPLEKASVVSKLFFSWTRPILRKGYRQRLELSDIYQIPSVDSADNLSEKLER..., each standing for one amino acid [8]. A large number of mutations, i.e., deviations from this sequence, have been observed in experiments. Such mutations are available from databases, e.g. [6].
The codes on top of the underbraces stand for the mutations found, e.g., P5L means that at position 5 the amino acid proline (P) was found to be replaced by leucine (L).
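Mutation codes of this kind can be turned into per-position frequencies as sketched below. This is an illustration only: it does not reproduce the input syntax of the tool (which is documented in its online help), the function names are assumptions, only the beginning of the wild type sequence quoted above is used, and apart from P5L the mutation codes are invented for the example and not taken from any database.

```python
import re
from collections import Counter

# Beginning of the wild type sequence quoted in the text (1480 letters in full).
WILD_TYPE_PREFIX = "MQRSPLEKASVVSKLFFSWTRPILRKGYRQRLELSDIYQIPSVDSADNLSEKLER"

# Wild type letter, 1-based position, mutant letter, e.g. "P5L" or "P 5L".
MUTATION_PATTERN = re.compile(r"^([A-Z])\s*(\d+)\s*([A-Z])$")


def parse_mutation(code):
    """Split a code such as 'P5L' into ('P', 5, 'L')."""
    match = MUTATION_PATTERN.match(code.strip())
    if match is None:
        raise ValueError(f"not a point mutation code: {code!r}")
    wild, position, mutant = match.groups()
    return wild, int(position), mutant


def mutations_per_position(codes, wild_type):
    """Count the listed mutation codes at each sequence position."""
    counts = Counter()
    for code in codes:
        wild, position, _mutant = parse_mutation(code)
        if not 1 <= position <= len(wild_type):
            raise ValueError(f"position out of range: {code!r}")
        if wild_type[position - 1] != wild:
            raise ValueError(f"wild type mismatch: {code!r}")
        counts[position] += 1
    # One entry per residue, including positions without any mutation.
    return [counts[p] for p in range(1, len(wild_type) + 1)]


if __name__ == "__main__":
    # "P5L" is the example from the text; the other two codes are invented.
    codes = ["P5L", "P5S", "S13F"]
    frequencies = mutations_per_position(codes, WILD_TYPE_PREFIX)
    print(frequencies[:15])
```

The resulting list plays the role of the frequencies M_{i}, with N equal to the number of residue positions and M equal to the number of listed mutations.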
Since the investigation of point mutations is an interesting field of application of the program, we developed a separate input mask for this purpose (case (4) of the list in the previous section). The input syntax for this mode is described in detail in the online help file of the program.
Conclusions
For small sample sizes the relative frequencies M_{ i }/M of occurrence of individuals of a certain species i deviate significantly from the probabilities of occurrence p_{ i }. With the assumption that the N species occur with equal probability p_{ i }= 1/N the expectation values of the numbers of events which are contained j times (j = 0,..., M) in a sample of M individuals can be determined based on combinatorial algebra. These expectation values allow for a prediction of the rank ordered frequency distribution.
For many practical problems the amount of available data is insufficient to employ standard tests, such as χ^{2}, to discriminate whether or not a certain set of events complies with an equiprobability distribution. For such situations which occur frequently in the biological sciences we have developed an online tool which is available at http://bioinf.charite.de/equifreq/. As demonstrated for the case of point mutations in the sequence of amino acids of the cystic fibrosis transmembrane conductance regulator and the androgen receptor, even for sample set sizes which are certainly not sufficient to decide this question directly from the observed frequencies (see Figs. 3, 4) this tool helps to make a reliable statement.
The proposed method may be generalized to arbitrary probability distributions provided there exists a hypothesis on the functional form of the distribution [13]. For mathematical reasons (see [5]), however, it is more difficult to derive a formula equivalent to Eq. (5) for nonequiprobability distributions; this is the subject of current research.
Availability and requirements
Project name: equifreq
Project home page: http://bioinf.charite.de/equifreq/
Operating systems: platform independent
Programming language: C++
Other requirements: none
License: GNU GPL
Any restrictions to use by non-academics: none
Acknowledgments
The authors are grateful to W. Ebeling, J. Freund and R. Mrowka for helpful discussion. We thank the reviewers for their helpful remarks and recommendations. Particularly fruitful was the analysis of mutations in the androgen receptor.
References
1. Pöschel T, Freund JA: Finite-sample frequency distributions originating from an equiprobability distribution. Physical Review E 2002, 66: 026103. 10.1103/PhysRevE.66.026103
2. von Mises R: Über Aufteilungs- und Besetzungswahrscheinlichkeiten. Revue de la Faculté de Sciences de l'Université d'Istanbul 1939, 4: 145–163.
3. Johnson NL, Kotz S: Urn Models and Their Application. New York: Wiley 1977.
4. Freund JA, Pöschel T: A statistical approach to vehicular traffic. Physica A 1995, 219: 95–114. 10.1016/0378-4371(95)00170-C
5. Pöschel T, Ebeling W, Frömmel C, Ramírez R: Correction algorithm for finite sample statistics. European Physical Journal E, in press.
6. Cotton RG, Horaitis O: The HUGO mutation database initiative. Human genome organization. Pharmacogenomics 2002, 2: 16–19. 10.1038/sj.tpj.6500070
7. Walker DR, Bond JP, Tarone RE, Harris CC, Makalowski W, Boguski MS, Greenblatt MS: Evolutionary conservation and somatic mutation hotspot maps of p53: correlation with p53 protein structural and functional features. Oncogene 1999, 7: 211–218. 10.1038/sj.onc.1202298
8. Zielenski J, Rozmahel R, Bozon D, Kerem B, Grzelczak Z, Riordan JR, Rommens J, Tsui LC: Genomic DNA sequence of the cystic fibrosis transmembrane conductance regulator (CFTR) gene. Genomics 1991, 10: 214–228. (see entry CFTR_HUMAN in the SWISSPROT database).
9. Rommens JM, Iannuzzi MC, Kerem B, Drumm ML, Melmer G, Dean M, Rozmahel R, Cole JL, Kennedy D, Hidaka N, Zsiga M, Buchwald M, Riordan JR, Tsui LC, Collins FS: Identification of the cystic fibrosis gene: chromosome walking and jumping. Science 1989, 245: 1059–1065.
10. Ng PC, Henikoff S: SIFT: Predicting amino acid changes that affect protein function. Nucleic Acids Research 2003, 13: 3812–3814. 10.1093/nar/gkg509
11. Mooney SD, Klein TE, Altman RB, Trifiro MA, Gottlieb B: A functional analysis of disease-associated mutations in the androgen receptor gene. Nucleic Acids Research 2003, 31: e42. 10.1093/nar/gng042
12. Gottlieb B, Lehvaslaiho H, Beitel LK, Lumbroso R, Pinsky L, Trifiro M: The Androgen Receptor Gene Mutations Database. Nucleic Acids Research 1998, 26: 234–238. 10.1093/nar/26.1.234
13. Pöschel T, Ebeling W, Rosé H: Guessing probability distributions from small samples. Journal of Statistical Physics 1995, 80: 1443–1452.
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article: verbatim copying and redistribution of this article are permitted in all media for any purpose, provided this notice is preserved along with the article's original URL.