 Methodology article
 Open Access
Estimating mutual information using B-spline functions – an improved similarity measure for analysing gene expression data
 Carsten O Daub^{1, 4},
 Ralf Steuer^{2},
 Joachim Selbig^{1} and
 Sebastian Kloska^{1, 3}
https://doi.org/10.1186/1471-2105-5-118
© Daub et al; licensee BioMed Central Ltd. 2004
Received: 15 December 2003
Accepted: 31 August 2004
Published: 31 August 2004
Abstract
Background
The information theoretic concept of mutual information provides a general framework to evaluate dependencies between variables. In the context of the clustering of genes with similar patterns of expression it has been suggested as a general quantity of similarity to extend commonly used linear measures. Since mutual information is defined in terms of discrete variables, its application to continuous data requires the use of binning procedures, which can lead to significant numerical errors for datasets of small or moderate size.
Results
In this work, we propose a method for the numerical estimation of mutual information from continuous data. We investigate the characteristic properties arising from the application of our algorithm and show that our approach outperforms commonly used algorithms: The significance, as a measure of the power of distinction from random correlation, is significantly increased. This concept is subsequently illustrated on two large-scale gene expression datasets and the results are compared to those obtained using other similarity measures.
C++ source code of our algorithm is available for non-commercial use from kloska@scienion.de upon request.
Conclusion
The utilisation of mutual information as a similarity measure enables the detection of nonlinear correlations in gene expression datasets. Frequently applied linear correlation measures, which are often used on an ad hoc basis without further justification, are thereby extended.
Keywords
 Mutual Information
 Surrogate Data
 Kernel Density Estimator
 Joint Entropy
 Spline Order
Background
The evaluation of the complex regulatory networks underlying molecular processes poses a major challenge to current research. With modern experimental methods in the field of gene expression, it is possible to monitor mRNA abundance for whole genomes [1, 2]. To elucidate the functional relationships inherent in these data, a commonly used approach is the clustering of co-expressed genes [3]. In this context, the choice of the similarity measure used for clustering, as well as the clustering method itself, is crucial for the results obtained. Often, linear similarity measures such as the Euclidean distance or Pearson correlation are used in an ad hoc manner. As a result, nonlinear correlations contained in a given dataset may be missed.
In this work, we discuss mutual information as a measure of similarity between variables. In the first section, we give a short introduction to the basic concepts, including a brief description of the commonly used approaches for numerical estimation from continuous data. In the following section, we then present an algorithm for estimating mutual information from finite data.
The properties arising from this approach are compared to previously existing algorithms. In subsequent sections, we then apply our concept to large-scale cDNA abundance datasets and determine whether these datasets can be sufficiently described using linear measures or whether a significant number of nonlinear correlations is missed.
Mutual information
Mutual information represents a general information theoretic approach to determine the statistical dependence between variables. The concept was initially developed for discrete data. For a system, A, with a finite set of M_{A} possible states {a_{1}, a_{2}, ... , a_{M_{A}}}, the Shannon entropy H(A) is defined as [17]
H(A) = -\sum_{i=1}^{M_{A}} p(a_{i}) \log p(a_{i}) (1)
where p(a_{i}) denotes the probability of the state a_{i}. The Shannon entropy is a measure for how evenly the states of A are distributed. The entropy of system A becomes zero if the outcome of a measurement of A is completely determined to be a_{j}, that is, if p(a_{j}) = 1 and p(a_{i}) = 0 for all i ≠ j, whereas the entropy becomes maximal if all probabilities are equal. The joint entropy H(A, B) of two systems A and B is defined analogously
H(A, B) = -\sum_{i=1}^{M_{A}} \sum_{j=1}^{M_{B}} p(a_{i}, b_{j}) \log p(a_{i}, b_{j}) (2)
This leads to the relation
H(A, B) ≤ H(A) + H(B) (3)
which fulfils equality only in the case of statistical independence of A and B. Mutual information MI(A, B) can be defined as [17]
MI(A, B) = H(A) + H(B) − H(A, B) ≥ 0 (4)
It is zero if A and B are statistically independent and increases as the statistical dependence between A and B becomes stronger.
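The definitions above can be checked numerically on small discrete distributions. The following is a minimal sketch, not part of the original algorithm: the function names `entropy` and `mutual_information` and the two 2 × 2 joint tables are illustrative assumptions, and logarithms are taken to base two.

```python
import math

def entropy(probs, base=2.0):
    """Shannon entropy of a discrete distribution; terms with p = 0 contribute 0."""
    return -sum(p * math.log(p, base) for p in probs if p > 0.0)

def mutual_information(joint, base=2.0):
    """MI(A, B) = H(A) + H(B) - H(A, B) for a joint probability table (list of rows)."""
    p_a = [sum(row) for row in joint]          # marginal of A: row sums
    p_b = [sum(col) for col in zip(*joint)]    # marginal of B: column sums
    h_ab = entropy([p for row in joint for p in row], base)
    return entropy(p_a, base) + entropy(p_b, base) - h_ab

# Hypothetical joint tables: statistically independent vs. fully dependent systems.
independent = [[0.25, 0.25], [0.25, 0.25]]
dependent = [[0.5, 0.0], [0.0, 0.5]]
mi_independent = mutual_information(independent)
mi_dependent = mutual_information(dependent)
```

For the independent table the joint entropy equals the sum of the marginal entropies, so the mutual information vanishes; for the fully dependent table it equals the marginal entropy of one bit.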
If mutual information is indeed to be used for the analysis of gene-expression data, the continuous experimental data need to be partitioned into discrete intervals, or bins. In the following section, we briefly review the established procedures; a description of how we have extended the basic approach will be provided in the subsequent section.
Estimates from continuous data
In the case of discrete data the estimation of the probabilities p(a_{i}) is straightforward. Many practical applications, however, supply continuous data for which the probability distributions are unknown and have to be estimated. In a widely used approach [7], the calculation of mutual information is based on the binning of data into M_{A} discrete intervals a_{i}, i = 1... M_{A}. For experimental data consisting of N measurements of a variable x_{u}, u = 1... N, an indicator function Θ_{i} counts the number of data points within each bin. The probabilities are then estimated based on the relative frequencies of occurrence
p(a_{i}) = \frac{1}{N} \sum_{u=1}^{N} \Theta_{i}(x_{u})
with
\Theta_{i}(x_{u}) = \begin{cases} 1 & \text{if } x_{u} \in a_{i} \\ 0 & \text{otherwise} \end{cases}
Results
Fuzzy mutual information
B-spline functions
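The first step in the definition of the B-spline functions is the definition of a knot vector t_{i} for a number of bins i = 1... M and one given spline order k = 1... M − 1 [22]
t_{i} = \begin{cases} 0 & \text{if } i < k \\ i - k + 1 & \text{if } k \le i \le M - 1 \\ M - 1 - k + 2 & \text{if } i > M - 1 \end{cases}
where the spline order determines the degree of the polynomial functions. The domain of the B-spline functions lies in the interval z ∈ [0, M − k + 1]. To cover the range of the variables, the new indicator function based on the B-spline functions needs to be linearly transformed to map their range. The recursive definition of the B-spline functions is as follows [22]
B_{i,1}(z) = \begin{cases} 1 & \text{if } t_{i} \le z < t_{i+1} \\ 0 & \text{otherwise} \end{cases}
B_{i,k}(z) = B_{i,k-1}(z) \frac{z - t_{i}}{t_{i+k-1} - t_{i}} + B_{i+1,k-1}(z) \frac{t_{i+k} - z}{t_{i+k} - t_{i+1}}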
An important property of B-spline functions is the implicit standardisation of coefficients: All weights belonging to one data point sum up to unity.
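The construction of the weighting coefficients can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the authors' implementation: it uses a clamped uniform knot vector on [0, M − k + 1] with a 0-based knot index, the standard Cox-de Boor recursion with zero-denominator terms dropped, and it closes the last interval on the right so that the largest value still receives weight. The names `knot_vector`, `bspline` and `weights` are hypothetical.

```python
def knot_vector(num_bins, order):
    """Knots t_0 .. t_{M+k-1} on [0, M - k + 1] (0-based index is an assumption)."""
    knots = []
    for i in range(num_bins + order):
        if i < order:
            knots.append(0.0)
        elif i <= num_bins - 1:
            knots.append(float(i - order + 1))
        else:
            knots.append(float(num_bins - order + 1))
    return knots

def bspline(i, order, z, knots):
    """Cox-de Boor recursion for B_{i,k}(z); zero-denominator terms are skipped."""
    if order == 1:
        if knots[i] <= z < knots[i + 1]:
            return 1.0
        # Close the last non-empty interval so the right end of the domain is covered.
        if z == knots[-1] and knots[i] < knots[i + 1] == knots[-1]:
            return 1.0
        return 0.0
    value = 0.0
    left = knots[i + order - 1] - knots[i]
    if left > 0.0:
        value += (z - knots[i]) / left * bspline(i, order - 1, z, knots)
    right = knots[i + order] - knots[i + 1]
    if right > 0.0:
        value += (knots[i + order] - z) / right * bspline(i + 1, order - 1, z, knots)
    return value

def weights(x, x_min, x_max, num_bins, order):
    """Linearly map x onto the spline domain and return its M weighting coefficients."""
    knots = knot_vector(num_bins, order)
    z = (x - x_min) / (x_max - x_min) * (num_bins - order + 1)
    return [bspline(i, order, z, knots) for i in range(num_bins)]

# A value at 20% of the range, M = 3 bins, spline order k = 2.
w = weights(0.2, 0.0, 1.0, num_bins=3, order=2)
```

With k = 1 each value receives a single weight of one, which recovers the indicator function of simple binning; for k > 1 a value contributes to up to k neighbouring bins, and the weights still sum to unity.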
Algorithm
Input
 - Variables x and y with values x_{u} and y_{u}, u = 1... N
 - Bins a_{i}, i = 1... M_{x} and b_{j}, j = 1... M_{y}
 - Spline order k
Output
 - Mutual information between variable x and y
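Algorithm
 1. Calculation of marginal entropy for variable x
 (a) Linearly transform the values x_{u} onto the domain of the B-spline functions, z ∈ [0, M_{x} − k + 1]
 (b) Determine the M_{x} weighting coefficients B_{i,k}(x_{u}) for each value x_{u}
 (c) Sum over all x_{u} and determine p(a_{i}) for each bin a_{i} from
 p(a_{i}) = \frac{1}{N} \sum_{u=1}^{N} B_{i,k}(x_{u})
 (d) Determine entropy H(x) according to Eq. (1)
 2. Calculation of joint entropy of two variables x and y
 (a) Apply steps 1 (a) and (b) to both variables x and y, independently
 (b) Calculate joint probabilities p(a_{i}, b_{j}) for all M_{x} × M_{y} bins according to
 p(a_{i}, b_{j}) = \frac{1}{N} \sum_{u=1}^{N} B_{i,k}(x_{u}) B_{j,k}(y_{u})
 (c) Determine the joint entropy H(x, y) from the joint probabilities
 3. Calculate the mutual information MI(x, y) according to Eq. (4)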
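The steps of the algorithm can be put together in a short sketch. It is an illustration under several assumptions, not the C++ code mentioned in the abstract: both variables use the same number of bins, each variable is mapped from its own observed minimum and maximum onto the spline domain, the knot index is 0-based, and the joint probability of a bin pair is taken as the mean product of the two weights. All function names are hypothetical.

```python
import math

def knot_vector(num_bins, order):
    """Clamped uniform knots t_0 .. t_{M+k-1} on [0, M - k + 1] (assumed indexing)."""
    last = float(num_bins - order + 1)
    return [0.0 if i < order else float(i - order + 1) if i <= num_bins - 1 else last
            for i in range(num_bins + order)]

def basis(i, order, z, knots):
    """Cox-de Boor recursion; terms with a zero denominator are skipped."""
    if order == 1:
        inside = knots[i] <= z < knots[i + 1]
        at_right_end = z == knots[-1] and knots[i] < knots[i + 1] == knots[-1]
        return 1.0 if inside or at_right_end else 0.0
    value = 0.0
    left = knots[i + order - 1] - knots[i]
    if left > 0.0:
        value += (z - knots[i]) / left * basis(i, order - 1, z, knots)
    right = knots[i + order] - knots[i + 1]
    if right > 0.0:
        value += (knots[i + order] - z) / right * basis(i + 1, order - 1, z, knots)
    return value

def weight_matrix(values, num_bins, order):
    """Steps 1 (a)-(b): map each value onto [0, M - k + 1], evaluate all M weights."""
    knots = knot_vector(num_bins, order)
    lo, hi = min(values), max(values)
    width = num_bins - order + 1
    return [[basis(i, order, (v - lo) / (hi - lo) * width, knots)
             for i in range(num_bins)] for v in values]

def entropy(probs):
    """Shannon entropy in bits; empty bins contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0.0)

def bspline_mi(x, y, num_bins, order):
    """Marginal and joint probabilities from B-spline weights, then MI = Hx + Hy - Hxy."""
    n = len(x)
    wx = weight_matrix(x, num_bins, order)
    wy = weight_matrix(y, num_bins, order)
    p_x = [sum(w[i] for w in wx) / n for i in range(num_bins)]
    p_y = [sum(w[j] for w in wy) / n for j in range(num_bins)]
    p_xy = [sum(a[i] * b[j] for a, b in zip(wx, wy)) / n
            for i in range(num_bins) for j in range(num_bins)]
    return entropy(p_x) + entropy(p_y) - entropy(p_xy)

# Small artificial variables, M = 3 bins, compared for spline orders 1 and 2.
x = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
y = [0.8, 1.0, 0.6, 0.4, 0.0, 0.2]
mi_k1 = bspline_mi(x, y, num_bins=3, order=1)
mi_k2 = bspline_mi(x, y, num_bins=3, order=2)
```

With order 1 the sketch reduces to simple binning. With order 2 the weight of each data point is shared between neighbouring bins, which spreads the joint distribution over more bins and lowers the estimate for this small sample.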
Example
We illustrate the estimation with the standard binning and with our approach on two artificial variables x = (0.0, 0.2, 0.4, 0.6, 0.8, 1.0) and y = (0.8, 1.0, 0.6, 0.4, 0.0, 0.2) for M = 3 bins, spline order k = 2, and logarithms to base two.
Simple binning
For both variables, each of the three histogram bins contains two values, so p(a_{1}) = p(a_{2}) = p(a_{3}) = 1/3, and analogously for p(b_{i}) due to the symmetry of the data, giving H(x) = H(y) = −3 · (1/3) log_{2}(1/3) = log_{2} 3 ≈ 1.58. For the calculation of the joint probability, three of the nine two-dimensional bins contain two values each, p(a_{1}, b_{3}) = p(a_{2}, b_{2}) = p(a_{3}, b_{1}) = 1/3, resulting in H(x, y) = log_{2} 3 and MI(x, y) = log_{2} 3.
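The simple-binning numbers can be reproduced with a few lines of code. This is a minimal sketch assuming equal-width bins spanning the observed range of each variable, with the largest value assigned to the last bin; the helper names `bin_index` and `binned_mi` are illustrative.

```python
import math

def bin_index(value, lo, hi, num_bins):
    """Equal-width bin index; the maximum value falls into the last bin."""
    return min(int((value - lo) / (hi - lo) * num_bins), num_bins - 1)

def binned_mi(x, y, num_bins):
    """Histogram estimate of MI in bits from relative frequencies of occurrence."""
    n = len(x)
    counts = [[0] * num_bins for _ in range(num_bins)]
    for xu, yu in zip(x, y):
        i = bin_index(xu, min(x), max(x), num_bins)
        j = bin_index(yu, min(y), max(y), num_bins)
        counts[i][j] += 1

    def entropy(cells):
        return -sum(c / n * math.log2(c / n) for c in cells if c > 0)

    h_x = entropy([sum(row) for row in counts])
    h_y = entropy([sum(col) for col in zip(*counts)])
    h_xy = entropy([c for row in counts for c in row])
    return h_x + h_y - h_xy, counts

x = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
y = [0.8, 1.0, 0.6, 0.4, 0.0, 0.2]
mi, counts = binned_mi(x, y, num_bins=3)
```

The count table has two entries in each of the bins (a_1, b_3), (a_2, b_2) and (a_3, b_1) and zeros elsewhere, so all three entropies equal log_2 3.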
B-spline approach
For the calculation of probabilities p(a_{i}) according to the B-spline approach, M_{x} weighting coefficients are determined for each value x_{u} of variable x.
            B_{i=1,k=2}(x_{u})   B_{i=2,k=2}(x_{u})   B_{i=3,k=2}(x_{u})
x_{1}       1.0                  0.0                  0.0
x_{2}       0.6                  0.4                  0.0
x_{3}       0.2                  0.8                  0.0
x_{4}       0.0                  0.8                  0.2
x_{5}       0.0                  0.4                  0.6
x_{6}       0.0                  0.0                  1.0
p(a_{i})    1.8/6                2.4/6                1.8/6
In the next sections, we discuss some of the properties arising from the utilisation of B-spline functions for the estimation of mutual information and compare our approach to other commonly used estimators. We support this discussion using examples for which the underlying distributions and thereby the true mutual information is known.
Size of data
It has been discussed elsewhere [25–28, 20] that the estimated mutual information is systematically overestimated for a finite size of N data points. For the simple binning approach, the mean observed mutual information can be calculated explicitly as the deviation from the true mutual information
More importantly, a similar result also holds for the standard deviation of mutual information. As shown in Figure 3 (right), the standard deviation of the mutual information estimated with the simple binning (k = 1) scales with 1/N for statistically independent events [26, 29]. For the B-spline approach (k = 3), this scaling still holds, but the average values are decreased significantly. For the KDE approach, an asymptotic run above the values from the B-spline approach is observed, again with a linear tail for large datasets. shows a linear scaling slightly below the simple binning.
The spline order
The interpretation of any results obtained from the application of mutual information to experimental data is based on testing whether the calculated results are consistent with a previously chosen null hypothesis. Following the intuitive approach that the null hypothesis assumes the statistical independence of variables, mutual information is tested against a surrogate dataset, which is consistent with this null hypothesis. As discussed previously in more detail [20], one way of generating such a surrogate dataset is by random permutations of the original data. From the mutual information of the original dataset MI(X, Y)^{data}, the average value obtained from surrogate data <MI(X^{surr}, Y^{surr})>, and its standard deviation σ^{surr}, the significance S can be formulated as
S = \frac{MI(X, Y)^{data} - \langle MI(X^{surr}, Y^{surr}) \rangle}{\sigma^{surr}} (13)
For each S the null hypothesis can be rejected to a certain level α depending on the underlying distribution. With increasing significance the probability of false positive associations drops.
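The surrogate test can be sketched in a few lines. The example below is illustrative only: it uses a simple-binning estimator as a stand-in for brevity, generates surrogates by shuffling one variable, and applies the test to hypothetical data with a purely nonlinear dependence (y = x²). The function names, the number of bins, the sample size and the seeds are assumptions of this sketch.

```python
import math
import random
import statistics

def binned_mi(x, y, bins=6):
    """Simple-binning MI in bits (stand-in estimator, chosen only for brevity)."""
    n = len(x)
    lo_x, hi_x, lo_y, hi_y = min(x), max(x), min(y), max(y)

    def index(v, lo, hi):
        return min(int((v - lo) / (hi - lo) * bins), bins - 1)

    joint = [[0.0] * bins for _ in range(bins)]
    for xu, yu in zip(x, y):
        joint[index(xu, lo_x, hi_x)][index(yu, lo_y, hi_y)] += 1.0 / n

    def entropy(ps):
        return -sum(p * math.log2(p) for p in ps if p > 0.0)

    p_x = [sum(row) for row in joint]
    p_y = [sum(col) for col in zip(*joint)]
    return entropy(p_x) + entropy(p_y) - entropy([p for row in joint for p in row])

def significance(x, y, n_surrogates=300, seed=0):
    """S = (MI_data - mean(MI_surr)) / std(MI_surr), surrogates by permuting y."""
    rng = random.Random(seed)
    mi_data = binned_mi(x, y)
    y_perm = list(y)
    surrogate_mi = []
    for _ in range(n_surrogates):
        rng.shuffle(y_perm)                      # destroys any dependence on x
        surrogate_mi.append(binned_mi(x, y_perm))
    return (mi_data - statistics.mean(surrogate_mi)) / statistics.stdev(surrogate_mi)

# Hypothetical test data: y depends on x, but not linearly.
rng = random.Random(1)
x = [rng.uniform(-1.0, 1.0) for _ in range(200)]
y = [v * v for v in x]
s_dependent = significance(x, y)
```

Shuffling preserves the marginal distributions of both variables, so the surrogate values reflect only the finite-size bias and scatter of the estimator under the null hypothesis of independence.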
From 300 shuffled realisations of this dataset, the mean and maximum mutual information are shown with the standard deviation as error bars. For all spline orders the null hypothesis can be rejected, in accordance with the dataset shown in Figure 1. To estimate the strength of the rejection, we calculate the significance according to Eq. (13) (Figure 4, right). It can be observed that the largest change in the significance of the mutual information occurs in the transition from k = 1 (simple boxes) to k = 2, with a roughly twofold increase. Using more sophisticated functions (k ≥ 3) does not further improve the significance. Similar findings have been reported in the context of kernel density estimators [19]. The major contribution to this increase in significance comes from the distribution of surrogate data, which becomes narrower for k > 1, leading to smaller standard deviations σ^{surr}.
Application on data
We now turn to the analysis of experimentally measured gene expression data. As shown previously, the application of mutual information to large-scale expression data reveals biologically relevant clusters of genes [7, 30]. In this section, we will not repeat these analyses, but determine if the correlations detected using mutual information are missed using the established linear measures.
Among the most frequently used measures of similarity for clustering co-expressed genes are the Euclidean distance and the Pearson correlation coefficient R [3]. If correlations are well described by the Pearson correlation and the distribution of data is approximately Gaussian-like, the relationship between the mutual information and the Pearson correlation is given by [32]
MI = -\frac{1}{2} \log(1 - R^{2})
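For a bivariate Gaussian distribution the mutual information has the known closed form MI = −(1/2) log(1 − R²), which gives a reference curve against which estimated values can be compared. The sketch below evaluates this relation; the function names `pearson` and `gaussian_mi` and the choice of base-two logarithms are assumptions made for illustration.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient R of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    s_yy = sum((b - mean_y) ** 2 for b in y)
    return s_xy / math.sqrt(s_xx * s_yy)

def gaussian_mi(r, base=2.0):
    """Mutual information expected for Gaussian data with correlation r, |r| < 1."""
    return -0.5 * math.log(1.0 - r * r, base)

# Reference values of the curve at a few correlation strengths.
curve = [(r, gaussian_mi(r)) for r in (0.0, 0.5, 0.9)]
```

The relation diverges as |R| approaches one, so the sketch is only defined for |R| < 1. Gene pairs whose estimated mutual information lies well above this curve at low |R| are candidates for nonlinear dependence.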
The second dataset contains cDNA measurements for E_{2} = 102 experiments on G_{2} = 22608 genes derived from 20 different human tissues [33]. In contrast to the first dataset, tuples with low Pearson correlation but high mutual information are indeed detected. For two tuples chosen as examples (Figure 8C and 8D), clusters of experimental conditions can be clearly detected by eye. Correlations of this type are missed by analyses based exclusively on linear measures, such as the analysis done in the original publication of this dataset.
Discussion and conclusion
After a brief introduction to the information theoretic concept of mutual information, we proposed a method for its estimation from continuous data. Within our approach, we extend the bins of the classical algorithm to polynomial B-spline functions: Data points are no longer assigned to exactly one bin but to several bins simultaneously, with weights given by the B-spline functions. By definition, the weighting coefficients for each data point automatically sum up to unity. Though our algorithm is reminiscent of kernel density estimators [18], it keeps the basic idea of associating data points with discrete bins. In this way, we are able to avoid the time-consuming numerical integration steps usually intrinsic to estimates of mutual information using kernel density estimators [20].
Our findings show that the first dataset is fairly well described by the given relation of the Pearson correlation to the mutual information. No data points with high mutual information and low Pearson correlation are detected. Comparisons of genes containing outlying values, however, result in deviations with low mutual information and high Pearson correlation. From this, it follows that previous analyses on this dataset, based on Pearson correlation, did not miss any nonlinear correlations. This is an important finding, since it is entirely conceivable that the regulations inherent in the genetic network under consideration might show more complex behaviour than the observed linear ones. Even for one of the largest expression datasets at hand, insufficient data might complicate the detection of such complex patterns of regulation. Alternatively, the biological mechanisms which underlie the regulatory networks might not lead to nonlinear correlations. It also has to be considered that the experimental methods applied for the generation of this dataset may make nonlinear correlations difficult to detect. The second dataset, in contrast, reveals highly significant tuples with high mutual information and low Pearson correlation. Detailed gene-gene plots of such tuples show that the expression values of the contributing genes fall into groups of experimental conditions. Without attempting to draw conclusions about the biological context of such clusters here, they might reflect interesting situations worth analysing in detail.
Declarations
Acknowledgements
The authors would like to thank Joachim Kopka and Janko Weise for stimulating discussions and Megan McKenzie for editing the manuscript (all of the MPIMP). RS acknowledges financial support by the HSPN grant of the state of Brandenburg.
Authors’ Affiliations
References
 1. Schena M, Shalon D, Davis RW, Brown PO: Quantitative Monitoring of Gene Expression Patterns with a Complementary DNA Microarray. Science 1995, 270: 467–470.
 2. Velculescu VE, Zhang L, Vogelstein B, Kinzler KW: Serial Analysis of Gene Expression. Science 1995, 270: 484–487.
 3. Eisen MB, Spellman PT, Brown PO, Botstein D: Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci USA 1998, 95: 14863–14868. 10.1073/pnas.95.25.14863
 4. D'haeseleer P, Weng X, Fuhrman S, Somogyi R: Information processing in cells and tissues. Plenum Publishing 1997, 203–212. [http://www.cs.unm.edu/~patrik/networks/IPCAT/ipcat.html]
 5. D'haeseleer P, Liang S, Somogyi R: Genetic network inference: from co-expression clustering to reverse engineering. Bioinformatics 2000, 16: 707–726. 10.1093/bioinformatics/16.8.707
 6. Michaels GS, Carr DB, Askenazi M, Fuhrmann S, Wen X, Somogyi R: Cluster analysis and data visualization of large-scale gene expression data. Pac Symp Biocomput 1998, 42–53.
 7. Butte AJ, Kohane IS: Mutual information relevance networks: functional genomic clustering using pairwise entropy measurements. Pac Symp Biocomput 2000, 5: 427–439.
 8. Herwig R, Poustka AJ, Muller C, Bull C, Lehrach H, O'brien J: Large-scale clustering of cDNA-fingerprinting data. Genome Res 1999, 9: 1093–1105. 10.1101/gr.9.11.1093
 9. Korber BT, Farber RM, Wolpert DH, Lapedes AS: Covariation of mutations in the V3 loop of human immunodeficiency virus type 1 envelope protein: An information theoretic analysis. Proc Natl Acad Sci USA 1993, 90: 7176–7180.
 10. Gorodkin J, Heyer LJ, Brunak S, Stormo GD, Wen X, Somogyi R: Display the information contents of structural RNA alignments: the structure logos. Comput Appl Biosci 1997, 13: 583–586.
 11. Liang S, Fuhrman S, Somogyi R: Reveal, a general reverse engineering algorithm for inference of genetic network architectures. Pac Symp Biocomput 1998, 18–29.
 12. Roberts S, Everson R: Independent component analysis: Principles and Practice. Cambridge: Cambridge University Press 2001.
 13. Hyvärinen A, Karhunen J, Oja E: Independent component analysis. New York: Wiley 2001.
 14. Fraser AM, Swinney HL: Independent coordinates for strange attractors from mutual information. Phys Rev A 1986, 33: 1134–1140. 10.1103/PhysRevA.33.1134
 15. Thévenaz P, Unser M: Optimization of mutual information for multiresolution image registration. IEEE Trans Image Processing 2000, 9: 2083–2099. 10.1109/83.887976
 16. Ellis DP, Bilmes JA: Using mutual information to design feature combinations. In Proceedings of the International Conference on Spoken Language Processing: Beijing 16–20 October 2000 [http://www.icsi.berkeley.edu/ftp/global/pub/speech/papers/icslp00cmi.pdf]
 17. Shannon CE: A mathematical theory of communication. The Bell System Technical Journal 1948, 27: 623–656.
 18. Moon Y, Rajagopalan B, Lall U: Estimation of mutual information using kernel density estimators. Phys Rev E 1995, 52: 2318–2321. 10.1103/PhysRevE.52.2318
 19. Silverman BW: Density estimation for statistics and data analysis. London: Chapman and Hall 1986.
 20. Steuer R, Kurths J, Daub CO, Weise J, Selbig J: The mutual information: detecting and evaluating dependencies between variables. Bioinformatics 2002, (Suppl. 2): S231–S240.
 21. Paninski L: Estimation of Entropy and Mutual Information. Neural Computation 2003, 15: 1191–1253. 10.1162/089976603321780272
 22. DeBoor C: A practical guide to splines. New York: Springer 1978.
 23. Unser M, Aldroubi A, Eden M: B-spline signal processing: Part 1 – Theory. IEEE Trans Signal Processing 1993, 41: 821–832. 10.1109/78.193220
 24. Unser M, Aldroubi A, Eden M: B-spline signal processing: Part 2 – Efficient design and applications. IEEE Trans Signal Processing 1993, 41: 834–848. 10.1109/78.193221
 25. Herzel H, Schmidt AO, Ebeling W: Finite sample effects in sequence analysis. Chaos, Solitons & Fractals 1994, 4: 97–113. 10.1016/0960-0779(94)90020-5
 26. Herzel H, Grosse I: Measuring correlations in symbol sequences. Physica A 1995, 216: 518–542. 10.1016/0378-4371(95)00104-F
 27. Grosse I: Estimating entropies from finite samples. In Dynamik, Evolution, Strukturen (Edited by: Freund JA). Berlin: Dr. Köster 1996, 181–190.
 28. Roulston MS: Estimating the error on measured entropy and mutual information. Physica D 1999, 125: 285–294. 10.1016/S0167-2789(98)00269-3
 29. Herzel H, Grosse I: Correlations in DNA sequences: The role of protein coding segments. Phys Rev E 1997, 55: 800–810. 10.1103/PhysRevE.55.800
 30. Klus GT, Song A, Schick A, Wahde M, Szallasi Z: Mutual Information Analysis as a Tool to Assess the Role of Aneuploidy in the Generation of Cancer-Associated Differential Gene Expression Patterns. Pac Symp Biocomput 2001, 42–51.
 31. Hughes TR, Marton MJ, Jones AR, Roberts CJ, Stoughton R, Armour CD, Bennett HA, Coffey HA, Dai H, He YD, Kidd MJ, King AM, Meyer MR, Slade D, Lum PY, Stepaniants SB, Shoemaker DD, Gachotte D, Chakraburtty K, Simon J, Bard M, Friend SH: Functional Discovery via a Compendium of Expression Profiles. Cell 2000, 102: 109–126. 10.1016/S0092-8674(00)00015-5
 32. Steuer R, Daub CO, Selbig J, Kurths J: Measuring distances between variables by mutual information. In Proceedings of the 27th Annual Conference of the Gesellschaft für Klassifikation: Cottbus, 12–14 March 2003, in press.
 33. He YD, Dai H, Schadt EE, Cavet G, Edwards SW, Stepaniants SB, Duenwald S, Kleinhanz R, Jones AR, Shoemaker DD, Stoughton RB: Microarray standard data set and figures of merit for comparing data processing methods and experiment design. Bioinformatics 2003, 19: 956–965. 10.1093/bioinformatics/btg126
Copyright
This article is published under license to BioMed Central Ltd. This is an open-access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.