
Mutual information estimation reveals global associations between stimuli and biological processes



Although microarray gene expression analysis has become popular, it remains difficult to interpret the biological changes caused by stimuli or variation of conditions. Clustering genes and associating each cluster with biological functions are commonly used methods. However, such methods only detect partial changes within cell processes. Herein, we propose a method for discovering global changes within a cell by associating observed conditions of gene expression with gene functions.


To elucidate the association, we introduce a novel feature selection method called Least-Squares Mutual Information (LSMI), which computes mutual information without density estimation; LSMI can therefore detect nonlinear associations within a cell. We demonstrate the effectiveness of LSMI through comparison with existing methods. The results of the application to yeast microarray datasets reveal that non-natural stimuli affect various biological processes, whereas others have no significant relation to specific cell processes. Furthermore, we discover that biological processes can be categorized into four types according to their responses to various stimuli: DNA/RNA metabolism, gene expression, protein metabolism, and protein localization.


We proposed a novel feature selection method called LSMI, and applied LSMI to mining the association between conditions of yeast and biological processes through microarray datasets. In fact, LSMI allows us to elucidate the global organization of cellular process control.


Advances in microarray technologies enable us to explore the comprehensive dynamics of transcription within a cell. The current problem is to extract useful information from a massive dataset. The primarily used approach is clustering. Cluster analysis reveals variations of gene expression and reduces the complexity of large datasets. However, additional methods are necessary to associate genes in each cluster with genetic function using GO term finder [1], or to understand stimuli related to specific cellular status.

However, these clustering-association strategies cannot detect global changes of cell status because the genes are divided into clusters. Some stimuli activate a specific pathway, whereas others might change overall cellular processes. To understand the effect of stimuli on cellular processes directly, we introduce in this paper a novel feature selection method called Least-Squares Mutual Information (LSMI), which selects features using mutual information without density estimation. Mutual information has been utilized to measure distances between gene expressions [2]. To compute the mutual information in existing methods, density estimation or discretization is required. However, density estimation for gene expression is difficult because we have little knowledge about the density function of gene expression profiles. LSMI offers an analytic-form solution and avoids the estimation.

Feature selection techniques are often used in gene expression analysis [3]. Actually, LSMI has three advantages compared to existing methods: capability of avoiding density estimation which is known to be a hard problem [4], availability of model selection, and freedom from a strong model assumption. To evaluate the reliability of ranked features using LSMI, we compare receiver operating characteristic (ROC) curves [5] to those of existing methods: kernel density estimation (KDE) [6, 7], k-nearest neighbor (KNN) [8], Edgeworth expansion (EDGE) [9], and Pearson correlation coefficient (PCC). Thereby, we certify that our method has better performance than the existing methods in prediction of gene functions about biological processes. This fact implies that features selected using our method reflect biological processes.

Using the ranked features, we illustrate the associations between stimuli and biological processes according to gene expressions. Results show that stimuli damage essential processes within a cell, causing association with some cellular processes. From the response to stimuli, biological processes are divisible into four categories: DNA/RNA metabolic processes, gene expression, protein metabolic processes, and protein localization.


Approach – mutual information detection

In this study, we detect underlying dependencies between gene expressions obtained by groups of stimuli and gene functions. Such dependencies are studied in various machine learning problems such as feature selection [10, 11] and independent component analysis [12]. Although classical correlation analysis would be useful for these problems, it cannot detect nonlinear dependencies with no correlation. On the other hand, mutual information (MI), which plays an important role in information theory [13], enables us to detect general nonlinear dependencies. Let x and y be a set of gene expressions and a set of known gene functions, respectively. A variant of MI based on the squared loss is defined by

$$
I_s(X, Y) := \iint \left( \frac{p_{xy}(x, y)}{p_x(x)\, p_y(y)} - 1 \right)^2 p_x(x)\, p_y(y)\, dx\, dy. \tag{1}
$$

Note that $I_s$ vanishes if and only if x and y are independent. The use of MI therefore allows us to detect stimuli that have no association with a specific gene function or process.
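As a small illustration of this definition, the sketch below evaluates a discrete analogue of squared-loss MI, in which the integrals are replaced by sums over a finite joint probability table. The discrete setting, the function name `squared_loss_mi`, and the two toy tables are assumptions made for this example only; they are not part of the method described here.

```python
from itertools import product


def squared_loss_mi(joint):
    """Discrete analogue of squared-loss MI for a joint probability table.

    joint[i][j] is P(X = i, Y = j); the entries are assumed to sum to one.
    """
    px = [sum(row) for row in joint]          # marginal of X
    py = [sum(col) for col in zip(*joint)]    # marginal of Y
    total = 0.0
    for i, j in product(range(len(px)), range(len(py))):
        if px[i] > 0 and py[j] > 0:
            ratio = joint[i][j] / (px[i] * py[j])
            total += (ratio - 1.0) ** 2 * px[i] * py[j]
    return total


# Independent case: the table is the outer product of (0.4, 0.6) and (0.3, 0.7).
independent = [[0.12, 0.28], [0.18, 0.42]]
# Fully dependent case: X determines Y.
dependent = [[0.5, 0.0], [0.0, 0.5]]

smi_independent = squared_loss_mi(independent)  # numerically zero
smi_dependent = squared_loss_mi(dependent)      # strictly positive
```

In the independent table every ratio equals one, so each term of the sum is zero, which mirrors the "if and only if" statement above.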

Estimating MI is known to be a difficult problem in practice [8, 9, 11]. Herein, we propose LSMI, which does not involve density estimation but directly models the density ratio:

$$
w(x, y) := \frac{p_{xy}(x, y)}{p_x(x)\, p_y(y)}. \tag{2}
$$

Given a density ratio estimator $\hat{w}(x, y)$, squared-loss MI can be simply estimated by

$$
\hat{I}_s(X, Y) = \frac{1}{n^2} \sum_{i,j=1}^{n} \left( \hat{w}(x_i, y_j) - 1 \right)^2.
$$

Mathematical definitions related to LSMI are provided in the Methods section. LSMI offers an analytic-form solution, which allows us to estimate MI in a computationally very efficient manner. It is noteworthy that x can be a multi-dimensional vector. LSMI can therefore handle a group of stimuli jointly, whereas generic correlation indices such as the Pearson correlation are calculated independently between each parameter and the target value. Therefore, we can elucidate which types of stimuli have no dependency on biological processes using LSMI.

Datasets and feature selection

In this section, we first prepare datasets to show the association between stimuli and biological process, and introduce feature selection using the datasets.

Biological process

We compute mutual information between gene expression values grouped by stimuli and class of genes' biological processes. As the class, we use biological process terms in Gene Ontology (GO) categorization [14]. We select GO terms associated with more than 800 and less than 2,000 genes because terms having a small number of genes only describe a fraction of the cell status, whereas terms having a large number of genes indicate functions associated with almost all genes in yeast. Actually, GO has a directed acyclic graph (DAG) structure, and each term has child terms. The GO terms are classified into three categories; we use only biological process terms to identify the changes within a cell. Using this method, we select 12 GO terms.
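The size-based selection of GO terms described above can be sketched as a simple filter. The mapping `term_to_genes`, the function name, and the toy term identifiers are hypothetical; only the thresholds (more than 800 and fewer than 2,000 genes) come from the text.

```python
def select_go_terms(term_to_genes, lower=800, upper=2000):
    """Keep terms annotated with more than `lower` and fewer than `upper` genes."""
    return sorted(
        term
        for term, genes in term_to_genes.items()
        if lower < len(set(genes)) < upper
    )


# Toy annotation table with made-up term names and gene identifiers.
toy_annotations = {
    "TERM_SMALL": [f"G{i}" for i in range(500)],    # too specific
    "TERM_MID": [f"G{i}" for i in range(1200)],     # kept
    "TERM_LARGE": [f"G{i}" for i in range(5000)],   # too general
}
selected = select_go_terms(toy_annotations)
```

A real run would additionally restrict the candidates to the biological process category before filtering.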

Gene expression profiles

The gene expression profile is the best comprehensive dataset to associate stimuli and biological processes. We use two different microarray datasets. One comprises 173 microarrays measured under stress conditions of various types [15]. We categorize the 173 stress conditions into 29 groups based on the type of condition such as heat shock, oxidizing condition, etc. The other comprises 300 microarrays measured under gene-mutated conditions [16]. We categorize the genes into 146 groups based on associated GO terms. We use only the GO terms which are associated with 1,500 genes or fewer. We also use child terms on a GO layered structure if the term has more than 1,200 genes. When one gene belongs to multiple GO terms, we classify the gene into the classification whose number of associated genes is smallest. In both profiles, we remove genes whose expression values are obtained from fewer than 30% of all observed conditions. All missing values are filled in with the average of all the expression values.
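The two preprocessing steps at the end of this paragraph can be sketched as follows. The data layout (a dictionary from gene name to a list of values with NaN marking a missing measurement) and the function name are assumptions for illustration. The text does not say whether the fill value is computed over the whole dataset or per gene; this sketch assumes a single global mean over the retained genes.

```python
import math


def preprocess(expr, min_observed=0.30):
    """Drop sparsely observed genes, then fill missing values with a global mean.

    expr maps a gene name to a list of floats; float('nan') marks a missing value.
    """
    kept = {}
    for gene, values in expr.items():
        observed = sum(not math.isnan(v) for v in values)
        # Remove genes observed in fewer than `min_observed` of the conditions.
        if values and observed >= min_observed * len(values):
            kept[gene] = list(values)

    observed_values = [
        v for values in kept.values() for v in values if not math.isnan(v)
    ]
    if not observed_values:
        return {}
    fill = sum(observed_values) / len(observed_values)  # assumed: global mean
    return {
        gene: [fill if math.isnan(v) else v for v in values]
        for gene, values in kept.items()
    }


nan = float("nan")
toy_expr = {
    "g1": [1.0, nan, 3.0, 5.0],   # 75% observed: kept, one value filled
    "g2": [nan, nan, nan, 2.0],   # 25% observed: removed
    "g3": [2.0, 2.0, 2.0, 2.0],   # fully observed
}
cleaned = preprocess(toy_expr)
```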

Feature selection using LSMI

We use a novel feature selection method called LSMI, which is based on MI, to associate stimuli with cellular processes. Here we consider the forward feature-group addition strategy, i.e., a feature-group score between each input feature-group and output cellular process is computed. The top m feature-groups are used for training a classifier. We predict 12 GO terms independently. We randomly choose 500 genes from among 6,116 genes on the stress condition dataset for feature-group selection and for training a classifier; the rest are used for evaluating the generalization performance. For the gene-mutated expression dataset, we select 500 genes from among 6,210 genes. We repeat this trial 10 times. For classification, we use a Gaussian kernel support vector machine (GK-SVM) [4], where the kernel width is set at the median distance among all samples and the regularization parameter is fixed at C = 10. We explain the efficiency of feature selection of LSMI in the Discussion section.
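The kernel width rule mentioned above (the median distance among all samples) can be sketched with the standard library alone. The function names are hypothetical, and the Gaussian kernel form exp(-||a - b||^2 / (2 width^2)) is an assumption, since the exact parameterization is not stated here.

```python
import math
from itertools import combinations
from statistics import median


def median_distance(samples):
    """Median Euclidean distance over all unordered pairs of samples."""
    return median(math.dist(a, b) for a, b in combinations(samples, 2))


def gaussian_kernel(a, b, width):
    """Gaussian kernel value; the 2 * width**2 scaling is an assumed convention."""
    return math.exp(-math.dist(a, b) ** 2 / (2.0 * width ** 2))


# Three toy expression vectors; pairwise distances are 5, 10 and 5.
toy_samples = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
width = median_distance(toy_samples)
k01 = gaussian_kernel(toy_samples[0], toy_samples[1], width)
```

Tying the width to the data scale in this way avoids a separate search over the kernel width for the classifier.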


The association between stress conditions and biological processes in GO terms is shown in Fig. 1. Each row and column respectively indicate a group of conditions and a GO term. Row and column dendrograms are clustering results by the Ward method according to cell values. Each cell contains an average ranking over 10 trials by LSMI. A red cell denotes that the parameter has a higher rank; that is, the parameter has association with the target GO term. A blue cell denotes that the parameter has a lower rank.
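The cell values described above (a ranking averaged over trials) could be computed along the following lines. The function names, the tie-breaking rule, and the toy scores are assumptions for illustration; the scores would in practice be the LSMI values of each condition group for one GO term.

```python
def ranks_descending(scores):
    """Rank 1 goes to the highest score; ties are broken by original order."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0] * len(scores)
    for rank, index in enumerate(order, start=1):
        ranks[index] = rank
    return ranks


def average_ranks(trial_scores):
    """Average, over trials, of the rank each feature group receives."""
    per_trial = [ranks_descending(scores) for scores in trial_scores]
    return [sum(column) / len(per_trial) for column in zip(*per_trial)]


# Two toy trials scoring three condition groups for a single GO term.
toy_trials = [[0.9, 0.1, 0.5], [0.2, 0.1, 0.8]]
mean_ranks = average_ranks(toy_trials)
```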

Figure 1

Stress conditions versus biological processes. Matrix of stress conditions (rows) versus biological processes (columns). Red cells have higher correlation.

As shown in this figure, conditions are divided into two groups. Almost all conditions in the upper cluster have higher rank, whereas those in a lower cluster have higher rank only under specific conditions. The conditions in the upper cluster include strong heat shocks, dithiothreitol (DTT) exposure, nitrogen depletion, and diamide treatments, which are non-natural conditions. The result reveals that non-natural conditions change overall cellular processes.

The GO term clusters are divided into three groups: DNA/RNA metabolism (right), localization of protein (middle), and others (left). The leftmost cluster contains biosynthesis, gene expression, and protein metabolic processes. From this figure, nucleic acid metabolism processes are inferred to be independent of amino acid metabolism processes. We will confirm this independence and consider the division of clusters using another dataset later.

We herein investigate the details of difference among DNA metabolic process, protein metabolic process and localization of proteins. Under an overexpression condition indicated by sign (A) in Fig. 1, DNA/RNA metabolisms show no correlation with expressions of genes belonging to over-expression genes. This finding of no correlation is one advantage of LSMI. The menadione (vitamin K) exposure condition indicated by (B) in Fig. 1 is associated with localization of proteins. Menadione supplementation causes high toxicity; such toxicity might result from the violation of protein localizations.

Next, we compute the association using expressions of gene mutants. The results are shown in Fig. 2. The stimuli can be categorized into two groups: those with high association under almost all processes and those with high association only under particular conditions. This division is the same as that observed for the stress condition associations. The GO terms are also categorized into three groups: DNA/RNA metabolic processes, protein metabolic processes, and localization. In this experiment, GO terms "gene expression" (GO:0010467) and "organelle organization and biogenesis" (GO:0006996) are in the DNA/RNA metabolic process cluster, although they are classified in the protein metabolic process cluster under stress conditions in Fig. 1. Because both divisions are close to the ancestor division, we can conclude that a cluster for gene expression exists. From these results, GO terms are divisible into four categories: DNA/RNA metabolic process, protein metabolic process, localization, and gene expression.

Figure 2

Mutated gene groups versus biological processes. Overview: a matrix of mutated gene groups (rows) versus biological processes (columns).

In Fig. 3, we present details of three clusters in Fig. 2. Fig. 3(I) presents a cluster whose members are correlated with any biological process. Furthermore, the functions of the mutated genes are essential processes for living cells, such as cellular localization, cell cycle, and growth. This result might indicate that the stimuli in the upper half of Fig. 1 destroy the functions of these essential genes. Fig. 3(II) includes the groups of genes associated with DNA/RNA metabolic processes. In this cluster, YEL033W/MTC1 is a gene with unknown function and is predicted to have a metabolic role using protein-protein interaction [17]. Our clustering result indicates that YEL033W would have some relation with metabolism, especially methylation (methylation is an important part of the one-carbon compound metabolic process). We show genes which have no significant association with DNA/RNA metabolic processes in Fig. 3(III). In the cluster, all genes except AQY2 are of unknown function. Clusters with no correlation cannot be found by existing methods. Our result might provide clues to elucidate these genes' functions.

Figure 3

Submatrices of Figure 2.


A common analytical flow of the expression data is first clustering and then associating clusters with GO terms or pathways. Although clustering reduces the complexity of large datasets, the strategy might fail to detect changes of entire genes within a cell such as metabolic processes.

To interpret such gene expression changes, gene set enrichment analysis [18] has been proposed. This method treats microarrays independently. Therefore, housekeeping genes are often ranked highly. When gene expressions under various conditions are available, our method would reveal the changes of cellular processes better because it compares groups of conditions. The module map [19] gives a global association between a set of genes and a set of conditions. However, this method requires important changes of gene expressions because it uses hypergeometric distributions to compute correlations. Our correlation index is based on MI. Therefore, we can detect nonlinear dependencies with no correlation. An example is depicted in Fig. 3(III).

The characteristics of LSMI and existing MI estimators are presented in Table 1. Detailed comparisons are described in the Methods section. The kernel density estimator (KDE) [6, 7] is distribution-free. Model selection is possible by likelihood cross-validation (LCV). However, a hard task of density estimation is involved. Estimation of the entropies using k-nearest neighbor (KNN) samples [8] is distribution-free and does not involve density estimation directly. However, no model selection method exists for determining the number of nearest neighbors. Edgeworth expansion (EDGE) [9] does not involve density estimation or any tuning parameters. However, it is based on the assumption that the target distribution is close to the normal distribution. On the other hand, LSMI is distribution-free; it involves no density estimation, and model selection is possible by cross-validation (CV). Therefore, LSMI overcomes limitations of the existing approaches. Within a cell, most processes have a nonlinear relation such as enzyme effects and feedback loops. Lacking any one of these advantages might make application to biological datasets difficult. By virtue of these advantages, LSMI can detect correlation or independence between features of complex cellular processes.

Table 1 Relation between existing and proposed MI estimators. If the order of the Edgeworth expansion is regarded as a tuning parameter, model selection of EDGE is expected to be 'Not available'.

To investigate the efficiency of feature selection, we compare the areas under the curve (AUCs) of LSMI(CV), KDE(LCV), KNN(k) for k = 1, 5, EDGE, and PCC. Details of these methods are described in the Methods section. Fig. 4 depicts AUCs for 12 GO term classifications. The x-axis shows the number of stimulus groups used for the prediction. The y-axis shows the AUC averaged over 10 trials, where AUCs are calculated as the area under the receiver operating characteristic (ROC) curve, which is often used for diagnostic tests. Each figure shows AUC curves calculated using the six methods.
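For reference, the area under the ROC curve can be computed without tracing the curve, because it equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one (with ties counted as one half). The sketch below uses this pairwise form; the function name and the toy labels and scores are assumptions for illustration, not the evaluation code used for Fig. 4.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum(
        (p > q) + 0.5 * (p == q)   # a tie contributes one half
        for p in positives
        for q in negatives
    )
    return wins / (len(positives) * len(negatives))


# Toy example: label 1 means the gene is annotated with the GO term.
toy_labels = [1, 1, 0, 0]
toy_scores = [0.9, 0.4, 0.5, 0.1]
toy_auc = auc(toy_labels, toy_scores)

# Averaging over repeated trials, as done for the y-axis of the AUC plots.
trial_aucs = [toy_auc, auc([1, 0, 1, 0], [0.8, 0.3, 0.6, 0.7])]
mean_auc = sum(trial_aucs) / len(trial_aucs)
```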

Figure 4

Classification error. Classification error against the number of feature groups for the yeast cell datasets.

In the AUC figures, the higher curves represent better predictions. For example, Fig. 4(a) shows that LSMI is in the highest position, which means that LSMI achieves the best performance among the six methods. In Figs. 4(b) and 4(d), KNN(1) and KNN(5), which are denoted by the light blue and dotted light blue lines, have the best performance. However, in Figs. 4(i), 4(j) and 4(l), averaged AUCs of KNN using numerous groups are high, whereas the AUCs using few groups are low. No systematic model selection strategies exist for KNN and therefore KNN would be unreliable in practice. Fig. 4(c) depicts that EDGE, which is indicated by the light green line, has the highest AUC. However, EDGE presumes the normal distribution. Consequently, it works well only on a few datasets. From these figures, LSMI indicated by the blue line appears to be the best feature selection method.


We provided a global view of the associations between stimuli and changes of biological processes based on gene expression profiles. The association is generally difficult to use for making models because of nonlinear correlation. To cope with this problem, we introduced a novel feature selection method called LSMI, which uses MI and can be computed efficiently. In comparison to other feature selection methods, LSMI showed better AUCs in prediction of biological process functions. Consequently, our feature selection results would be more reliable than those obtained using the other methods. We calculated the association between stimuli and GO biological process terms using gene expression profiles. The result revealed that biological processes are categorized into four types according to their responses to stimuli: DNA/RNA metabolic process, gene expression, protein metabolic process, and protein localization. LSMI enabled us to reveal the global regulation of cellular processes from comprehensive transcription datasets.


Mutual information estimation

A naive approach to estimating MI is to use a KDE [6, 7], i.e., the densities $p_{xy}(x, y)$, $p_x(x)$, and $p_y(y)$ are separately estimated from samples and the estimated densities are used for computing MI. The bandwidth of the kernel functions could be optimized based on likelihood cross-validation (LCV) [20], so there remains no open tuning parameter in this approach. However, density estimation is known to be a hard problem [4] and therefore the KDE-based method may not be so effective in practice.
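To make the bandwidth selection step concrete, the sketch below chooses the bandwidth of a one-dimensional Gaussian KDE by maximizing a leave-one-out log-likelihood. Treating LCV as leave-one-out, the one-dimensional setting, the candidate grid, and the function names are all assumptions for illustration; the KDE-based MI estimator would apply the same idea to the joint and marginal densities.

```python
import math


def loo_log_likelihood(data, bandwidth):
    """Mean leave-one-out log-likelihood of a 1-D Gaussian KDE."""
    n = len(data)
    norm = 1.0 / ((n - 1) * bandwidth * math.sqrt(2.0 * math.pi))
    total = 0.0
    for i, xi in enumerate(data):
        # Density at xi estimated from all samples except xi itself.
        density = norm * sum(
            math.exp(-0.5 * ((xi - xj) / bandwidth) ** 2)
            for j, xj in enumerate(data)
            if j != i
        )
        total += math.log(density) if density > 0.0 else float("-inf")
    return total / n


def select_bandwidth(data, candidates):
    """Return the candidate bandwidth with the highest held-out likelihood."""
    return max(candidates, key=lambda h: loo_log_likelihood(data, h))


toy_data = [-1.2, -0.7, -0.3, 0.0, 0.1, 0.4, 0.9, 1.5]
toy_grid = [0.01, 0.3, 1.0, 10.0]
best_bandwidth = select_bandwidth(toy_data, toy_grid)
```

A very small bandwidth assigns almost no density to held-out points, and a very large one oversmooths, so the held-out likelihood penalizes both extremes.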

An alternative method involves estimation of entropies using KNN. The KNN-based approach was shown to perform better than KDE [21], given that the number k is chosen appropriately – a small (large) k yields an estimator with small (large) bias and large (small) variance. However, appropriately determining the value of k is not straightforward in the context of MI estimation.

Here, we propose a new MI estimator that can overcome the limitations of the existing approaches. Our method, which we call Least-Squares Mutual Information (LSMI), does not involve density estimation and directly models the density ratio:

$$
w(x, y) := \frac{p_{xy}(x, y)}{p_x(x)\, p_y(y)}.
$$

The solution of LSMI can be computed by simply solving a system of linear equations. Therefore, LSMI is computationally very efficient. Furthermore, a variant of cross-validation (CV) is available for model selection, so the values of tuning parameters such as the regularization parameter and the kernel width can be adaptively determined in an objective manner.

A new MI estimator

In this section, we formulate the MI inference problem as density ratio estimation and propose a new method of estimating the density ratio.

MI inference via density ratio estimation

Let $\mathcal{D}_X\ (\subset \mathbb{R}^{d_x})$ and $\mathcal{D}_Y\ (\subset \mathbb{R}^{d_y})$ be the data domains and suppose we are given $n$ independent and identically distributed (i.i.d.) paired samples

$$
\{ (x_i, y_i) \mid x_i \in \mathcal{D}_X,\ y_i \in \mathcal{D}_Y \}_{i=1}^{n}
$$

drawn from a joint distribution with density $p_{xy}(x, y)$. Let us denote the marginal densities of $x_i$ and $y_i$ by $p_x(x)$ and $p_y(y)$, respectively. The goal is to estimate squared-loss MI defined by Eq.(1).

Our key constraint is that we want to avoid density estimation when estimating MI. To this end, we estimate the density ratio $w(x, y)$ defined by Eq.(2). Given a density ratio estimator $\hat{w}(x, y)$, MI can be simply estimated by

$$
\hat{I}_s(X, Y) = \frac{1}{n^2} \sum_{i,j=1}^{n} \left( \hat{w}(x_i, y_j) - 1 \right)^2.
$$

We model the density ratio function w(x, y) by the following linear model:

$$
\hat{w}_\alpha(x, y) := \alpha^\top \varphi(x, y),
$$

where $\alpha = (\alpha_1, \alpha_2, \ldots, \alpha_b)^\top$ are parameters to be learned from samples, $^\top$ denotes the transpose of a matrix or a vector, and

$$
\varphi(x, y) = (\varphi_1(x, y), \varphi_2(x, y), \ldots, \varphi_b(x, y))^\top
$$

are basis functions such that

$$
\varphi(x, y) \geq 0_b \quad \text{for all } (x, y) \in \mathcal{D}_X \times \mathcal{D}_Y.
$$

$0_b$ denotes the b-dimensional vector with all zeros. Note that $\varphi(x, y)$ could be dependent on the samples $\{x_i, y_i\}_{i=1}^{n}$, i.e., kernel models are also allowed. We explain how the basis functions $\varphi(x, y)$ are chosen in a later section.

A least-squares approach to direct density ratio estimation

We determine the parameter $\alpha$ in the model $\hat{w}_\alpha(x, y)$ so that the following squared error $J_0$ is minimized:

$$
\begin{aligned}
J_0(\alpha) &:= \frac{1}{2} \iint \left( \hat{w}_\alpha(x, y) - w(x, y) \right)^2 p_x(x)\, p_y(y)\, dx\, dy \\
&= \frac{1}{2} \iint \hat{w}_\alpha(x, y)^2\, p_x(x)\, p_y(y)\, dx\, dy - \iint \hat{w}_\alpha(x, y)\, p_{xy}(x, y)\, dx\, dy + C,
\end{aligned}
$$

where $C = \frac{1}{2} \iint w(x, y)\, p_{xy}(x, y)\, dx\, dy$ is a constant and therefore can be safely ignored. Let us denote the first two terms by $J$:

$$
J(\alpha) := J_0(\alpha) - C = \frac{1}{2} \alpha^\top H \alpha - h^\top \alpha,
$$

where

$$
H := \iint \varphi(x, y)\, \varphi(x, y)^\top p_x(x)\, p_y(y)\, dx\, dy, \qquad
h := \iint \varphi(x, y)\, p_{xy}(x, y)\, dx\, dy.
$$
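For readers who want the intermediate steps, the passage from $J_0$ to the quadratic form can be spelled out as follows. This is only a restatement of the definitions given so far, using the identity $w(x, y)\, p_x(x)\, p_y(y) = p_{xy}(x, y)$ from Eq.(2); arguments $(x, y)$ are omitted for brevity.

$$
\begin{aligned}
\frac{1}{2} \iint (\hat{w}_\alpha - w)^2\, p_x p_y\, dx\, dy
&= \frac{1}{2} \iint \hat{w}_\alpha^2\, p_x p_y\, dx\, dy
 - \iint \hat{w}_\alpha\, w\, p_x p_y\, dx\, dy
 + \frac{1}{2} \iint w^2\, p_x p_y\, dx\, dy \\
&= \frac{1}{2} \iint \hat{w}_\alpha^2\, p_x p_y\, dx\, dy
 - \iint \hat{w}_\alpha\, p_{xy}\, dx\, dy
 + \frac{1}{2} \iint w\, p_{xy}\, dx\, dy .
\end{aligned}
$$

The last term does not depend on $\alpha$ and is the constant $C$. Substituting the linear model $\hat{w}_\alpha = \alpha^\top \varphi$, so that $\hat{w}_\alpha^2 = \alpha^\top \varphi \varphi^\top \alpha$, into the first two terms gives

$$
\frac{1}{2} \alpha^\top \left( \iint \varphi \varphi^\top p_x p_y\, dx\, dy \right) \alpha
 - \left( \iint \varphi\, p_{xy}\, dx\, dy \right)^\top \alpha
 = \frac{1}{2} \alpha^\top H \alpha - h^\top \alpha .
$$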

Approximating the expectations in H and h by empirical averages, we obtain the following optimization problem:

$$\tilde{\alpha} := \operatorname*{arg\,min}_{\alpha \in \mathbb{R}^b} \left[ \frac{1}{2}\alpha^\top \hat{H} \alpha - \hat{h}^\top \alpha + \lambda \alpha^\top \alpha \right], \tag{3}$$

where we included a regularization term $\lambda \alpha^\top \alpha$ and

$$\hat{H} := \frac{1}{n^2} \sum_{i,j=1}^{n} \varphi(x_i, y_j)\,\varphi(x_i, y_j)^\top, \qquad \hat{h} := \frac{1}{n} \sum_{i=1}^{n} \varphi(x_i, y_i).$$

Differentiating the objective function (3) with respect to α and equating it to zero, we can obtain an analytic-form solution:

$$\tilde{\alpha} = (\hat{H} + \lambda I_b)^{-1}\hat{h},$$

where $I_b$ is the b-dimensional identity matrix.

We call the above method Least-Squares Mutual Information (LSMI). Thanks to the analytic-form solution, the LSMI solution can be computed very efficiently.
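The closed-form solution above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' implementation: the function name `lsmi_solve`, the toy data, and the four hand-picked basis centers are assumptions introduced here, and the basis (a Gaussian bump in x times an indicator on y) is one possible choice.

```python
import numpy as np

def lsmi_solve(phi_pairs, lam):
    """Closed-form LSMI coefficients (hypothetical helper).

    phi_pairs[i, j] holds the basis vector phi(x_i, y_j), shape (n, n, b).
    """
    n, _, b = phi_pairs.shape
    flat = phi_pairs.reshape(n * n, b)
    # H_hat: average of phi phi^T over all n^2 combinations (x_i, y_j).
    H_hat = flat.T @ flat / n ** 2
    # h_hat: average of phi over the n observed pairs (x_i, y_i).
    h_hat = phi_pairs[np.arange(n), np.arange(n)].mean(axis=0)
    # alpha = (H_hat + lam * I_b)^{-1} h_hat, solved without forming the inverse.
    alpha = np.linalg.solve(H_hat + lam * np.eye(b), h_hat)
    return alpha, H_hat, h_hat

# Toy data (assumed for illustration): 1-D x whose mean depends on a binary y.
rng = np.random.default_rng(0)
n = 60
y = rng.integers(0, 2, size=n)
x = rng.normal(loc=2.0 * y, scale=1.0)

# Illustrative basis functions: exp(-(x - u)^2 / 2) * [y == v] for each center (u, v).
centers = [(0.0, 0), (2.0, 1), (1.0, 0), (1.0, 1)]
phi_pairs = np.stack(
    [np.exp(-(x[:, None] - u) ** 2 / 2.0) * (y[None, :] == v) for u, v in centers],
    axis=2,
)

alpha, H_hat, h_hat = lsmi_solve(phi_pairs, lam=0.1)
# Estimated density ratio at the observed pairs (x_i, y_i).
w_hat = phi_pairs[np.arange(n), np.arange(n)] @ alpha
```

Using `np.linalg.solve` rather than an explicit matrix inverse is a numerical convenience; the result is the same linear-system solution as the formula in the text.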

Convergence bound

Here, we show a non-parametric convergence rate of the solution of the optimization problem (3).

Let $\mathcal{G}$ be a general set of functions on $\mathcal{D}_X \times \mathcal{D}_Y$. For a function $g \in \mathcal{G}$, let us consider a non-negative function R(g) such that

$$\sup_{x, y}\, g(x, y) \le R(g).$$

Then the problem (3) can be generalized as

$$\hat{w} := \operatorname*{arg\,min}_{g \in \mathcal{G}} \left[ \frac{1}{2n^2} \sum_{i,j=1}^{n} g_{i,j}^2 - \frac{1}{n} \sum_{i=1}^{n} g_{i,i} + \lambda_n R(g)^2 \right],$$

where $g_{i,j} := g(x_i, y_j)$. We assume that the true density ratio function w(x, y) is contained in the model $\mathcal{G}$ and satisfies

$w(x, y) < M_0$ for all $(x, y) \in \mathcal{D}_X \times \mathcal{D}_Y$.

We also assume that there exists γ (0 < γ < 2) such that

$$\mathcal{H}_{[\,]}(\mathcal{G}_M, \epsilon, L_2(p_x p_y)) = O\left((M/\epsilon)^{\gamma}\right),$$

where

$$\mathcal{G}_M := \{ g \in \mathcal{G} \mid R(g) \le M \}$$

and $\mathcal{H}_{[\,]}$ is the bracketing entropy of $\mathcal{G}_M$ with respect to the $L_2(p_x p_y)$-norm [22, 23]. This means that the function class $\mathcal{G}$ is not too complex.

Then we have the following theorem. Its proof is omitted due to lack of space.

Theorem 1 Under the above setting, if $\lambda_n \to 0$ and $\lambda_n^{-1} = o(n^{2/(2+\gamma)})$, then

$$\|\hat{w} - w\|_2 = \mathcal{O}_p(\lambda_n^{1/2}),$$

where $\|\cdot\|_2$ means the $L_2(p_x p_y)$-norm and $\mathcal{O}_p$ denotes the asymptotic order in probability.

This theorem is closely related to [24, 25]. [24] considers least squares estimators for nonparametric regression, and related topics can be found in Section 10 of [23].

CV for model selection and basis function design

The performance of LSMI depends on the choice of the model, i.e., the basis functions φ(x, y) and the regularization parameter λ. Here we show that model selection can be carried out based on a variant of CV.

First, the samples $\{z_i \mid z_i = (x_i, y_i)\}_{i=1}^{n}$ are divided into K disjoint subsets $\{\mathcal{Z}_k\}_{k=1}^{K}$. Then a density ratio estimator $\hat{w}_k(x, y)$ is obtained using $\{\mathcal{Z}_j\}_{j \neq k}$, and the cost J is approximated using the held-out samples $\mathcal{Z}_k$ as

$$\hat{J}_k^{(K\text{-CV})} = \sum_{x', y' \in \mathcal{Z}_k} \frac{\hat{w}_k(x', y')^2}{2 n_k^2} - \sum_{(x', y') \in \mathcal{Z}_k} \frac{\hat{w}_k(x', y')}{n_k},$$

where $n_k$ is the number of pairs in the set $\mathcal{Z}_k$. Here $\sum_{x', y' \in \mathcal{Z}_k}$ is the summation over all combinations of x' and y' (i.e., $n_k^2$ terms), while $\sum_{(x', y') \in \mathcal{Z}_k}$ is the summation over all pairs (x', y') (i.e., $n_k$ terms). This procedure is repeated for k = 1, 2, ..., K and its average $\hat{J}^{(K\text{-CV})}$ is used as an estimate of J:

$$\hat{J}^{(K\text{-CV})} = \frac{1}{K} \sum_{k=1}^{K} \hat{J}_k^{(K\text{-CV})}.$$

We can show that $\hat{J}^{(K\text{-CV})}$ is an almost unbiased estimate of the true cost J, where the 'almost'-ness comes from the fact that the number of samples is reduced in the CV procedure due to data splitting [4]. A good model may be chosen by CV, given that a family of promising model candidates is prepared. As model candidates, we propose using a Gaussian kernel model:

$$\varphi_\ell(x, y) = \exp\left( -\frac{\|x - u_\ell\|^2}{2\sigma^2} \right) \delta(y = v_\ell),$$

where $\{(u_\ell, v_\ell)\}_{\ell=1}^{b}$ are 'center' points randomly chosen from $\{(x_i, y_i)\}_{i=1}^{n}$, and $\delta(y = v_\ell)$ is an indicator function, which is 1 if $y = v_\ell$ and 0 otherwise.

In the experiments, we fix the number of basis functions at

b = min(100, n),

and choose the Gaussian width σ and the regularization parameter λ by CV with grid search.
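The model selection procedure described above might be sketched as follows. This is a hedged illustration rather than the authors' code: the helper names (`design`, `fit_alpha`, `cv_score`, `grid_search`), the toy data, and the candidate grids are assumptions, and the centers are drawn from the training folds only (with b capped at the training-fold size), which is a design choice made here and not stated in the text.

```python
import numpy as np

def design(x, y, cx, cy, sigma):
    """Kx[i, l] = exp(-||x_i - u_l||^2 / (2 sigma^2)); Dy[j, l] = 1 if y_j == v_l."""
    d2 = ((x[:, None, :] - cx[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * sigma ** 2)), (y[:, None] == cy[None, :]).astype(float)

def fit_alpha(Kx, Dy, lam):
    """Closed-form coefficients; phi_l(x_i, y_j) = Kx[i, l] * Dy[j, l] factorises H_hat."""
    n, b = Kx.shape
    H_hat = (Kx.T @ Kx) * (Dy.T @ Dy) / n ** 2
    h_hat = (Kx * Dy).mean(axis=0)
    return np.linalg.solve(H_hat + lam * np.eye(b), h_hat)

def cv_score(x, y, sigma, lam, K=5, seed=0):
    """K-fold CV estimate of the cost J for one (sigma, lambda) candidate."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(x.shape[0]), K)
    scores = []
    for k in range(K):
        te = folds[k]
        tr = np.concatenate([folds[j] for j in range(K) if j != k])
        b = min(100, tr.size)                       # number of basis functions
        c = rng.choice(tr, size=b, replace=False)   # randomly chosen centers
        Kx, Dy = design(x[tr], y[tr], x[c], y[c], sigma)
        alpha = fit_alpha(Kx, Dy, lam)
        Kx_te, Dy_te = design(x[te], y[te], x[c], y[c], sigma)
        W = (Kx_te * alpha) @ Dy_te.T               # W[i, j] = w_hat_k(x'_i, y'_j)
        nk = te.size
        # All n_k^2 combinations in the first term, the n_k held-out pairs in the second.
        scores.append((W ** 2).sum() / (2 * nk ** 2) - np.trace(W) / nk)
    return float(np.mean(scores))

def grid_search(x, y, sigmas, lams, K=5):
    """Return (best score, sigma, lambda) minimising the CV cost over the grid."""
    return min((cv_score(x, y, s, l, K), s, l) for s in sigmas for l in lams)

# Toy data (assumed): x depends on a binary label y.
rng = np.random.default_rng(1)
y = rng.integers(0, 2, size=80)
x = rng.normal(loc=2.0 * y, scale=1.0)[:, None]
sigmas, lams = [0.3, 1.0, 3.0], [0.01, 0.1, 1.0]
best_score, best_sigma, best_lam = grid_search(x, y, sigmas, lams, K=4)
```

Because the basis factorises into an x-part and a y-part, the n² combinations in $\hat{H}$ and in the hold-out cost never need to be materialised as a three-dimensional array.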

Relation to existing methods

In this section, we discuss the characteristics of existing and proposed approaches.

Kernel density estimator (KDE)

KDE [6, 7] is a non-parametric technique to estimate a probability density function p(x) from its i.i.d. samples $\{x_i\}_{i=1}^{n}$. For the Gaussian kernel, KDE is expressed as

$$\hat{p}(x) = \frac{1}{n (2\pi\sigma^2)^{d/2}} \sum_{i=1}^{n} \exp\left( -\frac{\|x - x_i\|^2}{2\sigma^2} \right).$$

The performance of KDE depends on the choice of the kernel width σ, which can be optimized by likelihood CV as follows [20]: First, divide the samples $\{x_i\}_{i=1}^{n}$ into K disjoint subsets $\{\mathcal{X}_k\}_{k=1}^{K}$. Then obtain a density estimate $\hat{p}_{\mathcal{X}_k}(x)$ from $\{\mathcal{X}_j\}_{j \neq k}$ and compute its hold-out log-likelihood for $\mathcal{X}_k$:

$$\frac{1}{|\mathcal{X}_k|} \sum_{x \in \mathcal{X}_k} \log \hat{p}_{\mathcal{X}_k}(x).$$

This procedure is repeated for k = 1, 2, ..., K, and the value of σ is chosen such that the average of the hold-out log-likelihood over all k is maximized. Note that the average hold-out log-likelihood is an almost unbiased estimate of the negative Kullback-Leibler divergence from p(x) to $\hat{p}(x)$, up to an irrelevant constant.
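As a rough sketch of the likelihood CV procedure for the Gaussian KDE, assuming NumPy only: the function names `kde_log_density` and `likelihood_cv`, the toy sample, and the three candidate widths are hypothetical choices made for illustration.

```python
import numpy as np

def kde_log_density(x_eval, x_train, sigma):
    """Log of the Gaussian KDE built on x_train, evaluated at each row of x_eval."""
    n, d = x_train.shape
    d2 = ((x_eval[:, None, :] - x_train[None, :, :]) ** 2).sum(axis=2)
    log_k = -d2 / (2.0 * sigma ** 2)
    # Log-sum-exp over training points, to stay stable for small sigma.
    m = log_k.max(axis=1, keepdims=True)
    log_sum = m[:, 0] + np.log(np.exp(log_k - m).sum(axis=1))
    return log_sum - np.log(n) - 0.5 * d * np.log(2.0 * np.pi * sigma ** 2)

def likelihood_cv(x, sigmas, K=5, seed=0):
    """Pick the kernel width maximising the average hold-out log-likelihood."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(x.shape[0]), K)
    avg = []
    for s in sigmas:
        ll = []
        for k in range(K):
            tr = np.concatenate([folds[j] for j in range(K) if j != k])
            ll.append(kde_log_density(x[folds[k]], x[tr], s).mean())
        avg.append(float(np.mean(ll)))
    return sigmas[int(np.argmax(avg))], avg

# Toy data (assumed): 300 draws from a 1-D standard normal.
rng = np.random.default_rng(0)
x = rng.normal(size=(300, 1))
sigmas = [0.001, 0.3, 30.0]
best_sigma, avg_loglik = likelihood_cv(x, sigmas)
```

On this toy sample the far-too-narrow and far-too-wide candidates both yield a much lower hold-out log-likelihood than the intermediate width, which is the behaviour the selection rule relies on.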

Based on KDE, MI can be approximated by separately estimating the densities $p_{xy}(x, y)$, $p_x(x)$ and $p_y(y)$ using $\{x_i, y_i\}_{i=1}^{n}$. However, density estimation is known to be a hard problem and therefore the KDE-based approach may not be so effective in practice.

k-nearest neighbor method (KNN)

Let $\mathcal{N}_k(i)$ be the set of k-nearest neighbor samples of $(x_i, y_i)$, and let

$$\begin{aligned}
\epsilon_x(i) &:= \max\{ \|x_i - x_{i'}\| \mid (x_{i'}, y_{i'}) \in \mathcal{N}_k(i) \}, \\
\epsilon_y(i) &:= \max\{ \|y_i - y_{i'}\| \mid (x_{i'}, y_{i'}) \in \mathcal{N}_k(i) \}, \\
n_x(i) &:= \#\{ z_{i'} \mid \|x_i - x_{i'}\| \le \epsilon_x(i) \}, \\
n_y(i) &:= \#\{ z_{i'} \mid \|y_i - y_{i'}\| \le \epsilon_y(i) \}.
\end{aligned}$$

Then the KNN-based MI estimator is given as follows [8]:

$$\hat{I}(X, Y) = \psi(k) + \psi(n) - \frac{1}{k} - \frac{1}{n} \sum_{i=1}^{n} \left[ \psi(n_x(i)) + \psi(n_y(i)) \right],$$

where ψ is the digamma function.

A practical drawback of the KNN-based approach is that the estimation accuracy depends on the value of k, and there seems to be no systematic strategy for choosing k appropriately.
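The KNN-based estimator can be sketched as below. Several details are assumptions made here because the text does not specify them: neighbors are found with the max-norm in the joint (x, y) space, as is common for this family of estimators; the point itself is excluded from the counts $n_x(i)$ and $n_y(i)$; and the digamma function is evaluated only at positive integers through the harmonic-number identity. The names `digamma_int` and `knn_mi` are hypothetical.

```python
import numpy as np

def digamma_int(m):
    """psi(m) for positive integers: psi(m) = -gamma + sum_{j=1}^{m-1} 1/j."""
    m = np.asarray(m, dtype=int)
    harmonic = np.concatenate([[0.0], np.cumsum(1.0 / np.arange(1, m.max()))])
    return -np.euler_gamma + harmonic[m - 1]

def knn_mi(x, y, k=3):
    """KNN-based MI estimate for samples x (n, dx) and y (n, dy)."""
    n = x.shape[0]
    dx = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=2)
    dy = np.linalg.norm(y[:, None, :] - y[None, :, :], axis=2)
    dz = np.maximum(dx, dy)                 # joint-space distance (assumed max-norm)
    np.fill_diagonal(dz, np.inf)            # a point is not its own neighbor
    nbrs = np.argsort(dz, axis=1)[:, :k]    # indices of the k nearest neighbors
    rows = np.arange(n)[:, None]
    eps_x = dx[rows, nbrs].max(axis=1)      # epsilon_x(i)
    eps_y = dy[rows, nbrs].max(axis=1)      # epsilon_y(i)
    n_x = (dx <= eps_x[:, None]).sum(axis=1) - 1   # counts excluding the point itself
    n_y = (dy <= eps_y[:, None]).sum(axis=1) - 1
    return float(digamma_int(k) + digamma_int(n) - 1.0 / k
                 - np.mean(digamma_int(n_x) + digamma_int(n_y)))

# Toy data (assumed): one dependent and one independent Gaussian pair.
rng = np.random.default_rng(0)
x = rng.normal(size=(400, 1))
y_dep = 0.9 * x + np.sqrt(1.0 - 0.81) * rng.normal(size=(400, 1))
y_ind = rng.normal(size=(400, 1))
mi_dep = knn_mi(x, y_dep, k=3)
mi_ind = knn_mi(x, y_ind, k=3)
```

Re-running `knn_mi` with several values of k on the same data is a quick way to see the sensitivity to k mentioned in the text.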

Edgeworth expansion (EDGE)

MI can be expressed in terms of the entropies as

I(X, Y) = H(X) + H(Y) - H(X, Y),

where H(X) denotes the entropy of X:

$$H(X) := -\int p_x(x) \log p_x(x)\, dx.$$

Thus MI can be approximated if the entropies above are estimated.

In the paper [9], an entropy approximation method based on the Edgeworth expansion is proposed, where the entropy of a distribution is approximated by that of the normal distribution and some additional higher-order correction terms. More specifically, for a d-dimensional distribution, the entropy is approximated by

$$H \approx H_{\text{normal}} - \frac{1}{12} \sum_{i=1}^{d} \kappa_{i,i,i}^2 - \frac{1}{4} \sum_{i,j=1,\, i \neq j}^{d} \kappa_{i,i,j}^2 - \frac{1}{72} \sum_{i,j,k=1,\, i<j<k}^{d} \kappa_{i,j,k}^2,$$

where $H_{\text{normal}}$ is the entropy of the normal distribution with covariance matrix equal to that of the target distribution and $\kappa_{i,j,k}$ ($1 \le i, j, k \le d$) is the standardized third cumulant of the target distribution. In practice, all the cumulants are estimated from samples.

If the underlying distribution is close to the normal distribution, the above approximation is quite accurate and the EDGE method works very well. However, if the distribution is far from the normal distribution, the approximation error gets large and therefore the EDGE method may be unreliable. In principle, it is possible to include the fourth and even higher cumulants for further reducing the estimation bias. However, this in turn increases the estimation variance; the expansion up to the third cumulants would be reasonable.
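The entropy approximation above, together with I(X, Y) = H(X) + H(Y) - H(X, Y), could be sketched as follows. This is an illustration under assumptions, not the method's reference implementation: the third cumulants are taken as sample third central moments divided by the product of the three sample standard deviations, and the function names `edge_entropy` and `edge_mi` are hypothetical.

```python
import numpy as np
from itertools import combinations

def edge_entropy(x):
    """Edgeworth-type entropy approximation for samples x of shape (n, d)."""
    n, d = x.shape
    xc = x - x.mean(axis=0)
    cov = np.atleast_2d(np.cov(xc, rowvar=False))
    # Entropy of the normal distribution with the sample covariance.
    h_normal = 0.5 * np.linalg.slogdet(cov)[1] + 0.5 * d * (1.0 + np.log(2.0 * np.pi))
    # Standardized third cumulants kappa[i, j, k] (assumed standardization by std).
    z = xc / xc.std(axis=0)
    kappa = np.einsum('ni,nj,nk->ijk', z, z, z) / n
    t1 = sum(kappa[i, i, i] ** 2 for i in range(d))
    t2 = sum(kappa[i, i, j] ** 2 for i in range(d) for j in range(d) if i != j)
    t3 = sum(kappa[i, j, k] ** 2 for i, j, k in combinations(range(d), 3))
    return float(h_normal - t1 / 12.0 - t2 / 4.0 - t3 / 72.0)

def edge_mi(x, y):
    """MI from three entropy estimates: H(X) + H(Y) - H(X, Y)."""
    return edge_entropy(x) + edge_entropy(y) - edge_entropy(np.hstack([x, y]))

# Toy data (assumed): correlated Gaussian pair with correlation 0.9,
# for which the true MI is -0.5 * log(1 - 0.81), roughly 0.83.
rng = np.random.default_rng(0)
x = rng.normal(size=(5000, 1))
y = 0.9 * x + np.sqrt(1.0 - 0.81) * rng.normal(size=(5000, 1))
h_x = edge_entropy(x)
mi_xy = edge_mi(x, y)
```

For Gaussian data the correction terms are close to zero, so the estimate is dominated by $H_{\text{normal}}$; this matches the remark in the text that the method is most accurate near normality.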


  1. 1.

    Boyle EI, Weng S, Gollub J, Jin H, Botstein D, Cherry JM, Sherlock G: GO::TermFinder – open source software for accessing Gene Ontology information and finding significantly enriched Gene Ontology terms associated with a list of genes. bioinformatics. 2004, 20: 3710-3715. 10.1093/bioinformatics/bth456.

    PubMed Central  CAS  Article  PubMed  Google Scholar 

  2. 2.

    Priness I, Maimon O, Ben-Gal I: Evaluation of gene-expression clustering via mutual information distance measure. BMC Bioinformatics. 2007, 8: 111-10.1186/1471-2105-8-111.

    PubMed Central  Article  PubMed  Google Scholar 

  3. 3.

    Yvan Saeys II, Larranaga P: A review of feature selection techniques in bioinformatics. bioinformatics. 2007, 23 (19): 2507-2517. 10.1093/bioinformatics/btm344.

    Article  PubMed  Google Scholar 

  4. 4.

    Schölkopf B, Smola AJ: Learning with Kernels. 2002, Cambridge, MA: MIT Press

    Google Scholar 

  5. 5.

    Pepe MS: Evaluation of Medical Tests for Classification and Prediction. 2003, Oxford Press

    Google Scholar 

  6. 6.

    Silverman BW: Density Estimation for Statistics and Data Analysis. 1986, Chapman & Hall/CRC

    Book  Google Scholar 

  7. 7.

    Fraser AM, Swinney HL: Independent coordinates for strange attractors from mutual information. Physical Review A. 1986, 33 (2): 1134-1140. 10.1103/PhysRevA.33.1134.

    Article  PubMed  Google Scholar 

  8. 8.

    Kraskov A, Stögbauer H, Grassberger P: Estimating mutual information. Physical Review E. 2004, 69: 066138-10.1103/PhysRevE.69.066138.

    Article  Google Scholar 

  9. 9.

    Hulle MMV: Edgeworth Approximation of Multivariate Differential Entropy. Neural Computation. 2005, 17 (9): 1903-1910. 10.1162/0899766054323026.

    Article  PubMed  Google Scholar 

  10. 10.

    Guyon I, Elisseeff A: An Introduction to Variable Feature Selection. Journal of Machine Learning Research. 2003, 3: 1157-1182. 10.1162/153244303322753616.

    Google Scholar 

  11. 11.

    Torkkola K: Feature extraction by non-parametric mutual information maximization. Journal of Machine Learning Research. 2003, 3: 1415-1438. 10.1162/153244303322753742.

  12.

    Comon P: Independent Component Analysis, A new concept? Signal Processing. 1994, 36 (3): 287-314. 10.1016/0165-1684(94)90029-9.

  13.

    Cover TM, Thomas JA: Elements of Information Theory. 1991, New York: John Wiley & Sons, Inc

  14.

    Ashburner M, Ball CA, Blake JA, Botstein D, Butler H: Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet. 2000, 25: 25-29. 10.1038/75556.

  15.

    Gasch AP, Spellman PT, Kao CM, Carmel-Harel O, Eisen MB, Storz G, Botstein D, Brown PO: Genomic Expression Programs in the Response of Yeast Cells to Environmental Changes. Molecular Biology of the Cell. 2000, 11 (12): 4241-4257.

  16.

    Hughes TR, Marton MJ, Jones AR: Functional Discovery via a Compendium of Expression Profiles. Cell. 2000, 102: 109-126. 10.1016/S0092-8674(00)00015-5.

  17.

    Schlitt T, Palin K, Rung J, Dietmann S, Lappe M, Ukkonen E, Brazma A: From Gene Networks to Gene Function. Genome Research. 2003, 13: 2568-2576. 10.1101/gr.1111403.

  18.

    Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, Paulovich A, Pomeroy SL, Golub TR, Lander ES, Mesirov JP: Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci U S A. 2005, 102: 15545-15550. 10.1073/pnas.0506580102.

  19.

    Segal E, Friedman N, Koller D, Regev A: A module map showing conditional activity of expression modules in cancer. Nature Genetics. 2004, 36 (10): 1090-1098. 10.1038/ng1434.

  20.

    Härdle W, Müller M, Sperlich S, Werwatz A: Nonparametric and Semiparametric Models. 2004, Springer Series in Statistics, Berlin: Springer

  21.

    Khan S, Bandyopadhyay S, Ganguly A, Saigal S: Relative performance of mutual information estimation methods for quantifying the dependence among short and noisy data. Physical Review E. 2007, 76: 026209. 10.1103/PhysRevE.76.026209.

  22.

    van der Vaart AW, Wellner JA: Weak Convergence and Empirical Processes. With Applications to Statistics. 1996, New York: Springer

  23.

    van de Geer S: Empirical Processes in M-Estimation. 2000, Cambridge University Press

  24.

    van de Geer S: Estimating a Regression Function. The Annals of Statistics. 1990, 18 (2): 907-924. 10.1214/aos/1176347632.

  25.

    Birgé L, Massart P: Rates of convergence for minimum contrast estimators. Probability Theory and Related Fields. 1993, 97: 113-150. 10.1007/BF01199316.



This work was partially supported by KAKENHI (Grant-in-Aid for Scientific Research) on Priority Areas "Systems Genomics" from the Ministry of Education, Culture, Sports, Science and Technology of Japan.

T.S. was supported by the JSPS Research Fellowships for Young Scientists.

This article has been published as part of BMC Bioinformatics Volume 10 Supplement 1, 2009: Proceedings of The Seventh Asia Pacific Bioinformatics Conference (APBC) 2009. The full contents of the supplement are available online at

Author information



Corresponding author

Correspondence to Jun Sese.

Additional information

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

TS developed the method, implemented the algorithm and wrote the manuscript. MS and TK discussed the method and revised the manuscript. JS discussed the method, interpreted the results and wrote the manuscript.

Rights and permissions

Open Access This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

About this article

Cite this article

Suzuki, T., Sugiyama, M., Kanamori, T. et al. Mutual information estimation reveals global associations between stimuli and biological processes. BMC Bioinformatics 10, S52 (2009).


  • Gene Ontology
  • Feature Selection
  • Mutual Information
  • Density Ratio
  • Protein Metabolic Process