Co-expression methods are widely used for analyzing gene expression data and other high dimensional “omics” data. Most co-expression measures fall into one of two categories: correlation coefficients or mutual information (MI) measures. MI measures have attractive information-theoretic interpretations and can capture non-linear associations. Although MI is well defined for discrete or categorical variables, it is non-trivial to estimate between quantitative variables, and the corresponding permutation tests can be computationally intensive. In contrast, the correlation coefficient and other model based association measures are ideally suited for relating quantitative variables: they are easy to calculate, afford straightforward statistical testing procedures, and allow additional covariates to be included in the analysis. Researchers trained in statistics often measure gene co-expression by the correlation coefficient, whereas computer scientists, trained in information theory, tend to use an MI based measure. Thus far, the majority of published articles use the correlation coefficient as co-expression measure [1–5], but hundreds of articles have used MI [6–12].

Several articles have used simulations and real data to compare the two co-expression measures when clustering gene expression data. Allen et al. found that the correlation based network inference method WGCNA [5] and the mutual information based method ARACNE [9] both perform well in constructing global network structure [13]. Steuer et al. showed that mutual information and the Pearson correlation have an almost one-to-one correspondence when measuring pairwise gene relationships within their investigated data set, justifying the application of the Pearson correlation as a similarity measure for gene expression measurements [14]. In simulations, no evidence could be found that mutual information performs better than correlation for constructing co-expression networks [15]. However, MI continues to be used in recent publications, and some authors have argued that MI is more robust than the Pearson correlation in terms of distinguishing various clustering solutions [10]. Given these debates, it remains an open question whether mutual information can be supplanted by standard model based association measures. We affirmatively answer this question by i) reviewing the close relationship between mutual information and the likelihood ratio test statistic in the case of categorical variables, ii) finding a close relationship between mutual information and correlation in simulations and empirical studies, and iii) proposing polynomial and spline regression models as alternatives to mutual information for modeling non-linear relationships.

While previous comparisons involved the Pearson correlation, we provide a more comprehensive comparison that considers i) different types of correlation coefficients, e.g. the biweight midcorrelation (bicor), ii) different approaches for constructing MI based and correlation based networks, iii) different ways of transforming a network adjacency matrix (e.g. the topological overlap reviewed below [4, 16–18]), and iv) 8 diverse gene expression data sets from yeast, mouse, and humans. Our unbiased comparison evaluates co-expression measures both at the level of gene pair relationships and at the level of forming co-expression modules (clusters of genes).

This article presents the following results. First, probably the most comprehensive empirical comparison to date is used to evaluate which pairwise association measure leads to the biologically most meaningful network modules (clusters), as judged by functional enrichment with respect to GO ontologies. Second, polynomial regression and spline regression methods are evaluated for defining non-linear association measures between gene pairs. Third, simulation studies are used to validate a functional relationship (the cor-MI function) between correlation and mutual information when the two variables satisfy a linear relationship. Our comprehensive empirical studies illustrate that the cor-MI function approximates the relationship between mutual information and correlation in real data sets, which indicates that in many situations the MI measure is not worth the trouble. Gene pairs where the two association measures disagree are investigated to determine whether technical artifacts lead to the incongruence.

Overall, we find that bicor is an attractive co-expression measure, particularly when a limited sample size does not permit the detection of non-linear relationships. Our theoretical results, simulations, and 8 different gene expression data sets show that MI is often inferior to correlation based approaches in terms of elucidating pairwise gene relationships and identifying co-expression modules. A signed correlation network transformed via the topological overlap matrix transformation often leads to the most significant functional enrichment of modules. Polynomial and spline regression based statistical approaches are promising alternatives to MI for measuring non-linear relationships.

### Association measure and network adjacency

An association measure is used to estimate the relationships between two random variables. For example, correlation is a commonly used association measure. There are different types of correlations. While the Pearson correlation, which measures the extent of a linear relationship, is the most widely used correlation measure, the following two more robust correlation measures are often used. First, the Spearman correlation is based on ranks, and measures the extent of a monotonic relationship between *x* and *y*. Second, “bicor” (refer to Materials and Methods for definition and details) is a median based correlation measure, and is more robust than the Pearson correlation but often more powerful than the Spearman correlation [19, 20]. All correlation coefficients take on values between −1 and 1 where negative values indicate an inverse relationship. A correlation coefficient is an attractive association measure since i) it can be easily calculated, ii) it affords several asymptotic statistical tests (regression models, Fisher transformation) for calculating significance levels (p-values), and iii) the sign of correlation allows one to distinguish between positive and negative relationships. Other association measures, such as mutual information, will be introduced in the next sections.
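As a concrete illustration, bicor can be computed in a few lines. The following is a minimal sketch based on the standard definition (Tukey biweights around the median); the reference implementation in the WGCNA R package adds safeguards, e.g. for a zero median absolute deviation, that are omitted here:

```python
import numpy as np

def bicor(x, y):
    """Biweight midcorrelation: a median based, outlier-robust correlation.

    Sketch only: assumes the median absolute deviation of x and y is nonzero.
    """
    def weighted_deviation(v):
        med = np.median(v)
        mad = np.median(np.abs(v - med))      # (unscaled) median absolute deviation
        u = (v - med) / (9.0 * mad)
        w = (1.0 - u ** 2) ** 2 * (np.abs(u) < 1)  # Tukey biweight; 0 beyond 9 MADs
        return (v - med) * w

    a, b = weighted_deviation(x), weighted_deviation(y)
    return np.sum(a * b) / np.sqrt(np.sum(a ** 2) * np.sum(b ** 2))
```

Because observations far from the median receive zero weight, a single outlier that would dominate the Pearson correlation has little influence on bicor, consistent with [19, 20].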

Association measures can be transformed into network adjacencies. For *n* variables *v*_{1},…,*v*_{n}, an adjacency matrix *A* = (*A*_{ij}) is an *n* × *n* matrix quantifying the pairwise connection strength between variables. An (undirected) network adjacency satisfies the following conditions:

0 ≤ *A*_{ij} ≤ 1,  *A*_{ij} = *A*_{ji},  *A*_{ii} = 1

An association network is defined as a network whose nodes correspond to random variables and whose adjacency matrix is based on the association measure between pairs of variables [21]. Association networks describe the pairwise associations between variables (interpreted as nodes). For a given set of nodes, there is a one-to-one relationship between the association network and the adjacency matrix. In order to build an association network for *n* variables *v* = (*v*_{1},…,*v*_{n}), we start by defining an association measure *AssocMeasure*(*x*, *y*) as a real valued function of two vectors *x*, *y*. We then apply this function to the set of *N* = *n*^{2} variable pairs {*Pair*_{1} = (*v*_{1}, *v*_{1}), *Pair*_{2} = (*v*_{1}, *v*_{2}),…, *Pair*_{N} = (*v*_{n}, *v*_{n})}, resulting in an *n* × *n* dimensional matrix *S* = (*S*_{ij}) = (*AssocMeasure*(*v*_{i}, *v*_{j})).

Then, one needs to specify how the association matrix *S* is transformed into an adjacency matrix. This involves three steps: 1) symmetrize *S*; 2) transform (and/or threshold) *S* to [0, 1]; 3) set the diagonal values to 1. As for step 1, many methods can be used to symmetrize *S* if it is non-symmetric, e.g. replacing *S*_{ij} and *S*_{ji} by their average, their maximum, or their minimum.

As for step 2, if *LowerBounds*(*S*) and *UpperBounds*(*S*) denote symmetric matrices of element-wise lower and upper bounds for *S*, then a simple transformation can be defined as:

*A*_{ij} = ((*S*_{ij} − *LowerBounds*(*S*)_{ij}) / (*UpperBounds*(*S*)_{ij} − *LowerBounds*(*S*)_{ij}))^{β}    (6)

where the power *β* is a constant that denotes a soft threshold. As an example, assume that the association measure is given by a correlation coefficient, i.e. *S* = (*cor*(*x*_{i}, *x*_{j})). Since each correlation has the lower bound −1 and upper bound +1, Eq. 6 reduces to the case of a signed weighted correlation network given by [4, 22]:

*A*_{ij} = ((1 + *cor*(*x*_{i}, *x*_{j})) / 2)^{β}    (7)

Additional details of correlation based adjacencies (unweighted or weighted, unsigned or signed) are described in Materials and Methods.
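In code, the signed weighted correlation network described above amounts to a single element-wise transformation of a correlation matrix. A minimal sketch using the Pearson correlation (the value *β* = 6 in the test below is illustrative, not a recommendation from this article):

```python
import numpy as np

def signed_adjacency(expr, beta):
    """Signed weighted correlation network: A_ij = ((1 + cor_ij) / 2) ** beta.

    expr: samples-by-genes matrix; beta: soft-threshold power.
    """
    cor = np.corrcoef(expr, rowvar=False)   # genes are columns
    return ((1.0 + cor) / 2.0) ** beta
```

The transformation maps correlations from [−1, 1] into [0, 1], so strongly negatively correlated genes receive adjacency near 0 rather than a large unsigned adjacency.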

#### Network adjacency based on co-expression measures

When dealing with gene expression data, *x*_{i} denotes the expression levels of the *i*-th gene (or probe) across multiple samples. In this article, we assume that the *m* components of *x*_{i} correspond to independent random samples. Co-expression measures can be used to define co-expression networks in which the nodes correspond to genes. The adjacencies *A*_{ij} encode the similarity between the expression profiles of genes *i* and *j*. In practice, transformations such as the topological overlap measure (TOM) [4, 16–18] are often used to turn an original network adjacency matrix into a new one. Details of the TOM transformation are reviewed in Materials and Methods.
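As a brief sketch of the TOM transformation, the standard unsigned form TOM_{ij} = (Σ_{u} *a*_{iu}*a*_{uj} + *a*_{ij}) / (*min*(*k*_{i}, *k*_{j}) + 1 − *a*_{ij}) can be written as follows, assuming an adjacency matrix with entries in [0, 1] as input (variants are described in Materials and Methods):

```python
import numpy as np

def tom_similarity(adj):
    """Topological overlap matrix (TOM) of an adjacency matrix with entries in [0, 1].

    TOM_ij = (sum_u a_iu * a_uj + a_ij) / (min(k_i, k_j) + 1 - a_ij).
    """
    A = np.array(adj, dtype=float)      # work on a copy
    np.fill_diagonal(A, 0.0)            # exclude self-connections
    L = A @ A                           # shared-neighbor term: sum_u a_iu * a_uj
    k = A.sum(axis=1)                   # node connectivities
    kmin = np.minimum.outer(k, k)
    tom = (L + A) / (kmin + 1.0 - A)
    np.fill_diagonal(tom, 1.0)          # by convention, TOM_ii = 1
    return tom
```

Two genes receive a high topological overlap if they are strongly connected to each other and share many strongly connected neighbors, which tends to make downstream module detection more robust to spurious individual adjacencies.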

### Mutual information networks based on categorical variables

Assume two random samples *dx* and *dy* of length *m* drawn from corresponding discrete or categorical random variables *DX* and *DY*. Each entry of *dx* equals one of *R* levels *ldx*_{1},…,*ldx*_{R}, and each entry of *dy* equals one of *C* levels *ldy*_{1},…,*ldy*_{C}. The mutual information (MI) is defined as:

*MI*(*dx*, *dy*) = Σ_{r=1}^{R} Σ_{c=1}^{C} *p*(*ldx*_{r}, *ldy*_{c}) log(*p*(*ldx*_{r}, *ldy*_{c}) / (*p*(*ldx*_{r}) *p*(*ldy*_{c})))    (8)

where *p*(*ldx*_{r}) is the frequency of level *r* of *dx*, *p*(*ldx*_{r}, *ldy*_{c}) is the joint frequency, and log is the natural logarithm. Note that the following **simple relationship exists between the mutual information** (Eq. 8) **and the likelihood ratio test statistic** (described in Additional file 1):

*MI*(*dx*, *dy*) = *LikelihoodRatioTestStat*(*dx*, *dy*) / (2*m*)    (9)

This relationship has many applications. First, it can be used to prove that the mutual information takes on non-negative values. Second, it can be used to calculate an asymptotic p-value for the mutual information. Third, it points to a way of defining a mutual information measure that adjusts for additional conditioning variables *z*_{1}, *z*_{2},…. Specifically, one can use a multivariate *multinomial regression model* for regressing *dy* on *dx* and the conditioning variables. Up to a scaling factor of 2*m*, the likelihood ratio test statistic can be interpreted as a (non-symmetric) measure of mutual information between *dx* and *dy* that adjusts for conditioning variables. More detailed discussion of mutual information can be found in [14, 23, 24]. In Additional file 1, we describe association measures between categorical variables in detail, including the LRT statistic and MI.
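The identity between the plug-in MI and the likelihood ratio (G) statistic is easy to check numerically. A small self-contained sketch (the function names are ours, not from this article):

```python
import numpy as np
from collections import Counter

def mutual_information(dx, dy):
    """Plug-in MI (natural log) between two categorical samples of equal length."""
    m = len(dx)
    px, py, pxy = Counter(dx), Counter(dy), Counter(zip(dx, dy))
    # (c/m) * log( (c/m) / ((px/m)*(py/m)) ) simplifies to (c/m) * log(c*m/(px*py))
    return sum((c / m) * np.log(c * m / (px[a] * py[b]))
               for (a, b), c in pxy.items())

def g_statistic(dx, dy):
    """Likelihood ratio (G) test statistic from the observed contingency table."""
    m = len(dx)
    px, py, pxy = Counter(dx), Counter(dy), Counter(zip(dx, dy))
    # G = 2 * sum over nonzero cells of O * log(O / E), with E = px * py / m
    return 2.0 * sum(c * np.log(c * m / (px[a] * py[b])) for (a, b), c in pxy.items())
```

Since the two quantities agree exactly (G = 2*m*·MI), asymptotic chi-square p-values for the G statistic carry over directly to the plug-in MI.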

As discussed below, numerous ways have been suggested for constructing an adjacency matrix based on MI. Here we describe an approach that results in a weighted adjacency matrix. Consider *n* categorical variables *dx*_{1}, *dx*_{2},…, *dx*_{n}. Their mutual information matrix *S* = (*MI*(*dx*_{i}, *dx*_{j})) is a similarity matrix whose entries are bounded from below by 0. To arrive at an upper bound, we review the relationship between mutual information and entropy (the following equation is text book knowledge):

*MI*(*dx*, *dy*) = *Entropy*(*dx*) + *Entropy*(*dy*) − *Entropy*(*dx*, *dy*)    (10)

where *Entropy*(*dx*) denotes the entropy of *dx* and *Entropy*(*dx*, *dy*) denotes the joint entropy (refer to Additional file 1). Using Eq. 10, one can prove that the mutual information has the following 3 upper bounds:

*MI*(*dx*, *dy*) ≤ *min*(*Entropy*(*dx*), *Entropy*(*dy*))    (11)

*MI*(*dx*, *dy*) ≤ (*Entropy*(*dx*) + *Entropy*(*dy*)) / 2    (12)

*MI*(*dx*, *dy*) ≤ *max*(*Entropy*(*dx*), *Entropy*(*dy*))    (13)

Using Eq. 6 with *β* = 1, lower bounds of 0 and *UpperBounds*_{ij} = (*Entropy*(*dx*_{i}) + *Entropy*(*dx*_{j}))/2 (Eq. 12) results in the *symmetric uncertainty based mutual information adjacency matrix*:

*A*^{MI,SymmetricUncertainty}_{ij} = 2 *MI*(*dx*_{i}, *dx*_{j}) / (*Entropy*(*dx*_{i}) + *Entropy*(*dx*_{j}))    (14)

A transformation of *A*^{MI,SymmetricUncertainty} leads to the *universal mutual information based adjacency matrix version 1* (denoted AUV1):

*A*^{MI,UniversalVersion1}_{ij} = *A*^{MI,SymmetricUncertainty}_{ij} / (2 − *A*^{MI,SymmetricUncertainty}_{ij})    (15)

One can easily prove that 0 ≤ *A*^{MI,UniversalVersion1}_{ij} ≤ 1. The term “universal” reflects the fact that the adjacency based dissimilarity *dissMI*^{UniversalVersion1} = 1 − *A*^{MI,UniversalVersion1} turns out to be a universal distance function [25]. Roughly speaking, the universality of *dissMI*^{UniversalVersion1} implies that any other distance measure between *dx*_{i} and *dx*_{j} will be small if *dissMI*^{UniversalVersion1}_{ij} is small. The term “distance” reflects the fact that *dissMI*^{UniversalVersion1} satisfies the properties of a distance including the triangle inequality.

Another adjacency matrix is based on the upper bound implied by inequality 13. We define the *universal mutual information based adjacency matrix version 2*, or AUV2, as follows:

*A*^{MI,UniversalVersion2}_{ij} = *MI*(*dx*_{i}, *dx*_{j}) / *max*(*Entropy*(*dx*_{i}), *Entropy*(*dx*_{j}))    (16)

The name reflects the fact that *dissMI*^{UniversalVersion2} = 1 − *A*^{MI,UniversalVersion2} is also a universal distance measure [25]. While *A*^{MI,UniversalVersion1} and *A*^{MI,UniversalVersion2} are in general different, we find very high Spearman correlations (*r* > 0.9) between their vectorized versions.
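Given an MI matrix and the per-variable entropies, the three adjacency matrices (Eq. 14, Eq. 15, Eq. 16) are direct element-wise transformations. A sketch, assuming the MI matrix and entropies have been estimated elsewhere (e.g. from discretized expression profiles):

```python
import numpy as np

def mi_adjacencies(MI, H):
    """Three MI-based adjacency matrices from a symmetric MI matrix and the
    vector H of per-variable entropies (natural log).

    Returns (symmetric uncertainty, universal version 1, universal version 2).
    """
    A_su = 2.0 * MI / np.add.outer(H, H)       # Eq. 14: MI over average entropy
    A_uv1 = A_su / (2.0 - A_su)                # Eq. 15: equals MI / joint entropy
    A_uv2 = MI / np.maximum.outer(H, H)        # Eq. 16: MI over max entropy
    return A_su, A_uv1, A_uv2
```

Because the diagonal of the MI matrix satisfies *MI*_{ii} = *Entropy*(*dx*_{i}), all three adjacency matrices automatically have 1 on the diagonal, as required of a network adjacency.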

Many alternative approaches exist for defining MI based networks; for example, ARACNE [9], CLR [26], MRNET [27] and RELNET [6, 28] are described in Materials and Methods.

### Mutual information networks based on discretized numeric variables

In its original formulation, the mutual information measure was only defined for discrete or categorical variables, see e.g. [23]. It is challenging to extend the definition to *quantitative* variables, but several strategies have been proposed in the literature [7, 28, 29]. In this article, we will only consider the following approach, which is based on discretizing the numeric vector *x* using the equal-width discretization method. This method partitions the interval [*min*(*x*), *max*(*x*)] into equal-width bins (sub-intervals). The vector *discretize*(*x*) has the same length as *x*, but its *l*-th component reports the bin number in which *x*_{l} falls:

*discretize*(*x*)_{l} = *r* if *x*_{l} lies in the *r*-th bin

The number of bins, *no.bins*, is the only parameter of the equal-width discretization method.
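A minimal sketch of equal-width discretization (our own implementation; bins are numbered 1,…,*no.bins*, with the convention that *max*(*x*) falls in the last bin):

```python
import numpy as np

def discretize(x, no_bins):
    """Equal-width discretization: map each x_l to a bin number in 1..no_bins,
    using equal-width bins spanning [min(x), max(x)]."""
    x = np.asarray(x, dtype=float)
    edges = np.linspace(x.min(), x.max(), no_bins + 1)
    # digitize against the interior edges so that min(x) lands in bin 1
    # and max(x) lands in bin no_bins
    return np.digitize(x, edges[1:-1]) + 1
```

Values lying exactly on an interior bin edge are assigned to the higher-numbered bin; this boundary convention is a choice of our sketch, not something the article specifies.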

In our subsequent studies, we calculate an MI-based adjacency matrix using the following three steps. First, numeric vectors of gene expression profiles are discretized according to the equal-width discretization method with the default number of bins given by *no.bins* = ⌈√*m*⌉. Second, the mutual information *MI*_{ij} = *MI*(*discretize*(*x*_{i}), *discretize*(*x*_{j})) is calculated between the discretized vectors based on Eq. 10 and the Miller–Madow entropy estimation method (detailed in Additional file 1). Third, the MI matrix is transformed into one of three possible MI-based adjacency matrices: *A*^{MI,SymmetricUncertainty} (Eq. 14), *A*^{MI,UniversalVersion1} (Eq. 15), *A*^{MI,UniversalVersion2} (Eq. 16).
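The Miller–Madow estimator used in the second step adds a simple bias correction to the plug-in entropy of the discretized data. A sketch, assuming natural-log entropies computed from bin counts (details are in Additional file 1):

```python
import numpy as np

def miller_madow_entropy(counts):
    """Miller-Madow bias-corrected entropy (natural log) from bin counts:
    plug-in entropy plus (K - 1) / (2m), where K is the number of occupied
    bins and m the sample size."""
    counts = np.asarray(counts, dtype=float)
    m = counts.sum()
    p = counts[counts > 0] / m
    plug_in = -np.sum(p * np.log(p))
    K = np.count_nonzero(counts)
    return plug_in + (K - 1) / (2.0 * m)
```

The correction term shrinks with sample size, so for large *m* the Miller–Madow and plug-in estimates nearly coincide; for the modest sample sizes typical of expression studies the correction reduces the downward bias of the plug-in estimate.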