minet: A R/Bioconductor Package for Inferring Large Transcriptional Networks Using Mutual Information

Results This paper presents the R/Bioconductor package minet (version 1.1.6) which provides a set of functions to infer mutual information networks from a dataset. Once fed with a microarray dataset, the package returns a network where nodes denote genes, edges model statistical dependencies between genes and the weight of an edge quantifies the statistical evidence of a specific (e.g transcriptional) gene-to-gene interaction. Four different entropy estimators are made available in the package minet (empirical, Miller-Madow, Schurmann-Grassberger and shrink) as well as four different inference methods, namely relevance networks, ARACNE, CLR and MRNET. Also, the package integrates accuracy assessment tools, like F-scores, PR-curves and ROC-curves in order to compare the inferred network with a reference one. Conclusion The package minet provides a series of tools for inferring transcriptional networks from microarray data. It is freely available from the Comprehensive R Archive Network (CRAN) as well as from the Bioconductor website.


Background
Modelling transcriptional interactions by large networks of interacting elements and determining how these interactions can be effectively learned from measured expression data are two important issues in system biology [1]. It should be noted that by focusing only on transcript data, the inferred network should not be considered as a proper biochemical regulatory network, but rather as a gene-to-gene network where many physical connections between macromolecules might be hidden by short-cuts. In spite of some evident limitations the bioinformatics community made important advances in this domain over the last few years [2,3]. In particular, mutual information networks have been succesfully applied to transcriptional network inference [4][5][6]. Such methods, which typically rely on the estimation of mutual information between all pairs of variables, have recently held the attention of the bioinformatics community for the inference of very large networks (up to several thousands nodes) [4,[7][8][9].
R is a widely used open source language and environment for statistical computing and graphics [10] which has become a de-facto standard in statistical modeling, data analysis, biostatistics and machine learning [11]. An important feature of the R environment is that it integrates generic data analysis and visualization functionalities with off-the-shelf packages implementing the latest advances in computational statistics. Bioconductor is an open source and open development software project for the analysis and comprehension of genomic data [12] mainly based on the R programming language. This paper introduces the new R and Bioconductor package minet, where the acronym stands for Mutual Information NETwork inference. This package is freely available on the R CRAN package resource [10] as well as on the Bioconductor website [12].

Mutual information networks
Mutual information networks are a subcategory of network inference methods. The rationale of this family of methods is to infer a link between a couple of nodes if it has a high score based on mutual information [9].
Mutual informaton network inference proceeds in two steps. The first step is the computation of the mutual information matrix (MIM), a square matrix whose i, j-th element is the mutual information between X i and X j , where X i ∈ , i = 1,...,n, is a discrete random variable denoting the expression level of the ith gene. The second step is the computation of an edge score for each pair of nodes by an inference algorithm that takes the MIM matrix as input.
The adoption of mutual information in network inference tasks can be traced back to the Chow and Liu's tree algorithm [13,14]. Mutual information provides a natural generalization of the correlation since it is a non-linear measure of dependency. Hence with mutual information generalized correlation networks (relevance networks [7]) and also conditional independence graphs (e.g. ARACNE [8]) can be built. An advantage of these methods is their ability to deal with up to several thousands of variables also in the presence of a limited number of samples. This is made possible by the fact that the MIM computation requires only estimations of a bivariate mutual information term. Since each bivariate estimation can be computed fastly and is low variant also for a small number of samples, this family of methods is adapted for dealing with microarray data. Note that since mutual information is a symmetric measure, it is not possible to derive the direction of an edge using a mutual information network inference technique. Notwithstanding the orientation of the edges can be obtained by using algorithms like IC which are well known in the graphical modelling community [15].

Relevance Network
The relevance network approach [7] has been introduced in gene clustering and was successfully applied to infer relationships between RNA expressions and chemotherapeutic susceptibility [6]. The approach consists in inferring a genetic network where a pair of genes {X i , X j } is linked by an edge if the mutual information I(X i ; X j ) is larger than a given threshold I 0 . The complexity of the method is O(n 2 ) since all pairwise interactions are considered.
Note that this method does not eliminate all the indirect interactions between genes. For example, if gene X 1 regulates both gene X 2 and gene X 3 , this would cause a high mutual information between the pairs {X 1 , X 2 }, {X 1 , X 3 } and {X 2 , X 3 }. As a consequence, the algorithm will set an edge between X 2 and X 3 although these two genes interact only through gene X 1 .

CLR Algorithm
The CLR algorithm [4] is an extension of the relevance network approach. This algorithm computes the mutual information for each pair of genes and derives a score related to the empirical distribution of the MI values. In particular, instead of considering the information I(X i ; X j ) between genes X i and X j , it takes into account the score where and μ i and σ i are respectively the sample mean and standard deviation of the empirical distribution of the values I(X i , X k ), k = 1,...,n. The CLR algorithm was successfully applied to decipher the E. Coli TRN [4]. CLR has a complexity in O(n 2 ) once the MIM is computed.

ARACNE
The Algorithm for the Reconstruction of Accurate Cellular Networks (ARACNE) [8] is based on the Data Processing Inequality [16]. This inequality states that, if gene X 1 interacts with gene X 3 through gene X 2 , then I(X 1 ; X 3 ) ≤ min (I(X 1 ; X 2 ), I(X 2 ; X 3 )).
ARACNE starts by assigning to each pair of nodes a weight equal to the mutual information. Then, as in relevance networks, all edges for which I(X i ; X j ) <I 0 are removed, with I 0 a given threshold. Eventually, the weakest edge of each triplet is interpreted as an indirect interaction and is removed if the difference between the two lowest weights is above a threshold W 0 . Note that by increasing I 0 the number of inferred edges is decreased while the opposite effect is obtained by increasing W 0 .
If the network is a tree and only pairwise interactions are present, the method guarantees the reconstruction of the original network, once it is provided with the exact MIM. ARACNE's complexity is O(n 3 ) since the algorithm considers all triplets of genes. In [8] the method was able to recover components of the TRN in mammalian cells and outperformed Bayesian networks and relevance networks on several inference tasks [8].

MRNET
MRNET [9] infers a network using the maximum relevance/minimum redundancy (MRMR) feature selection method [17,18]. The idea consists in performing a series of supervised MRMR gene selection procedures where each gene in turn plays the role of the target output.
The MRMR method has been introduced in [17,18] together with a best-first search strategy for performing filter selection in supervised learning problems. Consider a supervised learning task where the output is denoted by Y and V is the set of input variables. The method ranks the set V of inputs according to a score that is the difference between the mutual information with the output variable Y (maximum relevance) and the average mutual information with all the previously ranked variables (minimum redundancy). The rationale is that direct interactions (i.e. the most informative variables to the target Y) should be well ranked whereas indirect interactions (i.e. the ones with redundant information with the direct ones) should be badly ranked by the method. The greedy search starts by selecting the variable X i having the highest mutual information to the target Y. The second selected variable X j will be the one with a high information I(X j ; Y) to the target and at the same time a low information I(X j ; X i ) to the previously selected variable. In the following steps, given a set S of selected variables, the criterion updates S by choosing the variable that maximizes the score where u j is a relevance term and r j is a redundancy term. More precisely, is the mutual information of X j with the target variable Y, and measures the average redundancy of X j to each already selected variables X k ∈ S. At each step of the algorithm, the selected variable is expected to allow an efficient trade-off between relevance and redundancy. It has been shown in [19] that the MRMR criterion is an optimal "pairwise" approximation of the conditional mutual information between any two genes X i and X j given the set S of selected variables I(X i ; X j |S).
The MRNET approach consists in repeating this selection procedure for each target gene by putting Y = X i and V = X \ {X i }, i = 1,...,n, where X is the set of the expression levels of all genes. For each pair {X i , X j }, MRMR returns two (not necessarily equal) scores s i and s j according to (4). The score of the pair {X i , X j } is then computed by taking the maximum of s i and s j . A specific network can then be inferred by deleting all the edges whose score lies below a given threshold I 0 (as in relevance networks, CLR and ARACNE). Thus, the algorithm infers an edge between X i and X j either when X i is a well-ranked predictor of X j (s i > I 0 ) or when X j is a well-ranked predictor of X i (s j > I 0 ).
An effective implementation of the best-first search for quadratic problems is available in [20]. This implementation demands an O(f × n) complexity for selecting f features using a best first search strategy. It follows that MRNET has an O(f × n 2 ) complexity since the feature selection step is repeated for each of the n genes. In other terms, the complexity ranges between O(n 2 ) and O(n 3 ) according to the value of f. In practice the selection of features stops once a variable obtains a negative score.

Implementation of the inference algorithms in minet
All the algorithms discussed above are available in the minet package. The RELNET algorithm is implemented by simply running the command build.mim which returns the MIM matrix which can be considered as a weighted adjacency matrix of the network. CLR, ARACNE and MRNET are implemented by the commands aracne(mim), clr(mim), mrnet(mim) respectively that return a weighted adjacency matrix of the network.
It should be noted, that the modularity of the minet package makes possible to assess network inference methods on similarity matrices other than MIM [21].

Mutual information estimation
An information-theoretic network inference technique aims at identifying connections between two genes (variables) by estimating the amount of information common to any pair of genes. Mutual information is a measure which calculates dependencies between two discrete random variables. An important property of this measure is that it is not restricted to the identification of linear relations between the random variables [16].
If X is a continuous random variable taking values between a and b, the interval [a, b] can be discretized by partitioning it into | | subintervals, called bins, where the symbol denotes the bin index vector. We use also nb(x k ) to denote the number of data points in the kth bin and the symbol to denote the number of samples. If X is a random vector each element X i can be discretized separately into | | bins with index vector .
Let X be a random vector and p a probability measure. The i, j-th element of the mutual information matrix (MIM) is defined by where the entropy of a random variable X is defined as and I(X i ; X j ) is the mutual information between the random variables X i and X j .
Hence, each mutual information calculus demands the estimation of three entropy terms (Eq. 5). A fast entropy estimation is therefore essential for an effective network inference based on MI. Entropy estimation has gained much interest in feature selection and network inference over the last decade [22]. Most approaches focus on reducing the bias inherent to entropy estimation. In this section, some of the fastest and most used entropy estimators are stressed. Other interesting approaches can be found in [22][23][24][25][26].

Empirical and Miller-Madow corrected estimators
The empirical estimator (also called "plug-in", "maximum likelihood" or "naïve", see [23]) is the entropy of the empirical distribution.
Note that, because of the convexity of the logarithmic function, an underestimate of p(x k ) causes an error on H(X = x k ) that is larger than the one given by an overestimation of the same quantity. As a result, entropy estimators are biased downwards, that is It has been shown that the variance of the empirical estimator is upper-bounded by which depends only on the number of samples whereas the asymptotic bias of the estimate depends also on the number of bins | | [23]. As | | &#x226B; m, this estimator can still have a low variance but the bias can become very large [23].
The Miller-Madow correction is then given by the following formula which is the empirical entropy corrected by the asymptotic bias, where | | is the number of bins with non-zero probability. This correction, while adding no computational cost to the empirical estimator, reduces the bias without changing variance. As a result, the Miller-Madow estimator is often preferred to the naive empirical entropy estimator.

Shrink entropy estimator
The rationale of the shrink estimator, [27], is to combine two different estimators, one with low variance and one with low bias, by using a weighting factor λ ∈ [0,1] Shrinkage is a general technique to improve an estimator for a small sample size [3]. As the value of λ tends to one, the estimated entropy is moved toward the maximal entropy (uniform probability) whereas when λ is zero the estimated entropy tends to the value of the empirical one.
Let λ* be the value minimizing the mean square function, see [27], It has been shown in [28] that the optimal λ is given by l l l * arg min ( ( ) ( )) .

The Schurmann-Grassberger Estimator
The Dirichlet distribution can be used in order to estimate the entropy of a discrete random variable. The Dirichlet distribution is the multivariate generalization of the beta distribution. It is also the conjugate prior of the multinomial distribution in Bayesian statistics. More precisely, the density of a Dirichlet distribution takes the following form where β i is the prior probability of an event x i and Γ(·) is the gamma function, (see [25,27,29] for more details).
In case of no a priori knowledge, the β k are assumed to be equal (β k = N, k ∈ ) so as no event becomes more probable than another. Note that using a Dirichlet prior with parameters N is equivalent to adding N ≥ 0 "pseudocounts" to each bin i ∈ . The prior actually provides the estimator the information that | |N counts have been observed in previous experiments. From that viewpoint, | |N becomes the a priori sample size.
The entropy of a Dirichlet distribution can be computed directly with the following equation: with the digamma function.
Various choices of prior parameters has been proposed in the literature [29][30][31]. Schurmann and Grassberger have proposed the prior [32] that has been retained in the package.

Implementation of estimators in minet
The mutual information matrix is estimated by using the function build.mim(dataset, estimator). This function returns a matrix of paired mutual informations computed in nats (base e) and takes two arguments: 1. the data frame dataset which stores the gene expression dataset or a generic dataset where columns contain variables/features and rows contain outcomes/samples 2. the string mi, that denotes the routine used to perform mutual information estimator.

Discretization methods
All the estimators discussed in the previous section have been designed for discrete variables. If the random variable X is continuous and takes values comprised between a and b, it is then required to partition the interval [a, b] into | | sub-intervals in order to adopt a discrete entropy estimator. The two most used discretizing algorithm are the equal width and the equal frequency quantization. These are explained in the next sections. Other discretization methods can be found in [33][34][35].

Equal Width
The principle of the equal width discretization is to divide the range [a i , b i ] of each variable X i , i ∈ {1, 2,...,n} in the dataset into | | sub-intervals of equal size: . Note that an ε is added in the last interval in order to include the maximal value in one of the | | bins. This discretization scheme has a O(m) complexity cost (by variable).

Global Equal Width
The principle of the global equal width discretization is the same as the equal width (Sec. 3.1) except that the considered range [a, b] is not the range of each random variable such as in Sec. 3.1 but the range of the random vector composed of all the variables in the dataset. In other words, a and b are respectively the minimal and the maximal value of the dataset.

Equal Frequency
The equal frequency discretization scheme consists in partitioning the range [a i , b i ] of each variable X i in the dataset into | | intervals, each having the same number m/| | of data points points. As a result, the size of each interval can be different. Note that if the | | intervals have equal frequencies, the computation of entropy is straightforward: it is log . However, there can be more than m/ | | identical values in a vector of measurements. In such case, one of the bins will be more dense than the others and the resulting entropy will be different of log . It should be noted that this discretization is reported in some papers as one of the most efficient method (e.g. for naive Bayes classification) [35].

Implementation of discretization strategies in minet
The discretization is performed in minet by the function discretize(dataset, disc = "equalfreq", nbins = sqrt(nrow(dataset))) where • dataset is the dataset to be discretized • disc is a string which can take three values: "equal freq" "equalwidth" "globalequal width"(default is " equalfreq").
• nbins, the number of bins to be used for discretization, which is by default set to with m is the number of samples [35]. Note that there are functions used by the built-in R hist() function that can be used here such as nclass. FD(dataset), nclass. scott(data set) and nclass. Sturges(dataset).

Assessment of the network inference algorithm
A network inference problem can be seen as a binary decision problem where the inference algorithm plays the role of a classifier: for each pair of nodes, the algorithm either returns an edge or not. Each pair of nodes can thus be assigned a positive label (an edge) or a negative one (no edge).
A positive label (an edge) predicted by the algorithm is considered as a true positive (TP) or as a false positive (FP) depending on the presence or not of the corresponding edge in the underlying true network, respectively. Analogously, a negative label is considered as a true negative (TN) or a false negative (FN) depending on whether the corresponding edge is present or not in the underlying true network, respectively. Note that all mutual information network inference methods use a threshold value in order to delete the arcs having a too low score. Hence, for each treshold value, a confusion matrix can be computed.

ROC curves
The false positive rate is defined as and the true positive rate as also known as recall or sensitivity.
A Receiver Operating Characteristic (ROC) curve, is a graphical plot of the TPR (true positive rate) vs. FPR (false positive rate) for a binary classifier system as the threshold is varied [36].

PR curves
It is generally recommended [37] to use receiver operator characteristic (ROC) curves when evaluating binary decision problems in order to avoid effects related to the chosen threshold. However, ROC curves can present an overly optimistic view of an algorithm's performance if there is a large skew in the class distribution, as typically encountered in transcriptional network inference because of sparseness. To tackle this problem, precision-recall (PR) curves have been cited as an alternative to ROC curves [38].
Let the precision quantity measure the fraction of real edges among the ones classified as positive and the recall quantity also know as true positive rate (TPR), denote the fraction of real edges that are correctly inferred. These quantities depend on the threshold chosen to return a binary decision. The PR curve is a diagram which plots the precision (p) versus recall (r) for different values of the threshold on a two-dimensional coordinate system.

F-Scores
Note that a compact representation of the PR diagram is returned by the maximum and/or the average of the Fscore quantity [39]: which is an harmonic average of precision and recall.
The general formula for non-negative real β is: where β is a parameter denoting the weight of the recall.
Two commonly used F-scores are the F 2 -measure, which weights recall twice as much as precision, and the F 0.5measure, which weights precision twice as much as recall.
In transcriptional network inference, precision is often a more desirable feature than recall since it is expensive to investigate if a gene regulates another.

Assesment functionalities in minet
In order to benchmark the inference methods, the package provides a number of assessment tools. The vali date(net, ref.net, steps = 50) function allows to compare an inferred network net to a reference network ref.net, described by a Boolean adjacency matrix. The assessment process consists in removing the inferred edges having a score below a given threshold and in computing the related confusion matrix, for steps thresholds ranging from the minimum to the maximum value of edge weigths. A resulting dataframe table containing the list of all the steps confusion matrices is returned and made available for further analysis.
In particular, the function pr( allow the user to plot PR-curves and ROC-curves respectively ( Figure 3) from a list of confusion matrices.

Example
Once the R platform is launched, the package, its description and its vignette can be loaded using the following commands: library(minet) library(help = minet) vignette("minet") A demo script (demo(demo)) shows the main functionalities of the package that we describe in the following.
In order to infer a network with the minet package, four steps are required: • data discretization, • MIM computation, • network inference, • normalization of the network (optional).
The main function of the package is minet which sequentially executes the four steps mentioned above, see Figure  1).
The function minet(dataset, method, estima tor, disc, nbins) takes the following arguments: dataset, a matrix or a dataframe containing the microarray data, method, the inference algorithm (such as ARACNE, CLR or MRNET), estimator, the entropy estimator used for the computation of mutual information (empirical, Miller-Madow, shrink, Schurmannn-Grassberger), disc the binning algorithm (i.e. equal frequency or equal size interval) and the parameter nbins which sets the number of bins to use. The final step of the minet function is the normalization using the norm(net) function. This step normalizes all the weights of the inferred adjancy matrix between 0 and 1. Hence, the minet function returns the inferred network as a weighted adjacency matrix with values ranging from 0 to 1 where the higher is a weight, the higher is the evidence that a gene-gene interaction exists.
For demo purposes the package makes available also the dataset syn.data representing the expression of 50 genes in 100 experiments. This dataset has been synthetically generated from the network syn.net using the microarray data generator Syntren [40]. This dataset can be loaded with data(syn.data) and the corresponding original network with data(syn.net).

res<-norm(net)
In order to plot a PR-curve (see Figure 3), the functions show.pr and validate can be used. In order to display the inferred network, the Rgraphviz package [41] can be used with the following commands (see Fig. 2):

library(Rgraphviz)
The four steps in the minet function (discretization disc, mutual information matrix build.mim, inference mrnet, aracne, clr and normalization norm Figure 1 The four steps in the minet function (discretization disc, mutual information matrix build.mim, inference mrnet, aracne, clr and normalization norm. Precision-Recall curves plotted with show.pr(table) Figure 3 Precision-Recall curves plotted with show.pr (table).
Graph generated with minet and plotted with Rgraphviz Figure 2 Graph generated with minet and plotted with Rgraphviz.
graph <-as(res, "graphNEL") plot(graph) Note that, for the sake of computational efficiency, all the inference functions as well as the entropy estimators are implemented in C++. As a reference, a network of five hundreds variables may be inferred in less than one minute on an Intel Pentium 4 with 2 Ghz and 512 DDR SDRAM.

Conclusion
Transcriptional network inference is a key issue toward the understanding of the relationships between the genes of an organism. Notwithstanding, few public domain tools are available once a thourough comparison of existing approaches is at stake. A new R/Bioconductor package, freely available, has been introduced in this paper. This package makes available to biologists and bioinformatics practicioneers a set of tools to infer networks from microarray datasets with a large number (several thousands) of genes. Four information-theoretic methods of network inference (i.e. Relevance Networks, CLR, ARACNE and MRNET), four different entropy estimators (i.e. empirical, Miller-Madow, Schurmann-Grassberger and shrink) and three validation tools (i.e. F-scores, PR curves and ROC curves) are implemented in the package. We deem that this tool is an effective answer to the increasing need of comparative tools in the growing domain of transcriptional network inference from expression data.
This work was partially funded by the Communauté Française de Belgique under ARC grant no. 04/09-307. The authors thank their collegue Catharina Olsen for her appreciable comments, suggestions and testing of package functionalities. The authors also thank Korbinian Strimmer as well as the reviewers for their useful comments on the package and the paper.