Cellular phenotypes are determined by the dynamical activity of large networks of co-regulated genes. Thus dissecting the mechanisms of phenotypic selection requires elucidating the functions of the individual genes in the context of the networks in which they operate. Because gene expression is regulated by proteins, which are themselves gene products, statistical associations between gene mRNA abundance levels, while not directly proportional to activated protein concentrations, should provide clues towards uncovering gene regulatory mechanisms. Consequently, the advent of high throughput microarray technologies to simultaneously measure mRNA abundance levels across an entire genome has spawned much research aimed at using these data to construct conceptual "gene network" models to concisely describe the regulatory influences that genes exert on each other.

Genome-wide clustering of gene expression profiles [1] provides an important first step towards this goal by grouping together genes that exhibit similar transcriptional responses to various cellular conditions, and are therefore likely to be involved in similar cellular processes. However, the organization of genes into co-regulated clusters provides a very coarse representation of the cellular network. In particular, it cannot separate statistical interactions that are irreducible (i.e., direct) from those arising from cascades of transcriptional interactions that correlate the expression of many noninteracting genes. More generally, as appreciated in statistical physics, long range order (i.e., high correlation among non-directly interacting variables) can easily result from short range interactions [2]. Thus correlations, or *any other* local dependency measure, cannot be used as the only tool for the reconstruction of interaction networks without additional assumptions.

Within the last few years a number of sophisticated approaches for the reverse engineering of cellular networks (also called deconvolution) from gene expression data have emerged (reviewed in [3]). Their goal is to produce a high-fidelity representation of the cellular network topology as a graph, where genes are represented as vertices and are connected by edges representing direct regulatory interactions. The criteria for defining an edge, as well as its biological interpretation, remain imprecise and vary between applications. For example, graphical modeling [4] defines edges as parent-child relationships between mRNA abundance levels that are most likely to explain the data, integrative methods [5] use independent experimental clues to define edges as those showing evidence of physical interactions, and other statistical/information theoretical methods [6] identify edges with the strongest statistical associations between mRNA abundance levels. All available approaches suffer to various degrees from problems such as overfitting, high computational complexity, reliance on non-realistic network models, or a critical dependency on supplementary data that is only available for simple organisms. These limitations have relegated the successful large-scale application of most methods to relatively simple organisms, such as the yeast *Saccharomyces cerevisiae*, and the genome-wide deconvolution of a mammalian network is yet to be reported.

Here we introduce *ARACNE* (Algorithm for the Reconstruction of Accurate Cellular Networks), a novel information-theoretic algorithm for the reverse engineering of transcriptional networks from microarray data that overcomes some of these limitations. ARACNE defines an edge as an irreducible statistical dependency between gene expression profiles that cannot be explained as an artifact of other statistical dependencies in the network. We suggest that the presence of such irreducible statistical dependencies is likely to identify direct regulatory interactions mediated by a transcription factor binding to a target gene's promoter region, although other types of interactions may also be identified (see Discussion). In this study we focus on the former type of interaction for validation purposes. We demonstrate that ARACNE compares favorably with existing methods and achieves extremely low error rates in identifying transcriptional interactions in a synthetic dataset modeled using realistic Hill kinetics. In a biological context, we show that the algorithm infers bona-fide transcriptional targets in a mammalian gene network. We also study the effects of misestimation of mutual information (MI) on network reconstruction, and show that algorithms based on MI ranking are resilient to estimation errors. The algorithm is general enough to deal with a variety of other network reconstruction problems in biological, social, and engineering fields.

### Theoretical Background

Several factors have impeded the reliable reconstruction of genome-wide mammalian networks. First, temporal gene expression data is difficult to obtain for higher eukaryotes, and cellular populations harvested from different individuals generally capture random steady states of the underlying biochemical dynamics. This precludes the use of methods that infer temporal associations and thus plausible causal relationships (reviewed in [7]). Only steady state statistical dependences can be studied, which are not obviously linked to the underlying physical dependency model. Compounding this constraint, there is no universally accepted definition of statistical dependencies in the multivariate setting [8, 9]. In this work we adopt the definition of [9], which builds on ideas from the Markov networks literature [10]. Briefly, we write the joint probability distribution (JPD) of the stationary expressions of all genes, *P*({*g*
_{
i
}}), *i* = 1,..., *N*, as:

where *N* is the number of genes, *Z* is the normalization factor, also called the *partition function*, *φ*... are *potentials*, and *H*({*g*
_{
i
}}) is the *Hamiltonian* that defines the system's statistics. Within such a model, we assert that a set of variables interacts if and only if (*iff*) the single potential that depends exclusively on these variables is nonzero. ARACNE aims precisely at identifying which of these potentials are nonzero, and eliminating the others even though their corresponding marginal JPDs may not factorize. While this representation is not directly used by the algorithm, it helps precisely formalize our definition of interaction and the class of irreducible dependencies that it will help elucidate.

Note that Eq. (1) does not define the potentials uniquely, and additional constraints are needed to avoid the ambiguity (see Appendix B). A reasonable approach is to specify *φ*... using maximum entropy approximations [9, 11] to *P*(*g*
_{1},..., *g*
_{
N
}) consistent with known marginals, so that constraining an *n*-way marginal defines its corresponding potential. We refer the reader to [9] for details.

### Approximations of the interaction structure

Since typical microarray sample sizes are relatively small, inferring the exponential number of potential *n*-way interactions of Eq. (1) is infeasible and a set of simplifying assumptions must be made about the dependency structure. Eq. (1) provides a principled and controlled way to introduce such approximations. The simplest model is one where genes are assumed independent, i.e., *H*({*g*
_{
i
}}) = ∑*φ*(*g*
_{
i
}), such that first-order potentials can be evaluated from the marginal probabilities, *P*(*g*
_{
i
}), which are estimated from experimental observations. As more data become available we should be able to reliably estimate higher order marginals and incorporate the corresponding potentials progressively, such that for *M* → ∞ (where *M* is sample set size) the complete form of the JPD is restored. In fact, *M* > 100 is generally sufficient to estimate 2-way marginals in genomics problems, while *P*(*g*
_{
i
}, *g*
_{
j
}, *g*
_{
k
}) requires about an order of magnitude more samples. Thus the current version of ARACNE truncates Eq. (1) at the pairwise interactions level,
. Within this approximation, all genes for which *φ*
_{
ij
}= 0 are declared mutually non-interacting. This includes genes that are statistically independent (i.e., *P*(*g*
_{
i
}, *g*
_{
j
}) ≈ *P*(*g*
_{
i
})*P*(*g*
_{
j
})), as well as genes that do not interact directly but are statistically dependent due to their interaction via other genes (i.e. *P*(*g*
_{
i
}, *g*
_{
j
}) ≠ *P*(*g*
_{
i
})*P*(*g*
_{
j
}), but *φ*
_{
ij
}= 0). We note that *P*(*g*
_{
i
}, *g*
_{
j
}) = *P*(*g*
_{
i
})*P*(*g*
_{
j
}) is not a sufficient condition for *φ*
_{
ij
}= 0. We discuss this below.

Since the number of potential pairwise interactions is quadratic in *N*, identification of indirect statistical interactions is a formidable challenge for all network reconstruction algorithms that rely on statistical associations. However, under certain biologically realistic assumptions about the network topology, the ARACNE algorithm provides a framework to reconstruct two-way interaction networks reliably from a finite number of samples in a computationally feasible time.