Large scale statistical inference of signaling pathways from RNAi and microarray data

Background The advent of RNA interference techniques enables the selective silencing of biologically interesting genes in an efficient way. In combination with DNA microarray technology this enables researchers to gain insights into signaling pathways by observing downstream effects of individual knock-downs on gene expression. These secondary effects can be used to computationally reverse engineer features of the upstream signaling pathway. Results In this paper we address this challenging problem by extending previous work by Markowetz et al., who proposed a statistical framework to score networks hypotheses in a Bayesian manner. Our extensions go in three directions: First, we introduce a way to omit the data discretization step needed in the original framework via a calculation based on p-values instead. Second, we show how prior assumptions on the network structure can be incorporated into the scoring scheme using regularization techniques. Third and most important, we propose methods to scale up the original approach, which is limited to around 5 genes, to large scale networks. Conclusion Comparisons of these methods on artificial data are conducted. Our proposed module network is employed to infer the signaling network between 13 genes in the ER-α pathway in human MCF-7 breast cancer cells. Using a bootstrapping approach this reconstruction can be found with good statistical stability. The code for the module network inference method is available in the latest version of the R-package nem, which can be obtained from the Bioconductor homepage.


Background
In the modern field of systems biology scientists aim to get insights into the architecture and behavior of complex cellular and genomic processes. An important task in this context is the detection of novel interdependencies between gene products. This insight into the genomic networks is an important step towards a better understanding of the functional aspects of a biological system and of great value for drug target identification at a later stage. Within this context modern DNA microarray technology plays an important role. In addition, the advent of RNA silencing techniques has further increased its power by allowing the selective knock-down of certain genes of interest. This may enable us to detect interdependencies between gene products on a non-transcriptional level. The genes of interest are knocked down individually, and the respective downstream effects on gene expression are measured by using genome-wide microarrays. By observing the nested structure of significant up or down regulations of affected genes, this may allow one to reverse engineer features of the upstream signaling pathway [1]. In a recent work Markowetz et al. [2] introduced a method to reverse engineer the signaling pathway between perturbed genes using the nested structure of secondary downstream effects. They developed a Bayesian statistical framework, in which for a given network hypothesis one can calculate a score and thus can reduce the set of all possible networks to the most likely ones. A severe limitation of this method lies, however, in the restriction to small networks of up to 5 genes, because the method completely enumerates all possible network hypotheses. Furthermore, a difficulty in the practical use is the required binary discretization of the data ("secondary effect present/not present").
In our work we therefore aim to extend the framework by Markowetz et al. in order to make it practically applicable for a broader range of real life problems. We are thereby motivated by biological experiments conducted in our department: 13 genes in the ER-α pathway in human MCF-7 breast cancer cells were silenced via small interfering RNAs and the effects on gene expression were subsequently measured on cDNA microarrays. Our extensions of the original approach go in three directions: First, we introduce a way to omit the data discretization step needed via a calculation based on p-values instead, which is more suitable for our data and makes the whole framework more flexible (generalized inference framework). Second, we show how prior assumptions on the network structure can be incorporated into the network scoring scheme via techniques from regularization theory [3]. Third and most important, we develop and investigate methods to scale up the network inference to large scale networks. For this purpose two approaches are considered: simulated annealing on a restricted set of possible networks and our so-called module networks, which build the complete network recursively from smaller pieces that are connected subsequently. In order to validate these approaches we conduct studies on artificially created networks and show that module networks offer the highest sensitivity and specificity in the reconstruction of edges in the networks. Finally, we demonstrate the applicability of our approach to real data by inferring the complete 13 genes ER-α signaling pathway network. Using a bootstrapping approach this reconstruction can be found with good statistical stability and hence seems to be reliable.

Statistical Inference Framework for Signaling Pathways from RNAi Data
We start with a brief review of the statistical inference framework for signaling pathways by Markowetz et al.: In general within this framework one distinguishes between silenced genes (S-genes) and genes showing a downstream effect (E-genes). It is assumed that each E-gene is attached to a single S-gene only ( Figure 1). Knocking down a specific S-gene S k interrupts signal flow in the downstream pathway, and hence an effect on the E-genes attached to S k and all S-genes depending on S k is expected. Let us assume n knock-downs are performed and there exist m E-genes in total. The outcomes of these experiments are summarized in an m × n data matrix D. According to Bayes' formula a specific network hypothesis Φ ∈ {0,1} n × {0,1} n can be scored as: The position of the E-genes is introduced as a model parameter Θ = {θ i |θ i ∈ {1,..., n}, i = 1,..., m}, i.e θ i = j, if Egene i is attached to S-gene j. Assuming independence of the observations (rows) D i in the data matrix D (given a fixed network hypothesis Φ and model parameters Θ) one can write down the conditional likelihood P(D)|Φ, Θ) as: It is furthermore assumed that all parameters θ i are statistically independent, i.e.
The likelihood P(D|Φ) can now be written as: Please note that the edges in network Φ can either represent transcriptional regulation events or phosphorilation or post-translational effects, as we reconstruct the signal flow in the network based on the nested structure of the measured effects. The effects on the E-genes that are measured are transcriptional effects, which are ultimately regulated by transcription factors. Some E-genes may be regulated by kinases, as due to the inherent nature of microarray measurements, it is impossible to distinguish between direct and indirect effects.

Our Approach Generalized Inference Framework
In their original work Markowetz et al. suppose the data matrix D to consist of counts, how often a specific E-gene shows an effect in ᐍ experiment repetitions. This requires a data discretization step, for which user specified type-I and type-II error rates are assumed. The choice of these parameters is certainly critical for the inference procedure, because it directly influences (5) and appears to be difficult to estimate. Markowetz et al. suppose to have both, positive and negative controls (pathway stimulated/not stimulated) for this procedure, which for our data is not available (see Section "Methods"). In contrast, in our approach we make the assumption that D is an m × n matrix of (raw) p-values, which specify the likelihood of Egene i being differentially expressed after knock-down of S-gene k. The p-values are calculated using a method for detecting differential gene expression, e.g. limma [4]. This way various experimental designs, including dye swaps, on arbitrary chip platforms can be used in a simple manner.
We now suppose a decomposition of P(D i |Φ,θ i ) as follows: In accordance to [2] this makes the assumption that knock-down experiments are statistically independent from each other. Hence, Eq. (5) can be written down as The only thing missing is the definition of P(D ik |Φ,θ i ). For this purpose we suppose the D ik to be drawn from a mixture of a uniform [0, 1] distribution reflecting the null hypothesis and another distribution f 1 reflecting the alternative hypothesis [5][6][7]: Under the alternative hypothesis there is a high density for small p-values and a strong decrease for increasing p-values. Both distributions overlap with mixing coefficient γ k . P(D ik |Φ,θ i ) can therefore be decomposed as: Main idea of the inference framework by Markowetz et al Figure 1 Main idea of the inference framework by Markowetz et al.: A network hypothesis is a directed graph between S-genes. Attached to each S-gene are several E-genes. Knocking down S-gene S 2 interrupts signal flow in the downstream pathway, and hence an effect of E-genes attached to S 2 and to S 1 is expected.
The density function f 1 reflects the strength of the knockdown effect on E-gene i under the alternative hypothesis. If it is greater than 1 the alternative hypothesis would be accepted, and if it is smaller than 1 rejected. In this work we assume f 1 to be a mixture of a Beta(1, β k ) distribution (β k Ŭ 2) and a small uniform component: In practice we set π k = 0.01 and tuned the parameter β k on the full distribution of raw p-values for knock-down experiment k (26709 genes) such that f 1 (D ik ) > 1, if the Benjamini-Hochberg false discovery rate [8] for D ik was ≤ 10% and f 1 (D ik ) ≤ 1 otherwise. An alternative treatment using a fitting procedure with Expectation Maximization [9] is described in our recent publication [10].

Regularization
Eq. (1) allows one to specify a prior P(Φ) on the network structure itself. This can be thought of as biasing the score of possible network hypotheses towards prior knowledge. It is crucial to understand that in principle in any inference scheme there exist two competing goals: Belief in prior assumptions/prior knowledge versus belief in data. Only trusting the data itself may lead to overfitting, whereas only trusting in prior assumptions does not give any new information and prevents learning. Therefore, we need a trade-off between both goals. This technique is known as regularization in the machine learning literature [3,11]. We have to take into account at this point that our assumptions may only be true up to a certain degree. Hence, for each edge we should suppose a prior probability reflecting the degree of belief in its existence. In principle, this degree of belief can be very different for each edge. We summarize all prior edge probabilities in an n × n matrix . Making the assumption that all edge priors P(Φ ij ) are independent, i.e. allows us to define the connection between Φ ij and for each edge separately. Note that Φ ij ∈ {0,1} depending on whether we set the edge i → j or not. Hence, for each edge we have a certain difference to our prior assumptions. The smaller this difference, the higher P(Φ ij ) should be. We can therefore model P(Φ ij ) as a Laplacian distribution with parameter λ : If we now write down the log-posterior of Eq. (1) we see that λ specifies the trade-off between the model's fit to our data and our prior assumptions. An important special case of the latter would be = 0, i.e. the matrix consisting only of zeros. The meaning of this prior would be to generally prefer sparse networks structures over dense ones. Setting λ → ∞ corresponds to completely trusting the prior, whereas λ = 0 leads to a maximum likelihood estimate, i.e. complete trust in data. In practice we would like to choose a λ balancing both goals. This leads to an instance of the classical model selection problem (e.g. [12]) in statistical learning. One way of dealing with it is to tune λ such that the Akaike information criterion (AIC) (15) becomes minimal [12]. Here d(λ, Φ opt ) denotes the number of free parameters (i.e. the number of unknown edges) in the network structure Φ opt optimizing (14).

Large Scale Network Inference
The inference framework does not answer the question how to come up with a candidate network topology, which we would like to score. Markowetz et al. [2] completely enumerate all possible topologies. This is, however, only suitable for small networks of up to 5 S-genes. For 5 S-genes there already exist more than 1,000,000 and for 10 genes more than 10 27 possible network topologies. In this context it should be noted that the scoring scheme cannot distinguish between two network hypotheses, if they only differ in transitive edges. This issue is known as prediction equivalence. Hence, it only makes sense to consider the set of all transitively closed network hypotheses. However, restricting ourselves to this limited class of network structures does not generally solve the problem, since even then the number of networks to consider scales in a similar way with the number of S-genes (for 5 genes if predicts an effect otherwise (14) Φ there are already more than 6,000 transitively closed networks to test). Hence, we have to resort to heuristics.

Stochastic Sampling
A quite obvious idea to prevent the computational effort to enumerate all possible network hypothesis is to sample from the set of all transitively closed network graphs randomly. We decided to use simulated annealing (SA) here [13]. SA is rather similar to Markov chain Monte Carlo (MCMC) sampling [14], but additionally makes use of a so-called cooling scheme, which gradually decreases the neighborhood size of a given state in search space. SA has been successfully applied to many difficult optimization problems from various disciplines, including bioinformatics [15,16]. In order to use SA, we have to define a state transition function t : S → S, which defines how to come from one graph to a modified one in search space. A special challenge in this context is that we need to guarantee that in principle all possible transitively closed network topologies can be reached by our function t.
Supposed we have functions add and del, which add and remove edges from a given transitively closed graph and produce a new one from this. We will restrict ourselves to the set of all transitively closed directed graphs (DAGs) here for reasons that will become clear soon. We now define add and del in a formal way as follows:

Definition 1
Let be the set of all transitively closed graphs DAGs with n nodes. Suppose the nodes for G ∈ are indexed in some way by natural numbers. We define add : as a function, which takes a graph G and a pair of node indices (i, j) ∈ ⊂ × , inserts the edge between i, j and transitively closes G then. is defined as the set of all pairs of node indices, for which there exists no edge in G and which after edge insertion do not violate the DAG property. Likewise, del : is a function, which takes a graph G and a pair of node indices (i, j) ∈ ⊂ × . G is transitively reduced, the edge between i, j deleted and then transitively closed again. With a transitive reduction G* of a graph G ∈ we mean a graph with a minimal number of edges such that the transitive closure of G* is G. is defined as the set of all pairs of node indices, for which there exists an edge in G.
In contrast to general graphs the transitive reduction of a DAG is unique [17], which is the reason for our restriction. This way we can guarantee that the del function is well defined and injective. This gives rise to the following lemma:

Lemma 2
The operations add : and del : have the following inverse property to each other: For any G ∈ del(add (G, i, j), i, j) = G, provided that edge i, j does not exist in G and does not violate the DAG property. Likewise, add(del(G, i, j), i, j) = G, provided that edge i, j exists in G.
Proof. The transitive reduction of a transitively closed DAG is unique [17]. Hence, the del operation is a well defined injective function. Additionally note that in the add operation we can never insert an edge, which lies in the transitive hull of G ∈ (and can thus be removed by a Proof. Let 0 ∈ be the DAG without any edges (but with n nodes). Let us denote by X → Y that from DAG X we can get DAG Y using add and del operations on appropriate edges. Note that we have A → 0 and B → 0, because we can use del to remove edges successively until no edges are left.
(The reasons for this property is that del removes edges from the transitive reduction of a DAG, which can thus not be inserted in the following transitive closure any more.) Because add and del are inverse operations to each other according to the lemma, we have 0 → A, 0 → B. Hence, we get A → 0 → B, which proves the theorem. ᮀ Still, the SA approach suffers from a potential problem: Both, the add and the del operation, at the bottom line perform a whole cascade of changes on the original graph. Thus there may be harsh changes in the scoring function when applying such an operation to a given candidate network. This may make it difficult to come close to the optimal network hypothesis.

Module Networks
Rather than looking for a complete network hypothesis at once the idea of the module network is to build up a graph from smaller subgraphs, called modules in the following. The module network is thus a divide and conquer approach: We first split the complete node set into smaller subgroups. This can be done by PAM clustering [18] on the p-value density profiles of the S-genes. The idea is that S-genes with a similar E-gene response profile (here: with regard to the Manhattan distance) should be close in the signaling path. The number of clusters for the PAM clustering is chosen between 2 and half of the number of Sgenes such that the average silhouette index becomes maximal. The silhouette value for each point in a cluster is a measure of how similar that point is to points in its own cluster vs. points in other clusters, and ranges from -1 to +1 [19]. It is defined as: where is the average distance from the i-th point to the other points in its own cluster, and is the average distance from the i-th point to points in another cluster j.
Each cluster of S-genes now forms one module. These modules are eventually further subdivided into smaller submodules until each submodule contains only 4 Sgenes at most. This way we obtain a tree structure of modules, where each node (module) has children (submodules). We begin with estimating the leaves in the module tree. As each leaf module can contain 4 S-genes at maximum this can be done by enumerating all possible transitively closed network hypotheses and taking the highest scoring one. After the leaves in the module tree have been built, their connection is estimated. For this purpose we score all pairwise connections between any pair of S-genes from leaves L 1 and L 2 . Denoting by |L 1 | and |L 2 | the number of S-genes in both leaves, these are 4. |L 1 |·|L 2 | tests altogether, because between any pair of S-genes (n 1 , n 2 ) we can either have no edge, an edge from n 1 to n 2 , an edge from n 2 to n 1 or an edge in both directions. After the best connection between L 1 and L 2 has been estimated, the corresponding subgraph is transitively closed. After all connections between leaves belonging to the same submodule in the module tree have been established, we recursively continue with connecting submodules in the same fashion as we did for leaf modules until the topology for the total network is completed.

Generalized Inference Framework: Proof of Principle
Scores of the top 25 models and the best model Figure 2 Scores of the top 25 models and the best model. employed by Markowetz et al. [2] as a proof of principle with discretized data. The dataset consists of expression profiles from 16 Affymetrix microarrays: 4 genes (tak, rel, key, mkk4/hep) were stimulated by lipopolysaccharide (LPS) for 60 minutes and then knocked-down by RNAi with 2 replicates for each expression profile. Additionally there were 4 replicates of control experiments without LPS and RNAi and 4 replicates of expression profiling with LPS but without RNAi. The dataset is available in a preprocessed form as a supplement of [2].
We took the same 68 genes showing a secondary effect (Egenes) as used in this publication and calculated p-values for differential gene expression between LPS stimulated and knock-down conditions by fitting an empirical Bayes model using the limma package in the R statistical computing environment [4]. We enumerated all possible 355 transitively closed network topologies and calculated their scores using (Eq. 7). The scores of the top 25 models and the best model are depicted in Figure 2. The score distribution of the 25 top models is slightly different, because of our modified inference scheme. We had a closer look at the best 4 models and found them to be identical to those shown in [2] (see also additional file 1). The second best model differs from the best model only in the missing edge key → rel. The next two models are either missing the edge tak → rel or tak → key. The key feature is preserved in all of them: The signal runs through tak before splitting into two pathway branches, one containing mkk4/hep, the other both key and rel. This fits exactly to the findings of Boutros et al. [1].

Large Scale Inference: Evaluation on Artificial Networks
To test our methods and to get better insights into the performance of our large scale inference methods, we generated data from artificial random networks. The sampling procedure for artificial networks is described in Section "Methods". We sampled networks with n = 4, 8,12 Sgenes. For each number of S-genes we varied the number m = 4, 8,..., 4n of E-genes and the parameter β = 10, 50,100 describing the Beta(1, β) component of the f 1 distribution (Eq. 10). We compared the SA approach with the module network. We evaluated both methods in terms of average sensitivity (i.e. ratio of correctly learned edges to total number of edges in the original network) and specificity (i.e. ratio of correctly unconnected genes to total number of unconnected genes in the original network) over 100 generated networks for each parameter combination (n, m, β). The initial temperature for the SA was set to 1000 and the maximum number of iterations to 100n. The initial network structure was always the graph with no edges. A logarithmic temperature cooling scheme according to [20] was used.
The results are shown in Figure 3 - Figure 5. In general all methods achieve a higher specificity than sensitivity, which is due to our "p-value" sampling strategy, and they show a high robustness against a varying number of Egenes. All in all the module network approach shows a superiority to the SA approach, especially in terms of sensitivity. Using module networks the sensitivity and specificity for n = 4 goes up to almost 100%. For n = 8,12 the sensitivity lies around 80%, while the specificity reaches more than 90%. Moreover, for all tested values of β the curves are relatively close together. We also compared the computation times for both approaches and found the module network to be substantially faster for n = 8,12 ( Figure 6). The average running time for network inference with n = 12 nodes was only 4s with the module network on our AMD dual core Opteron 2 GHz machine. In conclusion we think that the module network offers the most reliable and fast mechanism for large scale network inference among our tested approaches and is therefore taken as our inference method in the following section.

Application to RNAi Data from Human ER-α pathway
We applied the module network to infer the complete topology for a network of 13 silenced genes (Table 1) in the ER-α pathway. The 13 genes were selected from previous microarray studies in our department to be influenced by ER status in breast cancer patients. Each of the 13 genes was silenced individually using two different siRNAs, and the effect on gene expression was studied on whole genome cDNA microarrays. The data were generated in our department, details on the data generation and preprocessing steps are described in Section "Methods".
We found several known interdependencies between Eand S-genes as well as among S-genes by an intensive literature screen. The corresponding information was obtained from the Ingenuity™ software and can be visualized in form of a interdependency graph (Figure 7a). It represents some prior knowledge, which can be used for the network inference with our module networks algorithm via the regularization technique (c.f. Section "Regularization").
We considered 3 situations for the network inference: 1. no prior knowledge (complete trust in data), 2. inclusion of known interdependencies between E-and S-genes ( Table 2): For known interdependencies we set P(θ i = j|Φ) To ensure the statistical stability of the inferred network we employed bootstrapping: We sampled m E-genes from the total set of E-genes 50 times with replacement and each time ran the module network for topology induction. At the end we only considered edges, which were found in more than 50% of all bootstrap trials.
was found with high consistency and was also in agreement with the literature network (Figure 7a). A little bit more astonishing was the dependency of AKT1 from either FOX A1 or XPB1, which did, however, fit well to our data (c.f. Figure 8). The rest of the pathway reconstruction varied slightly among our three scenarios, but was in agreement with the data as well as with the literature.

Conclusion
We proposed a method for reconstructing signaling pathways from secondary effects, which were observed on microarray after silencing genes of interest via RNAi. Our approach systematically extends and generalizes previous work by Markowetz et al: An inference scheme was developed, which can deal with p-values for differential gene expression and does not rely on discretized data only. Regularization was employed to incorporate prior assumptions on the network architecture into our framework. Finally, new algorithms for large scale inference of signaling pathways were developed and evaluated in a systematic fashion on artificially created data. Thereby, our module network, which recursively build up the complete Sensitivity (top) and specificity (bottom) analysis for randomly generated networks with n = 8 S-genes: β = 100 (solid), β = 50 (dashed), β = 10 (dotted)
topology from smaller pieces, revealed the best performance in terms of sensitivity and specificity. We used the module network to infer the signaling pathway for 13 genes in the ER-α pathway in human MCF-7 breast cancer cells and used a bootstrapping approach to ensure the statistical stability of the result. The induced edges in our inferred network were found with good consistency and could in many cases be also confirmed by the literature.
Future biological experiments are planned to validate our reconstructed network in a systematic way. In conclusion of our results we think that our approach offers a scaleable, reliable and fairly general way for large scale inference of signaling pathways from secondary effects and therefore provides researchers with a valuable tool to gain insight into complex cellular processes.

Figure 5
Sensitivity (top) and specificity (bottom) analysis for randomly generated networks with n = 12 S-genes: β = 100 (solid), β = 50 (dashed), β = 10 (dotted). Left: simulated annealing, right: module network. Figure 6 Computation times (s) for the module network (white) and the simulated annealing (gray) approach.  Interdepencencies of 13 genes in the ER-α pathway drawn as transitvely reduced graphs: a) literature knowledge (Ingenuity™), b) inferred without prior knowledge, c) inferred with prior knowledge on some E-gene -S-gene connections, d) inferred with additional knowledge from a). Figure b) -d) only show edges, which where found in more than 50% of all bootstrap sets Figure 7 Interdepencencies of 13 genes in the ER-α pathway drawn as transitvely reduced graphs: a) literature knowledge (Ingenuity™), b) inferred without prior knowledge, c) inferred with prior knowledge on some E-gene -S-gene connections, d) inferred with additional knowledge from a). Figure b) -d) only show edges, which where found in more than 50% of all bootstrap sets. The corresponding fraction is reported at each edge.

Data Preprocessing
The microarray data was normalized on probe level using a variance stabilization transformation [21]. We calcu-lated p-values for differential gene expression by fitting an empirical Bayes model using the limma package in the R statistical computing environment [4]. For each knock- Heatmap showing the secondary effects of individual knock-downs (columns) on E-genes (rows) as log-f 1 density (cutoff 0, darker = stronger effect) Figure 8 Heatmap showing the secondary effects of individual knock-downs (columns) on E-genes (rows) as log-f 1 density (cutoff 0, darker = stronger effect). Our method tries to resolve the nested structure of these secondary effects.
down experiment we selected the top 100 ranked genes, which showed an at least 1.5-fold absolute change in expression level in at least one experiment. This gave us m = 94 E-genes altogether, which were the basis for our network inference (see additional file 2).

Generation of Artificial Networks
To get better insights into the performance of our large scale inference methods way we generated data from artificial random networks. This was done as follows: A network topology was created by randomly connecting n signaling genes (S-genes) with q edges. The number q was itself a random number between 1 and 25% of all possible edges. It thus covered extremely sparse up to relatively dense topologies. No loops between a node and itself were allowed. After defining the core topology, the network was transitively closed. Because the Simulated Annealing (SA) method can only deal with Directed Acyclic Graphs (DAGs), in this case we additionally restricted ourselves to randomly generated transitively closed DAGs. After creating the network between S-genes, we attached m E-genes uniform randomly over all S-genes. We then simulated knock-downs of the individual S-genes. For the genes effected by the knock-down (E-genes) "p-values" from the distribution f 1 were sampled (c.f. Subsection "Our Approach"). For those E-genes, where no effects were expected, the "p-values" were drawn uniform randomly from [0,1]. Afterwards, all sampled "p-values" were processed by the f 1 density function.

Implementation
All methods were implemented and computation and testing was performed using the statistical computing environment of R. The implementations of our methods have been integrated in the R package "Nested Effect Models" (nem) together with the original methods by Markowetz et al. [2]. The package and source code is publicly available via the Bioconductor repository.