Large scale statistical inference of signaling pathways from RNAi and microarray data
© Froehlich et al.; licensee BioMed Central Ltd. 2007
Received: 27 February 2007
Accepted: 15 October 2007
Published: 15 October 2007
The advent of RNA interference techniques enables the selective silencing of biologically interesting genes in an efficient way. In combination with DNA microarray technology this enables researchers to gain insights into signaling pathways by observing downstream effects of individual knock-downs on gene expression. These secondary effects can be used to computationally reverse engineer features of the upstream signaling pathway.
In this paper we address this challenging problem by extending previous work by Markowetz et al., who proposed a statistical framework to score networks hypotheses in a Bayesian manner. Our extensions go in three directions: First, we introduce a way to omit the data discretization step needed in the original framework via a calculation based on p-values instead. Second, we show how prior assumptions on the network structure can be incorporated into the scoring scheme using regularization techniques. Third and most important, we propose methods to scale up the original approach, which is limited to around 5 genes, to large scale networks.
Comparisons of these methods on artificial data are conducted. Our proposed module network is employed to infer the signaling network between 13 genes in the ER-α pathway in human MCF-7 breast cancer cells. Using a bootstrapping approach this reconstruction can be found with good statistical stability.
The code for the module network inference method is available in the latest version of the R-package nem, which can be obtained from the Bioconductor homepage.
In the modern field of systems biology scientists aim to get insights into the architecture and behavior of complex cellular and genomic processes. An important task in this context is the detection of novel interdependencies between gene products. This insight into the genomic networks is an important step towards a better understanding of the functional aspects of a biological system and of great value for drug target identification at a later stage. Within this context modern DNA microarray technology plays an important role. In addition, the advent of RNA silencing techniques has further increased its power by allowing the selective knock-down of certain genes of interest. This may enable us to detect interdependencies between gene products on a non-transcriptional level. The genes of interest are knocked down individually, and the respective downstream effects on gene expression are measured by using genome-wide microarrays. By observing the nested structure of significant up or down regulations of affected genes, this may allow one to reverse engineer features of the upstream signaling pathway . In a recent work Markowetz et al.  introduced a method to reverse engineer the signaling pathway between perturbed genes using the nested structure of secondary downstream effects. They developed a Bayesian statistical framework, in which for a given network hypothesis one can calculate a score and thus can reduce the set of all possible networks to the most likely ones. A severe limitation of this method lies, however, in the restriction to small networks of up to 5 genes, because the method completely enumerates all possible network hypotheses. Furthermore, a difficulty in the practical use is the required binary discretization of the data ("secondary effect present/not present").
In our work we therefore aim to extend the framework by Markowetz et al. in order to make it practically applicable for a broader range of real life problems. We are thereby motivated by biological experiments conducted in our department: 13 genes in the ER-α pathway in human MCF-7 breast cancer cells were silenced via small interfering RNAs and the effects on gene expression were subsequently measured on cDNA microarrays. Our extensions of the original approach go in three directions: First, we introduce a way to omit the data discretization step needed via a calculation based on p-values instead, which is more suitable for our data and makes the whole framework more flexible (generalized inference framework). Second, we show how prior assumptions on the network structure can be incorporated into the network scoring scheme via techniques from regularization theory . Third and most important, we develop and investigate methods to scale up the network inference to large scale networks. For this purpose two approaches are considered: simulated annealing on a restricted set of possible networks and our so-called module networks, which build the complete network recursively from smaller pieces that are connected subsequently. In order to validate these approaches we conduct studies on artificially created networks and show that module networks offer the highest sensitivity and specificity in the reconstruction of edges in the networks. Finally, we demonstrate the applicability of our approach to real data by inferring the complete 13 genes ER-α signaling pathway network. Using a bootstrapping approach this reconstruction can be found with good statistical stability and hence seems to be reliable.
Results and discussion
Statistical Inference Framework for Signaling Pathways from RNAi Data
P(Φ|D) ∝ P(D|Φ)P(Φ)
Please note that the edges in network Φ can either represent transcriptional regulation events or phosphorilation or post-translational effects, as we reconstruct the signal flow in the network based on the nested structure of the measured effects. The effects on the E-genes that are measured are transcriptional effects, which are ultimately regulated by transcription factors. Some E-genes may be regulated by kinases, as due to the inherent nature of microarray measurements, it is impossible to distinguish between direct and indirect effects.
Generalized Inference Framework
In their original work Markowetz et al. suppose the data matrix D to consist of counts, how often a specific E-gene shows an effect in ℓ experiment repetitions. This requires a data discretization step, for which user specified type-I and type-II error rates are assumed. The choice of these parameters is certainly critical for the inference procedure, because it directly influences (5) and appears to be difficult to estimate. Markowetz et al. suppose to have both, positive and negative controls (pathway stimulated/not stimulated) for this procedure, which for our data is not available (see Section "Methods"). In contrast, in our approach we make the assumption that D is an m × n matrix of (raw) p-values, which specify the likelihood of E-gene i being differentially expressed after knock-down of S-gene k. The p-values are calculated using a method for detecting differential gene expression, e.g. limma . This way various experimental designs, including dye swaps, on arbitrary chip platforms can be used in a simple manner.
The only thing missing is the definition of P(D ik |Φ,θ i ). For this purpose we suppose the D ik to be drawn from a mixture of a uniform [0, 1] distribution reflecting the null hypothesis and another distribution f1 reflecting the alternative hypothesis [5–7]:
P(D ik ) = γ k + (1 - γ k )·f1(D ik ), γ k ∈ (0,1)
The density function f1 reflects the strength of the knock-down effect on E-gene i under the alternative hypothesis. If it is greater than 1 the alternative hypothesis would be accepted, and if it is smaller than 1 rejected. In this work we assume f1 to be a mixture of a Beta(1, β k ) distribution (β k ≫ 2) and a small uniform component:
f1(D ik ) = π k + (1 - π k )Beta(D ik , 1, β k )
In practice we set π k = 0.01 and tuned the parameter β k on the full distribution of raw p-values for knock-down experiment k (26709 genes) such that f1(D ik ) > 1, if the Benjamini-Hochberg false discovery rate  for D ik was ≤ 10% and f1(D ik ) ≤ 1 otherwise. An alternative treatment using a fitting procedure with Expectation Maximization  is described in our recent publication .
we see that λ specifies the trade-off between the model's fit to our data and our prior assumptions. An important special case of the latter would be = 0, i.e. the matrix consisting only of zeros. The meaning of this prior would be to generally prefer sparse networks structures over dense ones. Setting λ → ∞ corresponds to completely trusting the prior, whereas λ = 0 leads to a maximum likelihood estimate, i.e. complete trust in data. In practice we would like to choose a λ balancing both goals. This leads to an instance of the classical model selection problem (e.g. ) in statistical learning. One way of dealing with it is to tune λ such that the Akaike information criterion (AIC)
AIC(λ, Φ opt ) = -2 log P(D|Φ opt ) + 2d(λ, Φ opt )
becomes minimal . Here d(λ, Φ opt ) denotes the number of free parameters (i.e. the number of unknown edges) in the network structure Φ opt optimizing (14).
Large Scale Network Inference
The inference framework does not answer the question how to come up with a candidate network topology, which we would like to score. Markowetz et al.  completely enumerate all possible topologies. This is, however, only suitable for small networks of up to 5 S-genes. For 5 S-genes there already exist more than 1,000,000 and for 10 genes more than 1027 possible network topologies. In this context it should be noted that the scoring scheme cannot distinguish between two network hypotheses, if they only differ in transitive edges. This issue is known as prediction equivalence. Hence, it only makes sense to consider the set of all transitively closed network hypotheses. However, restricting ourselves to this limited class of network structures does not generally solve the problem, since even then the number of networks to consider scales in a similar way with the number of S-genes (for 5 genes there are already more than 6,000 transitively closed networks to test). Hence, we have to resort to heuristics.
A quite obvious idea to prevent the computational effort to enumerate all possible network hypothesis is to sample from the set of all transitively closed network graphs randomly. We decided to use simulated annealing (SA) here . SA is rather similar to Markov chain Monte Carlo (MCMC) sampling , but additionally makes use of a so-called cooling scheme, which gradually decreases the neighborhood size of a given state in search space. SA has been successfully applied to many difficult optimization problems from various disciplines, including bioinformatics [15, 16]. In order to use SA, we have to define a state transition function t : S → S, which defines how to come from one graph to a modified one in search space. A special challenge in this context is that we need to guarantee that in principle all possible transitively closed network topologies can be reached by our function t.
Supposed we have functions add and del, which add and remove edges from a given transitively closed graph and produce a new one from this. We will restrict ourselves to the set of all transitively closed directed graphs (DAGs) here for reasons that will become clear soon. We now define add and del in a formal way as follows:
Let be the set of all transitively closed graphs DAGs with n nodes. Suppose the nodes for G ∈ are indexed in some way by natural numbers. We define add : as a function, which takes a graph G and a pair of node indices (i, j) ∈ ⊂ ℕ × ℕ, inserts the edge between i, j and transitively closes G then. is defined as the set of all pairs of node indices, for which there exists no edge in G and which after edge insertion do not violate the DAG property. Likewise, del : is a function, which takes a graph G and a pair of node indices (i, j) ∈ ⊂ ℕ × ℕ. G is transitively reduced, the edge between i, j deleted and then transitively closed again. With a transitive reduction G* of a graph G ∈ we mean a graph with a minimal number of edges such that the transitive closure of G* is G. is defined as the set of all pairs of node indices, for which there exists an edge in G.
In contrast to general graphs the transitive reduction of a DAG is unique , which is the reason for our restriction. This way we can guarantee that the del function is well defined and injective. This gives rise to the following lemma:
The operations add : and del : have the following inverse property to each other: For any G ∈ del(add(G, i, j), i, j) = G, provided that edge i, j does not exist in G and does not violate the DAG property. Likewise, add(del(G, i, j), i, j) = G, provided that edge i, j exists in G.
Proof. The transitive reduction of a transitively closed DAG is unique . Hence, the del operation is a well defined injective function. Additionally note that in the add operation we can never insert an edge, which lies in the transitive hull of G ∈ (and can thus be removed by a transitive reduction), since otherwise it would have been there already (because consists of transitively closed networks only). Therefore, for any G ∈ del(add(G, i, j), i, j) = G, provided that edge i, j does not exist in G and does not violate the DAG property. Likewise, add(del(G, i, j), i, j) = G, provided that edge i, j exists in G. □
From any graph A ∈ in the set of all transitively closed DAGs we can reach any other graph B ∈ using add and del operations.
Proof. Let 0 ∈ be the DAG without any edges (but with n nodes). Let us denote by X → Y that from DAG X we can get DAG Y using add and del operations on appropriate edges. Note that we have A → 0 and B → 0, because we can use del to remove edges successively until no edges are left. (The reasons for this property is that del removes edges from the transitive reduction of a DAG, which can thus not be inserted in the following transitive closure any more.) Because add and del are inverse operations to each other according to the lemma, we have 0 → A, 0 → B. Hence, we get A → 0 → B, which proves the theorem. □
Still, the SA approach suffers from a potential problem: Both, the add and the del operation, at the bottom line perform a whole cascade of changes on the original graph. Thus there may be harsh changes in the scoring function when applying such an operation to a given candidate network. This may make it difficult to come close to the optimal network hypothesis.
where is the average distance from the i-th point to the other points in its own cluster, and is the average distance from the i-th point to points in another cluster j.
Each cluster of S-genes now forms one module. These modules are eventually further subdivided into smaller submodules until each submodule contains only 4 S-genes at most. This way we obtain a tree structure of modules, where each node (module) has children (submodules). We begin with estimating the leaves in the module tree. As each leaf module can contain 4 S-genes at maximum this can be done by enumerating all possible transitively closed network hypotheses and taking the highest scoring one. After the leaves in the module tree have been built, their connection is estimated. For this purpose we score all pairwise connections between any pair of S-genes from leaves L1 and L2. Denoting by |L1| and |L2| the number of S-genes in both leaves, these are 4. |L1|·|L2| tests altogether, because between any pair of S-genes (n1, n2) we can either have no edge, an edge from n1 to n2, an edge from n2 to n1 or an edge in both directions. After the best connection between L1 and L2 has been estimated, the corresponding subgraph is transitively closed. After all connections between leaves belonging to the same submodule in the module tree have been established, we recursively continue with connecting submodules in the same fashion as we did for leaf modules until the topology for the total network is completed.
Generalized Inference Framework: Proof of Principle
To show the correctness of our generalized inference framework, we conducted experiments on the Drosophila dataset by Boutros et al. [I]. This dataset was also employed by Markowetz et al.  as a proof of principle with discretized data. The dataset consists of expression profiles from 16 Affymetrix microarrays: 4 genes (tak, rel, key, mkk4/hep) were stimulated by lipopolysaccharide (LPS) for 60 minutes and then knocked-down by RNAi with 2 replicates for each expression profile. Additionally there were 4 replicates of control experiments without LPS and RNAi and 4 replicates of expression profiling with LPS but without RNAi. The dataset is available in a preprocessed form as a supplement of .
Large Scale Inference: Evaluation on Artificial Networks
To test our methods and to get better insights into the performance of our large scale inference methods, we generated data from artificial random networks. The sampling procedure for artificial networks is described in Section "Methods". We sampled networks with n = 4, 8,12 S-genes. For each number of S-genes we varied the number m = 4, 8,..., 4n of E-genes and the parameter β = 10, 50,100 describing the Beta(1, β) component of the f1 distribution (Eq. 10). We compared the SA approach with the module network. We evaluated both methods in terms of average sensitivity (i.e. ratio of correctly learned edges to total number of edges in the original network) and specificity (i.e. ratio of correctly unconnected genes to total number of unconnected genes in the original network) over 100 generated networks for each parameter combination (n, m, β). The initial temperature for the SA was set to 1000 and the maximum number of iterations to 100n. The initial network structure was always the graph with no edges. A logarithmic temperature cooling scheme according to  was used.
Application to RNAi Data from Human ER-α pathway
Differential genes in complete dataset and among E-genes: The first column shows the number of all genes with Benjamini-Hochberg false discovery rate  ≤ 10%; the second column the number of all genes with at least 1.5-fold disregulation; the third column the number of all genes with f1-density > 1 (effected genes). The last two columns show statistics among the selected E-genes (see Methods Section): the number of E-genes with false discovery rate ≤ 10% and with f1-density > 1
fdr ≤ 10%
≥ 1.5-fold disreg.
f1 > 1
fdr ≤ 10%(E-genes)
f1 > 1(E-genes)
Known interdependencies between S-genes and E-genes
ARL6IP, XBP 1
ARL6IP, XBP 1
ARL6IP, XBP 1
ARL6IP, XBP 1
To ensure the statistical stability of the inferred network we employed bootstrapping: We sampled m E-genes from the total set of E-genes 50 times with replacement and each time ran the module network for topology induction. At the end we only considered edges, which were found in more than 50% of all bootstrap trials.
We proposed a method for reconstructing signaling pathways from secondary effects, which were observed on microarray after silencing genes of interest via RNAi. Our approach systematically extends and generalizes previous work by Markowetz et al: An inference scheme was developed, which can deal with p-values for differential gene expression and does not rely on discretized data only. Regularization was employed to incorporate prior assumptions on the network architecture into our framework. Finally, new algorithms for large scale inference of signaling pathways were developed and evaluated in a systematic fashion on artificially created data. Thereby, our module network, which recursively build up the complete topology from smaller pieces, revealed the best performance in terms of sensitivity and specificity. We used the module network to infer the signaling pathway for 13 genes in the ER-α pathway in human MCF-7 breast cancer cells and used a bootstrapping approach to ensure the statistical stability of the result. The induced edges in our inferred network were found with good consistency and could in many cases be also confirmed by the literature. Future biological experiments are planned to validate our reconstructed network in a systematic way. In conclusion of our results we think that our approach offers a scale-able, reliable and fairly general way for large scale inference of signaling pathways from secondary effects and therefore provides researchers with a valuable tool to gain insight into complex cellular processes.
The code for the module network inference method is available in the latest version of the -R-package nem, which can be obtained from the Bioconductor homepage (see additional file 3).
RNAi knockdown and microarray experiments
RNAi knock down experiments were conducted on 13 S-genes (Table 1), which were supposedly connected in signaling pathways in human MCF-7 breast cancer cells (ATCC, Manassas, VA). These cells were cultured in Gibco MEM medium with phenol red supplemented with 10% fetal bovine serum (FBS), 50 μ g/ml streptomycin, 50 U/ml penicillin, 1% MEM non essential amino acids (100×) and 100 μ g/ml insulin bovine (all reagents provided by Invitrogen).
Cells were split every 3–4 days to ensure exponential growth. MCF-7 cells were transiently transfected with at least two different chemically synthesized small interfering (si)RNA (50 nM) against one of the 13 genes in order to minimize off-target effects. Control silencing was done in the same experiment using control (non-silencing) siRNA (50 nM). All used siRNAs were provided by Qiagen (Hilden, Germany). Transfection was performed in antibiotic-free medium according to Qiagen HiPerFect standard transfection protocol. Therefore 1 × 104 cells/well were seeded onto a 96 well plate 24 h prior to transfection. After preincubation transfection was carried on for 42 h and total RNA was isolated using RNeasy mini prep Kit. Every knock-down experiment was performed in 2–4 independent replicates, and the mRNA level of each targeted gene was measured using qRT-PCR. Only experiments showing more than 70% silencing of the mRNA of interest were used for following studies. For global gene expression analysis 2 μ g of isolated total RNA was amplified using the Agilent Low RNA Input Fluorescent Linear Amplification Kit and hybridized in dye-swap design on at least 4 (according to number of used replicates) home made whole genome cDNA microarrys containing 37.500 genes from the RZPD Unigene 3.1 clone collection. The complete dataset was submitted to the GEO database (GEO ID: GSE7033). A manuscript describing the biological implication of the data and analysis in more detail is in preparation.
The microarray data was normalized on probe level using a variance stabilization transformation . We calculated p-values for differential gene expression by fitting an empirical Bayes model using the limma package in the R statistical computing environment . For each knock-down experiment we selected the top 100 ranked genes, which showed an at least 1.5-fold absolute change in expression level in at least one experiment. This gave us m = 94 E-genes altogether, which were the basis for our network inference (see additional file 2).
Generation of Artificial Networks
To get better insights into the performance of our large scale inference methods way we generated data from artificial random networks. This was done as follows: A network topology was created by randomly connecting n signaling genes (S-genes) with q edges. The number q was itself a random number between 1 and 25% of all possible edges. It thus covered extremely sparse up to relatively dense topologies. No loops between a node and itself were allowed. After defining the core topology, the network was transitively closed. Because the Simulated Annealing (SA) method can only deal with Directed Acyclic Graphs (DAGs), in this case we additionally restricted ourselves to randomly generated transitively closed DAGs. After creating the network between S-genes, we attached m E-genes uniform randomly over all S-genes. We then simulated knock-downs of the individual S-genes. For the genes effected by the knock-down (E-genes) "p-values" from the distribution f1 were sampled (c.f. Subsection "Our Approach"). For those E-genes, where no effects were expected, the "p-values" were drawn uniform randomly from [0,1]. Afterwards, all sampled "p-values" were processed by the f1 density function.
All methods were implemented and computation and testing was performed using the statistical computing environment of R. The implementations of our methods have been integrated in the R package "Nested Effect Models" (nem) together with the original methods by Markowetz et al. . The package and source code is publicly available via the Bioconductor repository.
We thank Florian Markowetz, Rainer Spang, Andreas Buneß, Markus Ruschhaupt and Ruprecht Kuner for help and discussions, and Dirk Ledwinda for IT support.
TB and HF were supported by the National Genome Research Network (NGFN) of the German Federal Ministry for Education and Research (BMBF) – grants SMP Bioinformatics (01GR0450, subprojects PBF-S19T10, PBF-S02T11) and exploratory project EP-S19T03. MF and HS were supported by the NGFN SMP RNA (01GR0418).
- Boutros M, Agaisse H, Perrimon N: Sequential activation of signaling pathways during innate immune responses in Drosophila. Developmental Cell. 2002, 3 (5): 711-722. 10.1016/S1534-5807(02)00325-8.View ArticlePubMedGoogle Scholar
- Markowetz F, Bloch J, Spang R: Non-transcriptional pathway features reconstructed from secondary effects of RNA interference. Bioinformatics. 2005, 21 (21): 4026-4032. 10.1093/bioinformatics/bti662.View ArticlePubMedGoogle Scholar
- Tikhonov A, Arsenin V: Solutions of ill-posed problems. 1977, W.H. WinstonGoogle Scholar
- Smyth G: Linear models and empirical Bayes methods for assessing differential expression in microarray experiments. Statistical Applications in Genetics and Molecular Biology. 2004, 3-Google Scholar
- Liao J, Lin Y, Selvanayagam Z, Shih W: A mixture model for estimating the local false discovery rate in DNA microarray analysis. Bioinformatics. 2004, 20 (16): 2694-2701. 10.1093/bioinformatics/bth310.View ArticlePubMedGoogle Scholar
- Pounds S, Morris S: Estimating the occurence of false positives and false negatives in microarray studies by approximating and partitioning the empirical distribution of p-values. Bioinformatics. 2003, 19 (10): 1236-1242. 10.1093/bioinformatics/btg148.View ArticlePubMedGoogle Scholar
- Efron B, Tibshirani R: Empirical Bayes methods and false discovery rates for microarrays. Genetic Epidemiology. 2002, 23: 70-86. 10.1002/gepi.1124.View ArticlePubMedGoogle Scholar
- Benjamini Y, Hochberg Y: Controlling the False Discovery Rate: a Practical and Powerful Approach to Multiple Testing. J Royal Statist Soc. 1995, 57: 289-300.Google Scholar
- Dempster A, Laird N, Rubin D: Maximum likelihood from incomplete data via the EM algorithm. J Royal Statistical Soc Series B. 1977, 39: 1-38.Google Scholar
- Fröhlich H, Fellmann M, Sültmann H, Poustka A, Beissbarth T: Estimating Large Scale Signaling Networks through Nested Effects Models from Intervention Effects in Microarray Data. Proc German Conf on Bioinformatics. 2007, 45-54.Google Scholar
- Schölkopf B, Smola AJ: Learning with Kernels. 2002, Cambridge, MA: MIT PressGoogle Scholar
- Hastie T, Tibshirani R, Friedman J: The Elements of Statistical Learning. 2001, SpringerView ArticleGoogle Scholar
- Kirkpatrick S, Gelatt CD, Vecchi MP: Optimization by Simulated Annealing. Science. 1983, 220 (4598): 671-680. 10.1126/science.220.4598.671.View ArticlePubMedGoogle Scholar
- Berg B: Markov Chain Monte Carlo Simulations and Their Statistical Analysis. 2004, World ScientificView ArticleGoogle Scholar
- Lukashin AV, Fuchs R: Analysis of temporal gene expression profiles: clustering by simulated annealing and determining the optimal number of clusters. Bioinformatics. 2001, 17 (5): 405-414. 10.1093/bioinformatics/17.5.405. [http://bioinformatics.oxfordjournals.org/cgi/content/abstract/17/5/405]View ArticlePubMedGoogle Scholar
- Gonzalez OR, Kuper C, Jung K, Naval J, Prospero C, Mendoza E: Parameter estimation using Simulated Annealing for S-system models of biochemical networks. Bioinformatics. 2006, btl522-[http://bioinformatics.oxfordjournals.org/cgi/content/short/23/4/480]Google Scholar
- Poutre JL, van Leeuwen J: Maintenance of Transitive Closures and Transitive Reductions of Graphs. Tech. Rep. RUU-CS-87-25, Rijksuniversiteit Utrecht. 1987Google Scholar
- Kaufman L, Rousseeuw P: Finding Groups in Data: An Introduction to Cluster Analysis. 1990, New York: WileyView ArticleGoogle Scholar
- Rousseeuw P: Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J Comp and Applied Mathematics. 1987, 20: 53-65. 10.1016/0377-0427(87)90125-7.View ArticleGoogle Scholar
- Belisle CJP: Convergence theorems for a class of simulated annealing algorithms. J Applied Probability. 1992, 29: 885-895. 10.2307/3214721.View ArticleGoogle Scholar
- Huber W, Heydebreck A, Sültmann H, Poustka A, Vingron M: Variance Stabilization Applied to Microarray Data Calibration and to the Quantification of Differential Expression. Bioinformatics. 2002, 18: S96-S104.View ArticlePubMedGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.