Quantification of biological network perturbations for mechanistic insight and diagnostics using two-layer causal models

Background High-throughput measurement technologies such as microarrays provide complex datasets reflecting mechanisms perturbed in an experiment, typically a treatment vs. control design. Analysis of these information rich data can be guided based on a priori knowledge, such as networks or set of related proteins or genes. Among those, cause-and-effect network models are becoming increasingly popular and more than eighty such models, describing processes involved in cell proliferation, cell fate, cell stress, and inflammation have already been published. A meaningful systems toxicology approach to study the response of a cell system, or organism, exposed to bio-active substances requires a quantitative measure of dose-response at network level, to go beyond the differential expression of single genes. Results We developed a method that quantifies network response in an interpretable manner. It fully exploits the (signed graph) structure of cause-and-effect networks models to integrate and mine transcriptomics measurements. The presented approach also enables the extraction of network-based signatures for predicting a phenotype of interest. The obtained signatures are coherent with the underlying network perturbation and can lead to more robust predictions across independent studies. The value of the various components of our mathematically coherent approach is substantiated using several in vivo and in vitro transcriptomics datasets. As a proof-of-principle, our methodology was applied to unravel mechanisms related to the efficacy of a specific anti-inflammatory drug in patients suffering from ulcerative colitis. A plausible mechanistic explanation of the unequal efficacy of the drug is provided. Moreover, by utilizing the underlying mechanisms, an accurate and robust network-based diagnosis was built to predict the response to the treatment. Conclusion The presented framework efficiently integrates transcriptomics data and “cause and effect” network models to enable a mathematically coherent framework from quantitative impact assessment and data interpretation to patient stratification for diagnosis purposes.


Network Models
The backward assumption, which is based on the concept that gene expressions are the consequences of upstream interactions between biological entities represented by network nodes (such as enzymatic activities) requires a significant effort in mining and organizing the knowledge to build "cause and effect" network models. The results presented in this study represent a step forward in network integration and have clearly shown the potential of TopoNPA to structure the gene expression profiles into backbone differential scores in a relevant and robust manner. The networks built to date are pioneering a novel approach to model cellular processes at the molecular level by representing knowledge in a computable format using the Biological Expression Language (BEL) syntax. The Open BEL framework is an open source initiative [1] which aims to promote the BEL syntax as a standard to capture biological knowledge.
The description of molecular processes in terms of network models introduces several innovative features and opens new perspectives for describing biological processes at system levels. First, as a consequence of the BEL semantics, the fundamental entities in the network models are not exclusively the concentrations of the relevant molecules and their variations, but also include their functions and their perturbations. The functional layer is the result of assembling multiple fundamental entities using literature-supported causal relationships, with the goal of covering one specific cellular function, such as cell cycle, xenobiotic metabolism or NFkB signaling. As a result, the integration of experimental data in the context of such a network model naturally provides a "mechanistic" description, as the effects of the experimental treatment contained in the data are described in terms of fundamental and causally connected functional entities underlying the cellular process potentially associated with the observed phenotypes.
Second, the adoption of the backward-causal paradigm in our network models enables the distinction between a "functional layer", containing the actual biological mechanisms, and a "gene transcript layer" representing the experimentally measurable consequences of the network when using transcriptomic technologies ( Figure  1b). This architecture is characteristic of the combination of the network backbone and backward-causal approaches, which constitute two essential features introduced in this study. Previous works in molecular systems biology used these components partially and separately when integrating prior knowledge into the analysis of transcriptomic data. For instance, the use of protein-protein interaction (PPI) networks in this context is pragmatically based on the "forward" approach ( Figure 1a), even though variations in transcript levels do not always translate into a similar variation in protein abundances or activities. Also, PPI networks do not contain the mechanistic aspect discussed above, as their nodes do not specifically represent functional entities whose assembly describes cellular processes such as proliferation or stress. Similarly, most existing backward-causal approaches do not introduce relationships between their upstream entities, in opposition to the complex "functional layer" of the network models used in this study. In most of the cases, a large collection of potential upstream entities are tested independently of one another, and their likelihood is scored individually by an enrichment statistic applied to differential gene expression values [2][3][4]. A step forward is achieved by the Network Component Analysis (NCA) where several upstream entities are allowed to compete with one another to best model the cause of the observed gene expression values [5]. TopoNPA methodology goes even further by using all the structure contained in the directed and signed networks of upstream entities constituting the "functional layer".
Another important feature of our biological network models is their clearly defined boundaries, which are usually absent in PPI networks. Indeed, a direct physical/chemical interaction between two proteins, as captured in a PPI network, is much less dependent on the tissue context than the functional activity of a protein, which may strongly depend upon the presence of co-factors whose abundance can vary across tissues. This highlights the fact that the relationships in the "functional layer" have a specifically defined context as evidenced by the supporting literature. For example, the functional layers for the Xenobiotic and Cell Cycle models were made specifically for healthy lung tissue, whereas the TNF-IL1a-TLR-NFκB network model was made generically. These considerations clearly indicate that the context-dependent, mechanistic and backward-causal biological network models used in this work constitute a powerful knowledge based substrate for integrating transcriptomic data.

Defining the Biological Content of Network Models
The biological content of the network models is defined during their building phase, which is multi-step iterative process involving human expert knowledge and datadriven machine-assisted techniques. First, guided by a survey of relevant scientific literature into the signaling pathways relevant to the process of interest (e.g., proliferation, stress, or inflammation), a team of experts define the biological boundaries of the network. Cause-and-effect relationships describing these pathways are extracted from the literature and the Selventa Knowledgebase which nucleates the network with assertions derived from the relevant cell types and/or experimental contexts. Second, gene expression data obtained from experiments where the process of interest has been stimulated are analyzed using Reverse Causal Reasoning (RCR, [4,6]). RCR is a "backward-causal" approach that explicitly integrates Selventa Knowledgebase, taking gene expression profiling data as an input and producing qualitative statements for the activity states of biological entities according to statistical and biological criteria as outputs [7][8][9]. Hypothesized upstream controllers of the observed experimental data are drawn from those computations to be included into the network.
In the final step of network construction, the content and connectivity is subject to a terminal round of manual review. Ultimately, this three-step methodology results in computationally optimized network models whose edges are supported by published literature [10].
Data CDK Inhibitor-Treated Normal Human Bronchial Epithelial Cells NHBE cells were treated with the CDK4/6 inhibitor PD-0332991 (IC50 = 0.011 µmol/L for CDK4 and 0.016 µmol/L for CDK6). Specifically, cells were treated in vitro with media alone or media with 1 µM CDK inhibitor, for 24 hours, after which CDK inhibitor-treated cells were washed in media with or without the same concentration of CDK inhibitor, while control cells were washed in media alone. Cells were collected 2, 4, 6 and 8 hours after washing (3 petri-dishes per time point), and total RNA was extracted and hybridized to Affymetrix U133 Plus 2.0 microarrays. Data is available in Array Express, accession number E-MTAB-1272.

Rat 28-Day cigarette smoke inhalation
Outbred Sprague-Dawley rats were obtained from Charles River, France. The age of the rats at the start of the inhalation period was 7 weeks. The body weights were within 20% of the mean weight for each sex. The rats were nose-only exposed to main stream smoke (MS) or to filtered, conditioned air (sham exposure group) for 6 hours /day, 7 days/week for 28-days. Each group was exposed in a separate exposure chamber. The 28-day exposure period is defined in draft OECD guideline 412 (2005). The target MS concentrations were 8, 15, or 23 µg nicotine/l for the groups exposed to the reference cigarette 3R4F. A 5-day dose-adaptation regimen was applied at the start of the inhalation period. On study days 1 and 2, the rats were exposed to 1/3 of the target concentrations. On study days 3 and 4, the rats were exposed to 2/3 of the target smoke concentrations. As from day 5 on, the rats were exposed to the target concentrations. 3R4F cigarettes were smoked according to Health Canada Intensive Smoking Protocol (Health Canada, 1999) whereby MS was diluted with filtered, conditioned air to the target nicotine concentrations. Steady-state blood carboxyhemoglobin concentration, respiratory physiology parameters and representative nicotine metabolites in the urine were determined (data not shown). Food consumption and body weight, and hematological, clinical-chemical, gross pathological, histopathological, and pulmonary inflammation parameters were determined to characterize the biological activity of the smoke. The study was conducted in compliance with the OECD Principles on Good Laboratory Practices (as revised in 1997).
TNF exposure of RLAK cell NRBE cells (normal rat bronchial epithelial cells) were either treated with a vehicle control (sham) or TNFα at concentrations of 0.1 (very low), 1 (low), 10 (medium), or 100 (high) ng/ml. Normal rat bronchial epithelial cells (Lonza Walkersville, Inc.) were cultured in standard growth medium (Clonetics medium, Lonza Walkersville, Inc.). Cells were harvested after the desired treatment length (30 minutes, 2 hours, or 24 hours). Cells were immediately put on ice. Each experimental group contains 5 biological replicates. Data are available in Array Express, accession number E-MTAB-1311.

Gene expression profiling and analysis Tissue Preparation
Dissection took take place directly after the last exposure. Left and right lungs were separately snap-frozen immediately after the preparation. To collect specific cellular structures of the lung (i.e., parenchyma and airways), Laser Capture Microdissection (LCM) was performed (PALM Microbeam, Carl Zeiss Microscopy GmbH, Jena). To this end, 20 µm slices were produced and mounted on special membrane slides (Zeiss, "MembraneSlide", 1.0 PEN'). For collecting the parenchyma material, only slices which were enriched for parenchyma (mainly at the exterior side of the lung) were used. As airway-conducting material, the main-bronchus, and the first main branching were taken. Only slices which were enriched for air-conducting structures (from the central part of the lung) were taken. The areas of interest (parenchyma, main bronchus) were exercised by laser micro-dissection and transferred into an RNAse-free reaction tube. Because the lasered areas were too large for the catapulting function of the microscope, the slices needed to be transferred with RNAse-free forceps (irradiated with UV light). Directly after the transfer, the tissue was lysed in 700 µl QIAzol lysis buffer (Qiagen). The samples were frozen on dry ice and stored at -80 degC until further processing.

RNA Preparation and Whole Genome Expression Arrays
RNA from the derived samples was isolated using miRNAeasy Mini Kit (Qiagen). The resulting RNA quality and quantity were analyzed using an Agilent 2100 Bio-Analyzer (Agilent, Waldbronn) and a NanoDrop ND-1000 (PeqLab, Erlangen). The starting material for the use of the GeneChip R 3' IVT Express Kit (Affymetrix) was 100 ng RNA in 3 µl RNAase-free water. The procedure was carried out as described in the GeneChip R 3' IVT Express Kit User Manual from Affymetrix. The Hybridization mix was incubated on GeneChip R Rat Genome 230 2.0 Arrays for 16 hours. Further processing was performed as described in the GeneChip R Expression Analysis Technical Manual with Specific Protocols for using the GeneChip Hybridization, Wash and Stain Kit Affymetrix. After hybridization, the arrays were inserted in the GeneChip R Fluidics Station 450 of Affymetrix which was operated as described in the GeneChip R Fluidics Station User's Guide. The Fluidics Station was controlled by the GeneChip Command Console software. The probe array underwent an automated washing and staining protocol specific for GeneChip R Rat Genome 230 2.0 Arrays. The arrays were further analyzed in the Affymetrix R Expression Console software application using the MAS5 algorithm to create CHP files and checked for several quality parameters provided in the Affymetrix data analysis software package. Command Console Software (Affymetrix) was used to automatically grid the DAT files and create the CEL files (probe cell intensity data). Data processing was implemented in the R statistical environment. Raw RNA expression data were analyzed using the affy and gcrma packages of the Bioconductor suite of microarray analysis tools available in the R statistical environment. Robust Microarray Analysis (RMA) background correction and quantile normalization were used to generate probe set expression values. Forward FCS approach results using the 28 Day Inhalation study and the TNF-induced response in RLAK cells study, using both TNF-IL1α-TLR-NFκB and the xenobiotic metabolism response networks. Several combinations of gene level statistics and geneset enrichment statistics are compared. Among those, the combination "fc / maxmean" is the method leading to the best qualitative and quantitative results, but still predicting some activation of the xenobiotic response metabolism for the TNFα-induced treatment in RLAK cells (bottom right panel). Our method is the only one exhibiting the expected qualitative and quantitative pattern. A "*" indicates a significant enrichment (P-value¡0.05). ES is teh enrichemnt score of each methods normalized to its maximum value. See main text, method section for further details.   ) results using the 28 Day Inhalation study and the TNF-induced response in RLAK cells study, using both TNF-IL1α-TLR-NFκB. The method is not applicable to the xenobiotic network as this model is not causally consistent. The qualitative pattern of the NPA method, when applicable, is matching with the expectations. While behaving quantitatively for the RLAK study, it fails to quantify the dose response in the 28-day rat inhalation study. Again, TopoNPA method is one exhibiting both the expected qualitative and quantitative pattern. A "*" indicates a significant enrichment (P-value< 0.05).

Figure S7
Network leading nodes of TNF-IL1α-TLR-NFκB network for the comparison of IFX responders vs. IFX non-responders, before (left bar) and after (right bar) treatment. Values are -1 (respectively +1) if the node is a leading node and its differential backbone value is negative (respectively positive), or 0 if it is not a leading node. Nodes circled in green are the leading nodes of the perturbation of the network from the comparison shown in Figure 6 of the main text.    Supplementary Tables   Table S1 Prediction sensitivities and specificities

Type
Method  A → B (B → A respectively). The mean of the G-performances (= √ Sp · Se) over the two independant test sets are shown in the rightmost column and is highlighted if > 0.7. The best UBE based models is chosen based on the mean cross-validation G-performance for the two datasets. The algorithms based on the backbone values leads to good performances for a majority of algorithms.

Type
Method