NPA: an R package for computing network perturbation amplitudes using gene expression data and two-layer networks

Background High-throughput gene expression technologies provide complex datasets reflecting mechanisms perturbed in an experiment, typically in a treatment versus control design. Analysis of these information-rich data can be guided based on a priori knowledge, such as networks of related proteins or genes. Assessing the response of a specific mechanism and investigating its biological basis is extremely important in systems toxicology; as compounds or treatment need to be assessed with respect to a predefined set of key mechanisms that could lead to toxicity. Two-layer networks are suitable for this task, and a robust computational methodology specifically addressing those needs was previously published. The NPA package (https://github.com/philipmorrisintl/NPA) implements the algorithm, and a data package of eight two-layer networks representing key mechanisms, such as xenobiotic metabolism, apoptosis, or epithelial immune innate activation, is provided. Results Gene expression data from an animal study are analyzed using the package and its network models. The functionalities are implemented using R6 classes, making the use of the package seamless and intuitive. The various network responses are analyzed using the leading node analysis, and an overall perturbation, called the Biological Impact Factor, is computed. Conclusions The NPA package implements the published network perturbation amplitude methodology and provides a set of two-layer networks encoded in the Biological Expression Language.


Background
Gene expression technologies provide complex datasets reflecting the response of a cell system or organism exposed to bioactive substances. Contextualizing and quantifying the transcriptomic profiles into predefined mechanisms by combining the gene expression changes and networks is at the core of systems toxicology, as this discipline requires a quantitative measure of doseresponse. Among the networks used, cause-and-effect network models are becoming increasingly popular [3, 7-9, 12, 13, 19, 25, 29-31, 34, 35], and many of these models, describing processes involved in cell proliferation, cell fate, cell stress, and inflammation, have already been published in repositories such as the Causal Bionet (http://causalbionet.com) [3] or BEL Commons (https:// bel-commons.scai.fraunhofer.de) [14]. The underlying biological knowledge in these networks has been extracted from the scientific literature manually and encoded in the Biological Expression Language (BEL), a computable language developed specifically for causal biological networks (CBN) (http://bel.bio/).
To address the quantification of biological mechanisms known to be linked or that can lead to toxicological responses based on gene expression profiles, twolayer networks are suitable, and a mathematically and statistically sound methodology, called network perturbation amplitude (NPA), was published in [23]. The twolayer structure on the networks reflects backward reasoning, paradigm in which protein activities in a pathway are inferred from their downstream gene expression footprints, as opposed to forward reasoning that maps gene expression (or related fold-change) to protein activities. In this context, several studies using this approach have been published in recent years [10-13, 15, 16, 20, 21, 24, 26, 27, 36]. This approach is a threshold-free approach and can therefore enable an objective evaluation of network perturbations.

Implementation
The NPA package has been developed in and designed for the R statistical environment. It is accessible in Bioconductor and free of charge for non-commercial use.
The implementation uses R6 classes with an S4-class dispatching in order to make its usage more intuitive. It does not rely on any C++ component, as the R code has been optimized to provide fast permutation tests (less than 5 s per two-layer structured network).

Workflow overview
The quantification of NPAs aims to describe the response of biological mechanisms modeled by a network using transcriptomic data. Here we focus on the particular type of causal networks for which the NPA methodology has been developed. Given a suitably organized collection of causal networks selected for a priori relevant biological mechanisms, the structure of the associated NPA results can be seen as a complex reduction scheme starting from large experimental transcriptomic data. It provides a quantification of the treatmentinduced impact on the considered biological mechanisms, which, in concrete applications, is used to comparatively assess toxicity.
Concretely, the workflow of the NPA computational workflow requires three distinct inputs in terms of experimental data and biological knowledge (Fig. 1).

Network input
The network models provided with the companion package, NPAModels (https://github.com/philipmorrisintl/NPAModels), represent the molecular mechanisms across wide range of biological processes, including cell fate, cell stress, cell proliferation, inflammation, and tissue remodeling, relevant for human respiratory physiology. The two-layer structure, as defined in [23], is summarized below.

Network functional layer
Unlike other types of networks, the networks nodes describe molecular (protein, chemicals, genes) concentrations but also represent functions such as transcriptional, enzymatic, or kinase activities. The network edges encode directed relationships between nodes, each of which is associated with a sign representing the increasing or decreasing direction of change between the molecular activities. The biological knowledge represented in these networks has been manually extracted from the scientific literature and encoded in the BEL, an ontology developed specifically for CBNs (http://www. openbel.org/). More CBNs and BEL resources are publicly and freely available on the CBN Database website (http://causalbionet.com) [3]. a b c e d Fig. 1 The NPA workflow. The gene expression data are used to estimate the treatment effect for each gene. The (log 2 ) fold-changes and the associated t-statistics are required (a). By combining the gene expression fold-changes and a two-layer causal network (b), its perturbation is quantified and assessed for its significance by combining two specificity statistics and the fold-change standard deviations (c). Several NPAs can be summarized into a holistic quantity describing the overall biological response, called the Biological Impact Factor (BIF) (d), which can be used as a toxicological index for comparing various treatments. If an individual network reaches significance, the leading node analysis (e) enables the identification of the key biological entities involved in its response to ease the interpretation

Network transcript layer
For some nodes in the functional layer, information supporting the relationship to the expression (upregulation or downregulation) of certain genes is available. These relationships are either extracted from literature or from specific gene expression experiments. Such a footprint for a given node in the functional layer (such as a transcription factor or an activity of a protein) can be efficiently extracted from gene expression data sets whereby the experiment involves the inhibition or activation of the molecule under consideration. Edges from that node to genes significantly impacted are defined and signed accordingly. These specific edges (directed and signed) define the gene expression fingerprint of the network functional layer. Typically, hundreds or thousands of genes are included in the transcript layer. Note that not all the nodes are required to have such footprints for computing NPA. Following the "backward reasoning", paradigm stating that changes in molecular mechanisms encoded by causal network nodes (e.g., the activity of a transcription factor) can be deduced from the expression changes of their downstream-regulated genes, a gene expression profile can be used to computationally predict the activity of the functional layer nodes.

Data input
The gene expression profiles are derived from gene expression data for which one or several contrasts (typically a treatment versus control design, or linear model coefficients) are estimated from a statistical model. These contrasts represent the system response profile associated with comparisons of interest and are mapped onto the transcript layer of the two-layer network, which reflects the perturbation of the functional layer. In our implementation, the data input is a list, one entry per contrast, and each entry is a data frame, with a variable nodeLabel describing the gene symbol of each row, fold-Change describing the estimated contrast (log 2 -based), and t describing the t-statistics associated with the foldchange. T-statistics are typically derived from limma (in which case the moderated t-statistics is used) or from DESeq2 for RNA-seq data (in which case the t-statistics is replaced by the z-statistics of the Wald test used for testing the significance of the shrunken estimates of the gene fold-changes).

NPA computation
The NPA method previously reported in [12,13,23] will first compute the perturbations for the network nodes based on a constraint optimization problem. If the obtained transcriptomics profile reflects the perturbation of the functional layer, all of the differential values should be close to each other (smooth perturbation profile) while being equal to the observed fold-changes β in the transcript layer, denoted V 0 . The differential values for the functional layer are obtained by solving: denotes the sign of the edge x → y. This constrained optimization problem can be solved analytically and was implemented as a matrix multiplication, relying on sub-matrices of the signed Laplacian matrix of the two layers [23]. Once differential values are obtained for the functional layer, the NPA score is then computed by summing the result over the edges of that layer as: where E is the set of its edges, |E| is its size, f is the solution of the constrained problem describing the node differential values, and e 0 and e 1 denote the start and the end, respectively, of the edge e. This sum is efficiently implemented as the evaluation of a quadratic form. Finally, the computation of three accompanying statistics is performed to determine if the obtained NPA value represents a true positive. The first statistic is based on the biological variability propagated from the uncertainties of the differential gene expression values: the 95% confidence interval (CI) around the NPA value should not contain zero. The other two statistics test the relevance of the biological mechanism(s) encoded in the network by randomly reshuffling the network edges of the transcriptional layer or functional layer. This turns into two null distributions for the network-level perturbation values. If the actual NPA value lies above the 95% quantile of a null distribution, it is considered to be statistically significant and labeled as "O" or "K," respectively. Significant network perturbations correspond to the cases where all three statistical tests lead to significance.
In our implementation, computation time for those statistics usually takes 49.1 s for 12 comparisons and 58.1 s for 48 comparisons and scales linearly with the number of comparisons by using the precomputed permutation matrices. To speed up the computation, it is recommended to precompute the permutation matrices using the function preprocessNetworks() provided by the NPAModels companion package. With the preprocessed version of the networks, the computation time drastically drops to 6.4 s for 12 comparisons and 20.1 s for 48 comparisons, on single Intel® Core™ i5-6500 CPU @ 3.20GHz with a 16 Gb of RAM computer. If a node of the functional layer has less than five edges to the transcript layer (considered as under-represented), removing those edges is recommended. Finally, only the nodes having an ancestor and a descendant node with edges to the transcript layer are scored to avoid extrapolation.
All of the computations are performed with a single call of compute npa and plotted using the barplot method. Eight two-layer networks are available in the accompanying NPAModels package .
The interpretation of a significant network perturbation is supported by the inspection of the leading nodes, defined as the functional layer nodes that contribute the most to the NPA scores. This component is key, as two different perturbation patterns in the functional layer can lead to the same summarized scores and as functional layer size can be substantial (e.g., oxidative stress network functional layer contains 194 nodes) making the systematic inspection of all differential values tedious. Leading nodes are nodes of the functional layer, whose differential values contribute 80% to the observed effect. The inspection of this shorter list of nodes supports the interpretation and the comparison of network perturbation across contrasts. It also allows for the assessment of the directionality (activation or inhibition) of the inferred effects on each node; as the NPA is by definition positive. Leading node ranks and signs can be accessed with the as.matrix method from an NPA object and visualized as a heatmap or in the network graph by calling the plot method. To further ease the interpretation and identification of highly impacted sub-networks, network sub-graphs (called here modules) that are dense in leading nodes across all the contrasts can be extracted by the method module and plotted by a simple plot call. The modules are extracted by using the dNetFind call from the dnet package in order to heuristically find a maximum scoring connected subgraph, using the leading node contributions as scores. If modules are too large for a visual inspection, they can be further clustered by using the infomap community approach implemented in the igraph package.
When several networks are available and grouped into families (such as Cell Stress, Cell Fate), the treatmentinduced biological impact is summarized as network family-level impact factors, which in turn are summarized into a single quantity, called the BIF. The evaluation of the BIF consists of filtering out the networks that are not significantly perturbed and then summing the remaining NPA values with weights that take into account the number of network in each family and the nodes overlapping between networks.

Transcriptomic dataset
To show the scoring of the network models with transcriptomic data, we have focused on a mouse inhalation study reported in [4]. The study was designed to identify the onset of emphysema induced by exposure to cigarette smoke (CS). C57BL/6 mice were exposed to mainstream CS from the 3R4F reference cigarette through whole-body exposure for up to 7 months. Additionally, four cessation scenarios were included to assess the impact of smoking cessation on emphysema progression in the lung. Figure 2 illustrates the study arms used in this analysis. The dataset is available in the package as data (COPD1).

NPA and BIF
The NPA scores, describing the degree perturbation of each network as a positive amplitude, are shown in Fig. 3a. To generate NPA scores on the whole network suite and displaying the results, the following code was used: All network models were impacted in response to 5 and 7 months of CS exposure. These impacts were a b The backbone NPA values with directionalities of inferred regulation are shown as bar graphs for each node. The green asterisk indicates that the node is a leading node in a given contrast. Highlighted nodes are the molecules discussed in the main text. The vocabulary for the BEL is provided in http://www.openbel.org/ remarkably reduced in the lungs of mice that were exposed to fresh air after 2 months of CS exposure and to a lesser extent when the CS exposure was stopped after 4 months. A similar result was seen in the BIF that depicts the aggregated impact based on all networks that were scored with the transcriptomic data (Fig. 3b).

Leading node analysis
The leading node analysis was used to dissect the mechanistic detail behind the perturbation of each network model in response to CS exposure. To investigate the network perturbation, a graphical presentation of the leading node module in the oxidative stress network model was leveraged ( Fig. 4) and generated using the code below: The graph illustrates the signaling cascade from particulate matter exposure to the increase in superoxide production, leading to increased intracellular reactive oxygen species (ROS), which also results from the activation of the arachidonate 12-lipoxygenase (ALOX12) [1] and xanthine dehydrogenase (XDH) [33]. In contrast, thioredoxin (TXN) was inferred to be inhibited, in line with its role in modulating intracellular ROS balance [18]. The increase in ROS triggers multiple signaling pathways, such as the SRC family leading to activation of the NADPH oxidase complex via cortactin (CTTN) and neutrophil cytosolic factor 1 (NCF1) [32]. The NADPH oxidase complex in turn further increases the production of ROS [2]. Other signaling molecules predicted to be activated in response to CS and the increase in ROS in the mouse lung include Duox1 and Map3k5. The apoptosis network model represents one of the cell fates that can result from prolonged oxidative stress and tissue damage. Figure 5 shows the perturbation for the apoptosis network as a whole and the leading nodes inferred from the transcriptomic data from CS-exposed mouse lungs. The activation of caspase-8 mediated by p53 and the FAS/ FADD pathway was inferred from the transcriptomic data, in line with the CS toxicants, such as acrolein, activating apoptosis pathways in lung cells [28]. TNFα pathway in the context of apoptosis was also inferred to mediate caspase activation in this dataset. The network presentation on the right (panel b) shows an example of connected leading nodes common to all contrasts. The backbone NPA values with directionalities of inferred regulation are shown as bar graphs for each node. The green asterisk indicates that the node is a leading node. Highlighted nodes are the molecules discussed in the main text. The vocabulary for the BEL is provided in http://www.openbel.org/ Inferred activation of FOXM1, subsequent activation of SKP2, and the degradation of cyclin-dependent kinase inhibitor 1B (CDKN1b) were observed based on the leading node analysis of the cell cycle network scored with transcriptomic data from the lungs of CS-treated mice [5] (Fig. 6).
The degradation of the extracellular matrix (ECM) is characteristic of emphysema, which is a manifestation of chronic obstructive pulmonary disease and a phenotype detected in the lungs of mice exposed to CS [6,22]. The graphical presentation of the leading nodes that are connected in the ECM degradation network model is shown in Fig. 7. The tissue inhibitor of matrix metalloproteases (TIMP) were inferred to be downregulated, and accordingly, the matrix metalloproteases (MMP) were inferred to be activated, with subsequent degradation of collagen in the lungs of mice exposed to CS. The leading node analysis also indicated that increased IL1b signaling was responsible for the molecular cascade leading to the degradation of the ECM components.

Conclusions
The NPA package and its companion data package NPA-Models provide the tools and resources to contextualize transcriptomic profiles into manually curated networks describing cellular processes relevant to toxicology, such as apoptosis, cell cycle, oxidative stress, and phase I xenobiotic metabolism. This network enrichment approach [23] can be considered as a threshold-free functional class scoring methodology [17]. The computations and presentations of the results have been facilitated by using R6 classes and provides the user with an intuitive and easy-to-use package.