SEMgraph: An R Package for Causal Network Analysis of High-Throughput Data with Structural Equation Models

With the advent of high-throughput sequencing (HTS) in molecular biology and medicine, the need for scalable statistical solutions for modeling complex biological systems has become of critical importance. The increasing number of platforms and possible experimental scenarios raised the problem of integrating large amounts of new heterogeneous data and current knowledge, to test novel hypotheses and improve our comprehension of physiological processes and diseases. Although network theory provided a framework to represent biological systems and study their hidden properties, different algorithms still offer low reproducibility and robustness, dependence on user-defined setup, and poor interpretability. Here we discuss the R package SEMgraph, combining network analysis and causal inference within the framework of structural equation modeling (SEM). It provides a fully automated toolkit, managing complex biological systems as multivariate networks, ensuring robustness and reproducibility through data-driven evaluation of model architecture and perturbation, that is readily interpretable in terms of causal effects among system components. In addition, SEMgraph offers several functions for perturbed path finding, model reduction, and parallelization options for the analysis of large interaction networks.


Introduction
Discovering and understanding the mechanisms underlying complex phenotypical traits is of primary importance in bio-medical research. A deeper and detailed knowledge of the physio-pathological events leading to the onset and progression of a disease enables a clearer estimation of disease risk and more accurate diagnosis, prognosis evaluation, and decision making, including treatment choice [Ritchie et al., 2015]. With the advent of the high-throughput sequencing (HTS) technologies, the actual complexity behind diseased (and generally, phenotypical) traits became prominent, opening up to the big data era also in molecular biology and medicine [Shendure and Aiden, 2012]. Biological systems complexity arises from the interactions and reactions among their components (e.g., genomic elements, epigenomic modifications, DNA-binding proteins, miRNAs, receptors, signaling molecules) and the layered modularity of their compartments (e.g., cellular components, tissues, organs). Predicting the behavior of these components after external perturbation or intrinsic variability (e.g., genetic polymorphisms), is key for the discovery and prediction of disease-associated processes [Liu et al., 2020, Ritchie et al., 2015, Shendure and Aiden, 2012. Biological models are commonly represented by signaling pathways, chains of metabolic reactions, disease modules, or very large protein-protein interaction networks (also called interactomes) [Ritchie et al., 2015, Barabási et al., 2011. Given the vast amount of publicly available bio-medical databases, the access to curated biological models is no longer a limitation. The key feature of these databases is the availability of structured bio-chemical and bio-medical information that can be readily converted into networks and statistical models: we generally refer to them as knowledge-based models (KBMs). KBMs provide a basis and a gold standard to improve exploratory methods, with some critical issues [Ritchie et al., 2015]. Fistly, defining a set of rules to convert a KBM to a causal model is key to test specific biological hypotheses and mechanisms, but it is not always trivial due to missing information or evidence level (e.g., experimental evidence versus inference from similarity or electronic annotation). Secondly, a KBM reflects current knowledge, constantly challenged by new experimental data that may reveal novel interactions and pathways. Finally, there must be clear statistical criteria to evaluate the initial causal model, reflecting biological properties of the system, and improving both model descriptive and predictive power [Liu et al., 2020, Ritchie et al., 2015. Starting from current knowledge, network models should be updated and tested in a simple and clear workflow. From the computational point of view, the challenge is to free the user from chosing the initial setup, estimating algorithm and model parameters directly from quantitative data, with efficient and parallelizable methods. Motivated by this challenge, we developed the R package SEMgraph, based on structural equation modeling (SEM) [Bollen, 1989], enabling causal inference on complex biological networks. SEM are now a popular tool in causal inference [Pearl, 2009], causal structure learning [Spirtes et al., 2000], and biostatistics. Path diagrams, often represented as acyclic mixed graphs, provide a backbone for model learning, data-driven model refinement and causal inference and discovery. HTS data is often structured into pathways or large networks, enabling either confirmatory or exploratory analysis of salient biological properties. Within SEMgraph, this is practically achieved through algorithm-assisted search for the optimal trade-off between best model fitting (i.e., the optimal context) and perturbation (i.e., exogenous influence) given data, in which knowledge is used as supplementary confirmatory information. In SEMgraph, the input network and the underlying statistical model are interchangeable representations of the same object: a set of interacting variables linked by causal relationships. This dual representation is opportunely manipulated to generate the final causal model, through a series of intermediate steps, including causal backbone estimation, adjustement of hidden confounding variables, graph extension, and model refinement to improve fitting, whith scalable solutions for large graphs. In this work, we expose the relevant SEMgraph functions with examples of typical applications in genomics. SEMgraph package is available under the GNU General Public License version 3 or higher (GPL ≥ 3) from CRAN repository, and the latest stable version can be installed via: R > install . packages ( " SEMgraph " ) The development version of SEMgraph can be installed from the GitHub repository, at https://github. com/fernandoPalluzzi/SEMgraph through devtools: R > devtools :: install _ github ( c ( " fernandoPalluzzi / SEMgraph " )) 2 Structural equation models

SEM basics
SEM is a statistical framework for causal inference based on multivariate linear regression equations, where the response variable in one regression equation may appear as a predictor in another equation [Bollen, 1989, Shipley, 2016. SEM may be formulated to explicity include latent unobserved variables, but here we consider a setup in which the latent variable have been marginalized out and represented in the model only implicitly through possible correlations among unobserved latent confounders [Pearl, 1998]. A SEM, is based on a system of structural (i.e., linear regression) equations definig a path diagram, represented as a graph G = (V, E), where V is the set of nodes (i.e., variables) and E is the set of edges (i.e., connections). The set E may include both directed edges k → j if k ∈ pa(j) and bidirected edges k ↔ j if k ∈ sib(j), where the parent set pa(j), and the siblings set sib(j), determine the system of linear equations, as follows: cov(U j ; U k ) = ψ jk if j = k or k ∈ sib(j) 0 otherwise (2) where Y j and U j are an observed variable and an unobserved error term, respectively; β jk are regression coefficients, and a covariance ψ jk indicates that errors are dependent, which is assumed when there exists an unobserved (i.e.latent) confounder between k and j. A path diagram is also a formal tool to evaluate the hierarchical structure of a system, where we can identify exogenous variables as system elements with empty parents set, and endogenous variables, having at least one parent variable in at least one structural equation of the SEM. In graph theory, exogenous variables are source nodes, with incoming connectivity equal to 0, whilst endogenous variables are nodes with non-zero incoming connectivity. Endogenous variables can be further divided into connectors, with non-zero outgoing connectivity, and sinks, having no outgoing connections. Given these notions, we consider three types of fundamental path diagrams to describe high-throughput data structure: • Directed Acyclic Graphs (DAGs), composed by directed edges (k → j) only, whose magnitude is quantified through path coefficients β jk , and all covariances are null (i.e., ψ jk = 0). In addition, loops are not allowed in a DAG. • Bow-free Acyclic Paths (BAPs), having acyclic directed edges (k → j), and bidirected connections (k ↔ j) only if the k-th and j-th variable do not share any directed link (i.e., they are bow-free). As a consequence, in a BAP, if β jk = 0 then ψ jk = 0. • Covariance models, as a special case of BAP in which all β jk = 0. Therefore, only covariances ψ jk may have non-zero values.
These three models are simple graphs; i.e., they have at most one edge between any pair of nodes, and are all identifiable, such that the parameter matrices B and Ψ can be uniquely estimated from the population covariance matrix of the observed variables for nearly every parameter choice [Brito andPearl, 2002, Pearl, 1998].

SEM fitting
From the computational point of view, it is convenient to write Equations 1 and 2 in matrix form as: Y = BY + U and cov(U ) = Ψ. Assuming random variables with zero mean vector (µ(θ) = 0), the covariance matrix of the joint distribution of p variables Y is given by: where the set of free parameters θ = (β; ψ) has dimension t. B is the path coefficient matrix, Ψ is the covariance matrix, and I is the identity matrix, all of them having dimension p × p. Generally, in the SEM framework, free (i.e., unknown) parameters θ are computed by Maximum Likelihood Estimation (MLE), assuming all model variables as jointly gaussian, so that the estimated covariance matrix Σ(θ) is close to the observed sample covariance matrix S. This is obtained by maximizing (up to an additive constant) the model log-likelihood function logL(θ) given data [Bollen, 1989, p. 135].
arg max From the expected Fisher s information matrix of the likelihood function, standard errors, SE(θ) of the MLEθ are extracted. MLE approximates a normal distribution and the P-values are computed through the test statistic z =θ/SE(θ) with 95% confidence intervals:θ ± 1.96 SE(θ). An advantage of MLE is that its estimates are in general scale invariant and scale free [Bollen, 1989, p. 109]. Therefore, the values of the fit function do not depend on whether correlation or covariance matrices are analyzed, and whether original or transformed data are used. Model assessment is based on a chi-squared likelihood ratio test (LRT) statistic, known as model deviance: where logL(θ) is the log-likelihood Equation 4 evaluated to model-implied covariance matrix, Σ(θ) and logL(θ max ) is the log-likelihood for an exact fit; i.e., Σ(θ) = S. P-values are derived either from the χ 2 (df) distribution with df = p(p + 1)/2 − t degrees of freedom, or from a resampling bootstrap distribution [Bollen and Stine, 1992]. Non-significant P-values (P > 0.05) indicate that the model provides a good fit to data (i.e., the elements of S − Σ(θ), should be close to zero). Alternatively to the chi-square test, the Akaike s information criterion (AIC) [Akaike, 1974] can be used to compare fitted to saturated model, defined in SEM as [Bentler, 2016]: AIC = −2 logL(θ) + 2t ≈ χ 2 − 2df (6) where the rightmost member in Equation 6 is equal to the left member minus the constant term p(p + 1)/2. The model with the minimum AIC value is regarded as the best fitting model. In the chi-square (or deviance) metric it has been suggested that a ratio between the magnitude of χ 2 and the expected value of the sample distribution E(χ 2 ) = df less than 2 and between 2 and 3 is indicative of a good and acceptable data-model fit, respectively [Schermelleh-Engel and Moosbrugger, 2003]. The relationship between AIC and χ 2 /df thresholds become more evident by comparing Equation 5 and Equation 6. For the saturated model AIC = 0, and the fitted model should be selected if AIC < 0, which is equivalent to the condition χ 2 /df < 2. Another approximate SEM fit index comparing two models (fitted vs. saturated) as the chi-square test (or the chi-square ratio), is the Standardized Root Mean-squared Residual (SRMR), an overall descriptive statistic based on all pairwise differences between observed sample covariances (s) and implied model covariances (σ): SRMR values range from 0 to 1, where 0 is equivalent to a perfect fit. The acceptable range for the SRMR index is between 0 and 0.08 [Hu and Bentler, 1999]. If the model is a DAG, a global fitting statistic, based on the directed separation (d-separation) concept, can be applied [Shipley, 2000]. In a DAG, missing edges between nodes imply a series of independence relationships between variables (either direct or indirect). These independences are implied by the topology of the DAG and are determined through d-separation: two nodes, Y j and Y k , are d-separated by a set of nodes S if conditioning on all members in S blocks all confounding (or backdoor) paths between Y j and Y k [Pearl, 1998, Verma andPearl, 1990]. In a DAG, with Y j having a higher causal order than Y k , it is possible to find a minimal set of conditional independencies B U implying all the other possible independencies, defined by: The number of conditional independence constraints in the basis set B U equals the number of missing edges, corresponding to the number of degrees of freedom (df) of the model. If the graph is not very large or very sparse, it is possible to perform local testing of all missing edges separately, using the Fisher s z-transform of the partial correlation. An edge (k; j) is absent in the graph when the null hypothesis H 0 : cor(Y j ; Y k | pa(j) ∪ pa(k)) = 0 is not rejected. These individual tests implied by the basis set B U are mutually independent, thus their P-values p r can be combined in an overall test of the fitted model (i.e., the DAG) using Fisher s statistic: This statistic follows a chi-squared distribution with df = 2 × (number of missing edges). A non-significant P-value (P > 0.05) of C indicates that the model provides a good fit to data.

Decomposition of effects
In observational studies, as in network biology and medicine, there is the need for assessing causality over paths (i.e., chains of direct effects X → · · · → Y ) having biological relevance. One important feature of SEM is the decomposition of effects between variables. We may define three types of causal effects: direct effect (DE), indirect effect (IE), and total effect (TE). A DE is the causal effect X → Y of the j-th variable (X) on the k-th variable (Y ) of the model, when all other variables are kept constant (i.e., the effect quantified by path coefficients β jk ). Keeping the other variables constant will exclude all causal paths between X and Y , with the exception of the direct connection X → Y [Pearl, 1998]; therefore the DE does not consider mediators effect. In a graph, a path between two nodes X and Y can be viewed as a sequence of edges that may have either the same or different direction respect to neighbouring connections. A directed path between two nodes is a sequence of edges with the same direction, where node X is an ancestor of Y , and Y is a descendant of X. The TE includes the contribution of all directed paths connecting X and Y , whereas the IE can be defined as the difference TE − DE. Let us consider an acyclic mixed graph G (either a DAG or a BAP) and a directed path π ∈ G, traveling from node X to node Y , having length (i.e., number of edges) equal to r. Every j-th directed edge in π correspond to a DE quantified by a path coefficient β j;j+1 . The causal effect of X on Y through all the intermediate edges is given by the product of the underlying beta coefficients along a directed path from X to Y . In other words, we may consider π as the path through which information is propagated from the source node X to the target node Y . If there is more than one directed path π s (s = 1, ..., r(s)) from X to Y in G, the TE will be the sum of the contribution of each alternative path π through which information propagates from X to Y : The nodes of an acyclic mixed graph can be ordered topologically, such that we observe a directed edge j → k only if j < k. All possible paths from j to k are given by [ ∞ r=0 B r ] jk . Under node topological ordering, the path coefficents matrix B is strictly lower-triangular, it is invertible, and (I − B) −1 = I + B + B 2 + ..., implying [Drton et al., 2011]: Generally, in observational studies and genomics, the interaction between pairs of variables is estimated as the direct effect of the source variable X on the target variable Y , when all other predictors are kept constant. However, this interpretation is incomplete for systems in which mediators effects is not negligible, as in case of perturbation propagation though nodes of a community or a signaling pathway. In these cases, the TE is a more appropriate estimation, considering the simultaneous variation of all mediators. A formal definition of TE, as average causal effect (ACE), is provided by the post-intervention do-calculus, defined in Pearl [2009]: where E[Y | do(X = x)] denotes the expected value of Y when X is fixed to a reference value x by external intervention, as in a randomized experiment. In nonlinear models, the ACE will depend on the reference point. However, in a linear Gaussian SEM, x can assume every arbitrary value and the intervention effect (or causal effect) will be a real-valued parameter, given by [Pearl, 2009]: In acyclic mixed graphs, this constant parameter is given by the TE computed with the path method as ACE jk = (I − B) −1 jk . Alternatively, when the causal model is a DAG, a simple way to compute the ACE is by applying Pearl s backdoor criterion [Pearl, 1998], allowing ACE estimation through regression. The parent set pa(X) of X blocks all backdoor (i.e., confounding) paths from X to Y , and the ACE is equal to the θ Y X|Z coefficient in a multiple regression of Y on X + pa(X) [Pearl, 2009]. However, adjusting for pa(X) is typically inefficient with respect to its asymptotic variance, and an optimal adjustament set (O-set) with smallest asymptotic variance is obtained using the parent set of Y , pa(Y | D XY ), in a suitable latent projection graph D XY , called the forbidden projection [Witte et al., 2020]. The ACE is then computed as the θ Y X|Z coefficient in a multiple regression of Y on X + pa(Y | D XY ).

Evaluating system perturbation with two-group SEM
In several applications, the concept of perturbation arises when a system is altered (i.e., changed) by one or more external influences affecting its behaviour respect to a reference state (often described as physiological or healthy). However, in most cases, the mechanisms and extent of the alterations are unknown and data-driven discovery based on the comparison between experimental (i.e., altered) and healthy samples is the best possible option. A possible approach to the evaluation of system perturbation is multigroup SEM [Bollen, 1989, p. 355]. In SEMgraph a two-group SEM is implemented either using an exogenous group variable acting over a common model, or building a separate model for each group and comparing them. In the former, the experimental condition is compared to a control one through the use of an exogenous binary group variable X = {0, 1} acting on every node of the network. This model is converted to a system of linear equations that is common to both conditions, with µ(θ) = 0 and Σ(θ) being the implied mean vector and covariance matrix of the common model: where V (x) and V (y) are the sets of exogenous (i.e., sources) and endogenous (i.e., connectors and sinks) variables, respectively. Coefficients β j (adjusted by the parents of the j-th node) determine the effect of the group on the j-th node, while the common path coefficients β jk represent regression coefficients, adjusted by group effect. This type of SEM enables the identification of differentially regulated nodes (DRNs); i.e., variables showing a statistically significant variation in their activity (e.g., gene expression) in the experimental group respect to the control one. Alternatively, the two groups of samples (or subjects) are kept separated, with two different systems of linear equations: This enables the identification of differentially regulated edges (DREs). We define µ 1 (θ) = 0 and Σ 1 (θ) as the model-implied mean vector and covariance matrix for the experimental group (group 1), and µ 0 (θ) = 0 and Σ 0 (θ) the corresponding moments for the control group (group 0), respectively. Perturbation tests in the common-model and two-models approaches are based on the definition of two different test statistics: • z C = β j /SE(β j ), testing the null value for path coefficients β j of the group variable X and evaluating node activation or inhibition; jk ), testing the null value for path coefficients β jk differences between groups and evaluating edge activation or inhibition.
In both approaches, parameters are estimated through MLE and P-values for the z statistics are derived asymptotically from the N (0, 1) standard Gaussian distribution. The descriptive overall group perturbation on either nodes or edges can be computed, for both node and edge differences, based on the Brown s method for combining non independent, one-sided significance tests [Brown, 1975]. The method computes the sum of one-sided pvalues: X 2 = −2 j log(p j ), where the direction is chosen according to the alternative hypothesis (H 1 ), and the overall P-value is obtained from the chi-square distribution with new degrees of freedom f and a correction factor c to take into consideration the correlation among P-values [Brown, 1975]. The conversion of two-sided pvalues in one-sided pvalues is performed according to the sign of the z-test: H 1 : with at least one β j < 0 =⇒ p If the overall P-value < α (i.e., the significance level), we define node (or edge) perturbation as activated when the direction of the alternative hypothesis is positive. Conversely, the status is inhibited if the direction is negative.

Existing R packages for SEM
There are many popular software packages for conducting SEM analysis, including commercial programs like LISREL [Jöreskog and Sörbom, 2018], EQS [Bentler, 2016], and Mplus [Muthén and Muthén, 2017]. Within the R environment [R Core Team, 2020], lavaan [Rosseel, 2012] is the most popular package for SEM and latent variable analysis, although alternative R packages are available, including: sem [Fox, 2006], OpenMx [Boker et al., 2011], or RAMpath [Zhang et al., 2015]. All these packages use a specific model syntax or model matrix specification. The specialized package dagitty [Textor et al., 2016] estimates causal effects by covariate adjustment sets in four classes of causal models: DAGs, maximal ancestral graphs (MAGs), completed partially DAGs (CPDAGs), and partial ancestral graph (PAGs). Finally, piecewiseSEM [Lefcheck, 2016] enables the analysis of linear, non-linear, mixed, and survival models as a SEM. With the availability of large genome-wide data sets, several existing R packages implemented SEM-based strategies for Genome-Wide Association Studies (GWAS). Package GenomicSEM [Grotzinger et al., 2019] uses SEM for modeling the multivariate genetic architecture of groups of correlated traits, incorporating the genetic covariance structure into a multivariate GWAS framework. Package GW-SEM [Verhulst et al., 2017], based on OpenMx [Boker et al., 2011], does SEM association analysis of SNPs with multiple phenotypes or latent constructs on a genome-wide basis. Several recent SEM applications led to the development of sparse data analysis methods. Package regsem [Jacobucci et al., 2016], designed for fitting common classes of SEM models with low dimensional data (n > p), uses lavaan outut for subsequent penalized likelihood analysis. Package lslx [Huang, 2018] adopts a lavaan-like model syntax, where users can set each coefficient as free, fixed, or penalized. Finally, package sparseSEM [Cai et al., 2013] was developed for inferring gene regulatory networks from high-dimensional gene expression data and genetic makers. Current SEM-based R packages and programming languages do not provide environments for automated and data-driven causal inference for network biology and medicine, integrating model syntax with graph analysis. With the adjectives automated and data-driven, we highlight the possibility to import, build, manage, and improve causal models directly leveraging on knowledge (i.e., the input graph), quantitative data, and a possible exogenous perturbation source (e.g., a phenotypical trait or a disease). Therefore, the R package SEMgraph comes with the following functionalities: • Interchangeable model representation as either an igraph object or the corresponding SEM in lavaan syntax. Model management functions include automated covariance matrix regularization, graph-to-SEM or graph-to-DAG conversion, and graph creation from correlation matrices.
• Automated data-driven model building and improvement, through causal structure learning, bow-free interaction search, and latent variable confounding adjustment.
• Perturbed paths finding, community searching, and sample scoring, together with graph plotting utilities, tracing model architecture modifications and perturbation (i.e., activation or repression) routes.
• Heuristic graph filtering, node and edge weighting, resampling and parallelization settings for fast fitting in case of very large models.
This means letting the package finding possible solutions for high dimensionality, computational issues, and optimal causal architecture search.

The SEMgraph package
SEMgraph uses igraph objects as input, although an internal SEM representation in lavaan syntax is also used by functions requiring model fitting. The user may manually change between these representations using simple conversion utilities. These functionalities reflect the four main steps of a typical SEMgraph workflow (see Figure 1), including: (i) data import and graph pre-processing; (ii) causal architecture learning; (iii) searching for (perturbed) network communities and paths; and (iv) model fitting. Beside the proposed scheme, the building blocks shown in Figure 1 can be freely rearranged to generate custom workflows. A set of utilities for graph manipulation, format conversion, and visualization, complements the SEMgraph backbone, providing a self-sufficient toolkit for causal network analysis. The main goal of SEMgraph is the identification of critical players within the best causal model defined by three contextual sources of information that are simultaneously involved in model building and analysis: graph architecture, quantitative data, and the possible perturbing cause. To achieve this goal, SEMgraph integrates different packages for model management and causal inference. Packages igraph [Csardi and Nepusz, 2006] and lavaan [Rosseel, 2012] provide the basic environment for model manipulation and fitting, while glmnet [Tibshirani et al., 2012], dagitty [Textor et al., 2016], and GGMncv [Williams, 2020] constitute the backbone for DAG estimation and BAP search. The employed methodologies are general enough to accept different graph types (e.g., directed, undirected, or mixed) and any kind of quantitative data, including bio-molecular, sequencing, and clinical data.
Interactomes and data used in this work are available in the SEMdata data package at: https://github. com/fernandoPalluzzi/SEMdata. Interactomes are stored as igraph objects, so that they can be manipulated in R as any other graph. KEGG and Reactome are also present as a list of igraph objects (kegg.pathways and reactome.pathways, respectively), each being a single pathway. In this section, we use KEGG pathways to build a SEM from an available graph (although the input can be any igraph network object). As a first example, we could load a single pathway using: Function properties() takes an igraph object and shows basic information about graph components, topology, and the presence of edge weights. In the example above, the KEGG pathway Amyotrophic Lateral Sclerosis (ALS) is imported and the largest connected component is assigned to the graph object in igraph format. ALS RNA-seq expression data [Cooper-Knock et al., 2015] is downloaded, pre-processed, and stored in the alsData$exprs object as a matrix of 160 subjects × 17695 genes (with 139 ALS cases and 21 healthy controls). This is a high-dimensional data matrix, with the number of variables sensibly exceeding the number of observations (p >> n).
The three basic SEMgraph arguments are graph, data, and group. Regarding quantitative data, we always suggest to apply some kind of correction method to relax the normality assumption required by SEM. While log2 or ln transform are frequently used for count data (e.g., sequencing), we generally suggest the nonparanormal transform implemented in the huge.npn() function of the R package huge [Zhao et al., 2012].

Gene set analysis
When the perturbation of a biological network is associated to a disease, a systematic review of known biological networks may give important clues about the functional implication and molecular mechanisms of disease associated alterations. To this end, SEMgraph provides tools for gene set analysis (GSA), enabling fast and accurate testing at gene and pathway level. The core of SEM-based GSA methodology is implemented in the RICF-based method implemented in SEMrun(). In addition to node-level and model fitting estimates, SEMrun() RICF-based algorithm computes three global measures of pathway perturbation: • Total pathway perturbation adjusted by model covariances (D). D is the sum of residual decorrelated mean differences between groups and its sign determines pathway activation or inhibition. It is calculated as the square root of the Mahalanobis distance [Mahalanobis, 1936] of group mean vector • Total perturbation accumulated by sink nodes (A). The perturbation accumulation of the j-th target gene is given by its group mean difference weighted by the sum of incoming effects β j+ of its upstream (i.e., ancestor) genes. Thus A corresponds to the linear combination of the incoming effects on every pathway sink and its sign determines overall perturbation accumulation in terms of activation or inhibition.
• Total perturbation emitted by source nodes (E). Similarly to A, E is calculated as the linear combination of the outgoing effects β +k of every ancestor gene on downstream (i.e., descendant) genes, using the sum of outgoing effects as weights. The sign of E determines the overall perturbation emission in terms of activation or inhibition.
While A and E are suited for describing directed (hierarchical) networks, such as signaling pathways, D can describe perturbation in both directed and undirected networks. Permuted P-values of the aggregated statistics T = (D, A, E) for directed graphs, or T = D for undirected graphs, are evaluated by comparing the observed values of T with their random resampling distribution after a sufficiently high number of case/control labels permutations. In SEMgraph, this is implemented using the R package flip [Finos et al., 2018]. For large networks (p >> n), accurate P-value estimations are possible with no need for a large number of permutations (SEMrun() makes 5000 permutations), using the moment based approximation proposed by Larson and Owen [2015]. Once the empirical distribution of the permuted statistic T is obtained, the two-sided P-values are computed from the normal distribution with mean and standard deviation estimated by the empirical distribution. These estimates can be viewed at the top three lines of the gest object: R > ricf <-SEMrun ( alsData $ graph , data . npn , alsData $ group , algo = " ricf " ) R > head ( ricf $ gest )

Causal structure learning
In biological systems, curated networks rarely provide a complete explanation of data variability, often leading to a poor SEM fitting. This is exactly what happened when we fitted RNA-seq ALS data onto the ALS pathway provided by KEGG. In this case, the known ALS model is able to detect significantly perturbed nodes and edges, but a significant proportion of data variability is still unexplained, as shown by the global fitting statistics (deviance/df and SRMR). SEMgraph main goal is to learn the causal structure from data, applying the best tradeoff between model fitting and perturbation. Generally, causal inference applied to complex biological systems relay on models that are either a priori conceptual constructs given by the expert or curated knowledge-based networks from biological repositories (typically molecular, genetic, or protein-protein interaction databases) [Liu et al., 2020, Barabási et al., 2011. On the other hand, fully data-driven networks provide exploratory structures unravelling hidden knowledge, although they can be deeply affected by technical variability, and the specific method used to build them often results in very different or irreproducible networks [Liu et al., 2020]. SEMgraph offers three methods to cope with these limitations, improving the initial model by leveraging on both knowledge-based and data-driven procedures. Firstly, SEMdag() uses data and topological information from the input network to estimate the optimal directed (i.e., causal) edge backbone. In addition, SEMbap() uses missing edges from the input graph to search for bidirected edges (i.e., covariances) based on conditional independence tests, removing possible latent sources of confounding, encoded in the estimated covariance matrix. Finally, extendGraph() uses external interactomes (e.g., from a chosen biological database) and observed data to extend the input graph with new connectors. The next sections will dive into the details of these core functions.

DAG estimation
SEMdag() estimates the causal structure of a DAG, inferring the parent set of each variable, given data. However, the causal DAG is generally not identifiable, while only its Markov equivalence class is (i.e., the list of all equivalent DAGs). Recent work established that exact identification, and not just an equivalent class, is possible under specific assumptions, including nonlinearity with additive errors, linearity with non-Gaussian errors, and linearity with errors of equal variance Maathuis, 2017, Heinze-Deml et al., 2018]. A key observation, under the error equal variance assumption, is that ordering among conditional variances implies data-driven identifiability. After estimating the (top-down or bottom-up) ordering of a graph, its unique causal structure can then be inferred [Chen et al., 2019]. Alternatively, the natural ordering of a biological network (e.g., a gene or protein interaction network) could be typically obtained from a priori information (e.g., from signaling pathway or transcription factor binding databases) [Kanehisa andGoto, 2000, Jassal et al., 2020] or inferred using expression quantitative trait loci in the neighborhood of transcription start sites of known genes (cis-eQTL), used as causal anchors [Neumeyer et al., 2019]. The problem of estimating the skeleton of a DAG can be seen in terms of penalized likelihood, as suggested by Shojaie and Michailidis [2010]. Assuming that the topological ordering of the variables (nodes) Y 1 < Y 2 < · · · < Y p is known, where the relation k < j is interpreted as "node k precedes node j"(i.e., there is an acyclic path from node k to node j). Then, the estimate of the graph adjacency matrix A can be solved by p − 1 LASSO (Least Absolute Shrinkage and Selection Operator) regressions of the j-th outcome variable on the predictor variables k = 1, . . . , (j − 1) in the order list: whereÂ j; 1:j−1 denotes the first 1 to (j − 1) elements of the j-th column of A, and λ j is the tuning parameter for each LASSO regression problem. Separate penalty factors w jk can be applied to each coefficient to allow differential shrinkage. If w jk = 0 for some variables, it implies no shrinkage and those variables are always included in the selected model. Function SEMdag() converts the input graph in a DAG, sorts its nodes in a topological order, and solves the (j = 2, . . . , p) LASSO problems, using the extremely fast cyclic coordinate descent optimization algorithm, implemented in the R package glmnet [Friedman et al., 2010]. Using penality weights 0 (i.e., edge present) and 1 (i.e., missing edge) for the DAG adiacency matrix ensures that input DAG edges will be retained in the final model. Function SEMdag() takes an input graph, a data matrix or data.frame, and a reference directed interactome, if available: R > DAG <-SEMdag ( graph = alsData $ graph , data = data . npn , gnet = kegg , + d = 2 , beta = 0 , lambdas = NA , verbose = FALSE ) Argument gnet is used to specify the reference interactome as an igraph object. The reference network should ideally encompass the current knowkedge domain, providing the largest possible framework in which the input model is embedded. In our ALS example, we used kegg as a reference. This means that every added directed interaction is checked in KEGG. If a reference is not available, the gnet argument can be skipped and the DAG estimation will be fully data-driven (no reference-based validation is required). If gnet is not NULL, argument d determines the maximum geodesic distance between two nodes in the interactome, to consider the inferred interaction between the same two nodes in the DAG as validated. For instance, if d = 2, two interacting nodes in the output DAG must either share a direct interaction or being connected through at most one mediator in the reference interactome (in general, at most d -1 mediators are allowed). Typical d values include 2 (at most one mediator), mean_distance(gnet), or mean_distance(graph) (i.e., the average shortest path length for the reference network and the input graph, respectively). Argument beta (by default, beta = 0) is the threshold LASSO coefficient which retains only those variables for which the absolute value of the LASSO coefficients exceed the threshold (i.e., higher beta values correspond to sparser output DAGs) [Zhou, 2009]. Argument lambdas can be used to specify a vector of LASSO λ values. As an alternative, cross-validation (|V | > 100) or BIC-based (|V | ≤ 100) optimal lambdas for each response variable will be selected. If lambdas is NULL, the glmnet default is used, while if lambdas is NA (default), a tuning-free scheme is enabled by fixing lambdas = sqrt(log(p)/n), as suggested by Janková and van de Geer [2015]. Finally, enabling verbose, the output DAG (object DAG$dag) will be plotted: blue edges are the ones imported from the input graph, and red edges are the interactions inferred from data.

BAP deconfounding
Function SEMbap() provides local DAG fit evaluation and data de-correlation methods through BAP exhaustive search from an input DAG. The idea behind this approach is based on the causal interpretation of BAPs. Two connection types characterize a BAP: directed (direct effect) and bidirected (covariance). A directed edge from node Y j to node Y k represents a direct causal effect of Y j on Y k . A bidirected edge between Y j and Y k can be interpreted as a latent variable (LV) acting on both Y j and Y k . This LV may be the cause of a correlation between observed variables; i.e., the LV is an unobserved confounder [Spirtes et al., 2000]. This correlation can be misleading and can only be correctly explained if the presence of the LV that produce the confounding effect is evaluated. We can use Shipley s independent d-separation local tests (see Section 3.1) for DAG evaluation. As stated by Shipley [2000] (p. 217): "Because the individual tests implied by the basis set B U are mutually independent, each one can be tested separately at a significance level of α/B, where B is the number of tests performed, following a Bonferroni test logic. In this way, lack-of-fit in the whole model can be decomposed into lack-of-fit involving pairs of variables". Extensions to DAGs with correlated errors (i.e., BAPs) can also be obtained. There is currently no method to obtain a mutually independent basis set B U for a BAP. However, each pair of nonadjacent variables in a BAP model implies that there is some set of other observed variables that, on conditioning, will make the two nonadjacent variables independent. Hence, it is always possible to obtain a minimal set B M = {Y j ⊥ Y k | min(S)} consisting of each nonadjacent pair (Y j ; Y k ) in the model, and the smallest conditioning set S that makes these two variables independent and a significance level α [Shipley, 2002]. Significant local tests do not indicate a specific direction of causality, but provide information about which part of a DAG is not supported by the observed data, identifying the local misspecification given by the structural assumptions implied by the DAG, that may substantially alter the observed data variability. We assume that the model misspecification is determined by unobserved confounders (i.e., LVs). These LVs may include, for example, biomarkers that are not observed in experimental chips, environmental variables, or underlying populations among experimental samples. In summary, BAP search could be performed with d-separation or conditional independence (CI) tests between all pairs of variables with missing connection in the input DAG. A BAP is then built by adding a bidirected edge (i.e., bow-free covariance) to the DAG when there is an association between them at a significance level α, after multiple testing correction. Intuitively, it would be impossible to evaluate a causal DAG if the nuisance LVs, encoded in the bow-free covariances, are not properly removed. If the BAP represents a good compromise between map accurateness and non-identified factors, and the implied population precision matrix Ψ −1 is know, we can adjust (or de-correlate) the observed variables Y via, in the matrix form of equations (1) of Section 2.1: where A = Ψ −1/2 BΨ 1/2 , Z = Ψ −1/2 Y and D = Ψ −1/2 U . By definition this model assumes independence among error terms, i.e., is a DAG: cov(D) = D T D = I, and considering that det(Σ) = det(D) = 1, the log-likelihood function (see 4) in Section 2.1) is reduced to: The population precision matrix Ψ −1 is not known, therefore the adjusted (de-correlate) variables Z = Ψ −1/2 Y should be estimated from data. We suggest a two-step procedure: (i) fitting the constrained precision matrix Ψ −1 with null (zero) pattern corresponding to the DAG edges and the null (P > α) edges after the local d-separation or CI screening, and (ii) removing by conditioning out from the observed data the latent triggers responsible for the nuisance edges by the spectral decomposition of the fitted precision matrixΨ −1 = V LV T , from which we get the adjusted (de-correlate) matrix, Z = V L 1 2 V T Y . Using Z as new data, may lead to an improvement of DAG fitting, encoded in the matrix B. Since the confounding correlation in Z vanishes, we find that this de-correlation step is able to substantially decrease DAG badness of fit indices, such as C or SRMR, applying the best tradeoff between global model fitting and local statistical significance of path coefficients [reference]. In SEMgraph, the constrained estimation of the precision matrix Ψ −1 and the spectral decomposition are implemented using the constrained() function of the R package GGMncv [Williams, 2020] and the eigen() R function [R Core Team, 2020], respectively. The SEMbap() function has the following syntax: R > BAP <-SEMbap ( graph = alsData $ graph , data = data . npn , + method = " bonferroni " , alpha = 0 . 0 5 , + limit = 3 0 0 0 0 , verbose = FALSE ) Argument alpha determines the significance level after d-separation testing (by default, alpha = 0.05). Argument limit corresponds to the number of missing edges beyond which multithreading is enabled to reduce the computational burden. Finally, the verbose = TRUE option plots the intermediate covariance and LV structure used for the BAP search. The output of SEMbap() are four objects: the BAP (i.e., the union between the input graph and the bow-free covariance graph), the covariance graph, the directed graph of LVs underlying significant covariances (i.e., the canonical graph, where bidirected Y j ↔ Y k edges are substituted by directed edges Y j ← LV → Y k ), and a data.frame of the adjusted (i.e., de-correlated) data matrix Z.

Graph extension
Both directed (causal) edges inferred by SEMdag() and covariances (i.e., bidirected edges) added by SEMbap(), highlight emergent hidden topological proprieties, absent in the input graph. Estimated directed edges between nodes Y j and Y k are interpreted as either direct links or direct paths mediated by connector nodes. Covariances between any two bow-free nodes Y j and Y k may hide causal relationships, not explicitly represented in the current model. If this latent cause exists, the presence of a covariance can be considered as a potential source of model misspecification, and can be either data-driven adjusted or recovered from a reference database. Missing information could be recovered from a large interaction database, revealing two main types of system elements not explicitly represented by the current model: hidden mediators within a directed path, and hidden variables (e.g., LVs) masked by a covariance. Function extendGraph() leverage on these concepts to extend a causal model, importing new directed edges and connectors (i.e., mediators) from a given reference network: R > ext <-extendGraph ( g = list ( DAG $ dag , DAG $ dag . red ) , data = data . npn , + gnet = kegg , verbose = FALSE ) This function takes three input graphs: the first is the input causal model (i.e., a directed graph), the second can be either a directed or undirected graph, providing a set of connections to be checked against the reference network (i.e., the third input). In the example above, we used the DAG estimated by SEMdag() (object DAG$dag) and the new estimated edges (object DAG$dag.red) as first and second input, respectively. The reference network (gnet = kegg in our example) should have weighted edges, corresponding to their interaction P-values, as an edge attribute E(kegg)$pv (see Section 6). Then, connections in the second graph will be substituted by known connections from the reference network, intercepted by the minimum-weighted shortest path found among the equivalent ones by the Dijkstra s algorithm, as implemented in the igraph function all_shortest_paths(). If the reference netwok has unweighted edges, one random shortest path will be chosen among the equivalent ones. The interactions imported from the reference network will be added to the first causal graph. If the reference is an undirected network, an extended undirected graph will be inferred. The resulting graph is saved in the ext$Ug object. The whole process may lead to the discovery of new paths of information flow, from network sources to sinks, and the presence of novel connectors between them. Since added nodes can already be present in the input graph, network extension may create cross-connections between old and new paths and their possible closure into circuits.

Model estimation strategies
One of the goals of SEMgraph is to provide a set of causal interence tools also for users with minimal statistical expertise. To this end, we propose four preset strategies, implemented in the modelSearch() function, combining SEMdag(), SEMbap(), and extendGraph() functions. All strategies estimate a DAG through the adjusted (de-correlate) data matrix Z by iteratively update DAG and Z according to the following steps: 1. Initialization of G (0) , Z (0) and Ψ (0) with some suitable estimates; i.e., G (0) = DAG (0) , Z (0) = Y , and Ψ (0) = I.
This procedure is implemented in the modelSearch() function, following the same syntax of SEMbap() and SEMdag(). With pstop = TRUE, the algorithm can be halted when the Shipley s test P-value > 0.05. DAG estimation can be controlled through the argument alpha (i.e., the significance level for the FDR correction), where 0 corresponds to no data de-correlation, and beta (i.e., the LASSO coefficient threshold), where 0 maintains all the edges of the input graph. We suggest to start with alpha = 0.05 and beta = 0.1 to have a good balance between model adjustment and density. Then beta could be gradually decreased (0.1 to 0) to obtain more complex models, unless the Shipley s global fitting test P-value > 0.05. Similarly, argument alpha can be increased up to 0.2. A higher alpha level includes more hidden covariances, thus considering more sources of confounding, resulting in a higher data de-correlation. Considering the ALS example, the model search of a DAG using the search = "basic" procedure has the following code: R > # Model search R > model <-modelSearch ( graph = alsData $ graph , data = data . npn , gnet = NULL , + d = 0 , search = " basic " , beta = 0 . 1 , + alpha = 0 . 0 5 , pstop = TRUE , + verbose = FALSE ) # # Searching for missing covariances ... 2 2 0 Basis set 2 6 7 of 2 6 7 C _ test df pvalue 5 4 3 . 7 7 2 4 4 2 9 5 3 4 . 0 0 0 0 0 0 0 0 . 3 7 5 3 9 5 1 Done .
RICF solver ended normally after 2 iterations deviance / df : 1 . 7 5 5 4 4 5 srmr : 0 . 0 8 3 9 3 0 7 The resulting graph is shown in Figure 3A. We may then evaluate model perturbation using the SEMrun() function, as shown in Figure 3B. In addition, with SEMace() and SEMpath() we can evaluate ACE, path perturbation, and fitting of specific directed paths between a souce-sink pair. As an example, Figure 3C shows in yellow all directed paths between genes SOD1 (Entrez ID: 6647) and NEFM (Entrez ID: 4741). R > pert <-SEMrun ( model $ graph , model $ data , alsData $ group ) R > ace <-SEMace ( model $ graph , model $ data , alsData $ group , method = " BH " ) R > path <-SEMpath ( model $ graph , model $ data , alsData $ group , + from = " 6 6 4 7 " , to = " 4 7 4 1 " , + path = " directed " , + verbose = TRUE ) All the steps done by modelSearch() are shown to standard output, and the resulting graphs are visualized in Figure 3 A-B-C. Following the example above, the extracted DAG model has a good fitting (deviance/df < 2, srmr near 0.08, and C-test with P-value > 0.05). The output model object contains model fitting as a lavaan object (model$fit), the output graph coloured according to node and edge relevance during the estimation steps (model$graph), and the adjusted dataset (model$data). With search = "basic", we enabled a data-driven model search strategy, where model structure is based only on data and no validation against a Added edges are highlighted in red, while blue edges are maintained from the input ALS graph. Panel B shows node-level perturbation, estimated by SEMrun(): pink nodes are activared, while lightblue nodes are inhibited. Edges are coloured according to their significance: significant direct effects and covariances (P-value < 0.05) are either red (estimate > 0) or blue (estimate < 0), while non-significant ones are gray-shaded. Panel C highlights in yellow all directed paths between genes SOD1 and NEFH, showing how SEMpath() may help us to clarify and evaluate causal effects between perturbed source-target pairs, within an entangled cluster.
reference network is done (i.e., gnet = NULL and d = 0). In the example above, we set beta to 0.1 to reduce graph density. As a result, input edges could be removed and new ones could be added, partially reshaping model architecture. The aim is to generate an improved model, achieving a good overall fitting (for DAGs, the main fitting index is the Shipley s global test P-value > 0.05), showing the best possible balance among model complexity, fitting, and perturbation. In this example, the output model shows how the SOD1 gene deregulation is causally connected to the deregulated gene NEFM, implied in the maintainance of a physiological neuronal caliber. This indirect connection (the yellow path in Figure 3C), absent in the input model (Figure 2), is now possible thanks to the new connections BCL2-DAXX, CYCS-MAPK13, and TOMML40-MAP2K6 (red links in Figure 3A), showing a tight association between apoptosis and neuronal caliber regulation, both dysregulated in neurodegenerative disorders.
Conversely, we could take advantage of known interactions, importing them in our model to extend it. We define them as knowledge-based strategies. The outer (search = "outer") strategy relies on an external reference network and the input graph topology, to assess the presence of possible hidden mediators (d > 1), including them in the output model. If one is not interested in adding new mediators from the reference, but still wants to evaluate the presence of internal hidden indirect (i.e., mediated) paths, the search argument can be set to "inner": the reference network is still used, but only to validate the new direct and indirect paths added to the model. Both inner and outer search strategies rely on the initial estimation of a DAG, working as a causal model backbone. Finally, we can use a direct strategy (search = "direct"), where the input graph structure is improved only through direct (i.e., adjacent) link search, followed by interaction validation and import from the reference network, with no mediators (i.e., d = 1).

Network clustering and scoring
SEMgraph offers the possibility to define topological communities of an input graph, generating scores for each statistical unit (i.e., subject) by using data from nodes belonging to communities. Clusters can be defined using the algorithms implemented in the R package igraph [Csardi and Nepusz, 2006] and then they can be fitted as independent models. Among the available clustering methods, we suggest either the walktrap community detection algorithm (WTC), based on random walks and develobed by Pons and Latapy [2005], or the edge betweenness clustering (EBC), developed by Newman and Girvan [2004]. The former tends to generate as many clusters as needed to cover the whole input network. The latter generally produces one large subnetwork and other much smaller communities or singletons. In case of trees, our implementation of the tree agglomerative hierarchical clustering (TAHC), proposed by Yu et al. [2015], is the suggested solution. Our aim here is to provide a tool yielding different (orthogonal) local models when dealing with large networks (|V | > 100). Beside network size, we generally recommend clustering when there are evidences of possible functional modules (i.e., subnetworks whose members are involved in a specific process). Sample scoring can be generated by three different hidden models: the latent variable (LV) model, the composite variable (CV) model, and the unobserved variable(UV) model. The LV model consists in a confirmatory factor analysis (CFA) with one factor and specific error variances [Bai and Li, 2012]: The CV model consists in a CFA with one factor and equal (common) error variances, equivalent to a principal component analysis (PCA) [Bai and Li, 2012]: The UV model corresponds to a fixed factor analysis (FFA) model with one factor projected on the observed X set, with zero residual variance, and equal (common) error variances fixed to 1. This is equivalent to a reduced-rank regression analysis (RRA) [Davies and Tso, 1982]: In every hidden model, Y j are the random observed endogenous variables of each module, and E j the residual errors, with j = (1, . . . , q). In the UV model, X k represent the observed variables, with k = (1, . . . , r).
Variables F , C, and U correspond to the scores assigned to each subject, for each cluster, representing the latent factor, the principal component, and the unmeasured variable of the hidden model, respectively. In the UV model, the factor scores U are found in the space spanned by the source variables X of each module (i.e., they are projected on X). Factor scores U are also called unmeasured variables, rather than latent variables or factors, because they can be expressed as a function of the observed X variables. Although the underlying variables are not actually measured, the scores U are measurable [Bentler and Weeks, 1980]. SEMgraph generates cluster scores using the factor.analysis() function of the R package cate [Wang and Zhao, 2019], an efficient package for high-dimensional factor analysis models. Only modules for which cluster scores represent 50% or more of the total variance are considered. The general syntax for network clustering is the following: R > U <-clusterScore ( model $ graph , model $ data , alsData $ group , HM = " LV " , + type = " ebc " , size = 5 ) Arguments type and size set the clustering algorithm and the minimum group of nodes to generate a cluster (groups smaller than size are considered as singletons). The suggested type is the one between the walktrap ("wtc") and edge betweeness ("ebc") community detection algorithm resulting in the largest number of nodes included in clusters, with a minimum cluster size of 5. Argument  Every cluster is represented by a LV and each estimate measures the global effect of the group over it. Together with the fitted hidden model U$fit, clusterScore() returns the data.frame containing cluster scores (U$dataHM) and a vector indicating the cluster membership for every node (U$membership). Topological cluster networks (without subject scoring) can be produced independently from clusterScore(), using the clusterGraph() utility: R > C <-clusterGraph ( model $ graph , type = " ebc " , size = 5 , verbose = FALSE ) The clusterGraph() arguments are equivalent to those used in clusterScore(). In addition, function cplot() generates and plots separate graphs for each cluster, and returns the input graph with a new attribute V(graph)$color, where each cluster membership correspond to a different color: R > G <-cplot ( graph = model $ graph , membership = U $ membership , map = TRUE ) Arguments graph and membership correspond to the input graph and node membership, respectively. If the map argument is set to TRUE, the input graph is colored according to cluster membership (object G$graph), as shown in Figure 4.

Network weighting and filtering
A common problem in network biology and medicine is to filter large models (i.e., networks) to highlight informative interactions, paths and communities, and find phenotype-associated factors. Although network topology alone provides enough information for many filtering algorithms to work, edge and node weights may significantly improve this process. Unlikely several network analysis tools, SEMgraph can work on both edge-and node-weighted graphs.

Network weighting
SEMgraph uses two different strategies to weight nodes and edges on the base of the perturbation induced by an external influence (that is internally represented by the binary group variable). Node-level perturbation consists in the detection of a subset of nodes, called seeds, having a key topological or functional role. Currently three seed detection methods are provided: closeness percentile (q), prototype clustering (h), and t-test (alpha). Three binary seed attributes (1: seed, 0: non-seed) are associated to each node. The first set includes nodes with closeness larger than the q-th percentile (computed through the igraph function closeness()). The second set includes nodes belonging to the prototype cluster generated using the R package protoclust [Bien and Tibshirani, 2011] by cutting at distance h = 1 − |r| (with r being the Pearson s correlation coefficient). The third set of seeds includes nodes with significant group effect at alpha level as measured by P-values testing a bivariate linear model, fitted with the R function lm(). Edge-level perturbation consists into edge weights (P-values and z signs) using three different trivariate procedures: a SEM model, a covariance model, and the Fisher s r-to-z transform. The SEM model implies testing the group effects on the source node j and the sink node k. A common group effect model of X = {0: control; 1: case} is fitted: and a weighted sum defines the new parameter w combining the total effect (TE) of the binary group on source and sink nodes: where d k and d j are the outgoing degrees (i.e., the number of all direct outgoing connections in the input graph, for each node) of the k-th sources and j-th sinks, respectively. The P-values are computed through the z-test = w/SE(w) of the combined TE of the group on the source node j to the sink node k.
The covariance model implies testing the group effect simultaneously on the source node j, the sink node k and their interaction (j; k). In this case, a two-group (0: controls, 1: cases) covariance model with an intercept parameters is fitted: A weighted sum defines the new weight parameter w, combining the group effect on the source node (mean difference β j − α j ), the sink node (mean difference β k − α k ), and the source-sink connection (covariance difference ψ jk − φ jk ): where d k and d j are the degree (i.e., the number of incoming and outgoing connections) of the source node k and sink node j, respectively. P-values are computed throug the t-test = w/SE(w) on the combined difference of the group over the source node j, the sink node k, and their connection (j; k). Finally, the correlation method tests the group difference between the correlation coefficients r (1) jk and r (0) jk of connected nodes j and k, by applying the Fisher s r-to-z transform: z = 0.5 log[(1 + r)/(1 − r)], with t = (z (1) − z (0) )/ 1/(n 1 − 3) + 1/(n 0 − 3) [Fisher, 1915]. Usually, the SEM model is similar to the covariance model if the covariance difference is close to zero. The correlation method is similar to the covariance model if the source and sink mean differences are close to 0. From a computational point of view, the r-to-z transform is the fastest of the three methods. The SE(w) is estimated through lavaan, specifying the new parameter w using the := operator. Graph weighting can be applied to any graph with the weightGraph() function: R > W <-weightGraph ( alsData $ graph , alsData $ exprs , alsData $ group , + method = " r 2 z " , seed = c ( 0 . 0 5 , 0 . 5 , 0 . 5 )) Here we used the group and seed arguments to include the perturbation information (i.e., the case-control difference) into weights (by default, seed = "none"). The object W correspond to the input graph with three new node (i.e., seed) attributes (namely, V(W)$pvlm, V(W)$proto, and V(W)$qi) and two new edge attributes (namely, E(W)$pv and E(W)$zsign). Each of the three seed attributes is a binary vector, taking value 1 for a seed and 0 for a non-seed. Each seed type can be defined by a vector of three cutoffs: the significance level of the direct group effect, the prototype clustering distance corresponding to 1 − abs(r) (with r being the Pearson s correlation coefficient), and the closeness percentile, respectively equal to 0.05, 0.5 and 0.5 in the above seed code. The edge pv attribute is the vector of P-values yielded by the selected method (in the example above, we used "r2z"), providing continuous edge weights. Methods "sem" and "cov" can be specified for using the SEM or covariance weighting model, respectively. These two methods are slower than "r2z", but multicore usage is automatically enabled for large networks.The edge attribute zsign is the sign of the test statistic z, that can be interpreted as activated (+1), inhibited (-1), or neutral (0, with P-value > 0.05), providing categorical edge weights.

Active module finding
Reducing large complex graphs by either extracting critical relationships or perturbed disease modules is key to focus relevant information into simpler subgraphs. Usually the detection of critical nodes and edges and disease modules supplements prior knowledge about disease-associated genomic elements, leveraging on emergent properties that can be revealed only through network analysis. Although a wide range of different methods have been proposed through years [Liu et al., 2020], SEMgraph proposes four fast procedures, including: random walk with restart (RWR), heat diffusion (HDI) [Dirmeier, 2018], Steiner tree (ST) [Kou et al., 1981], and the union of shortest path graph (USPG) [Chan, 2010]. The RWR and HDI algorithms, implemented in the R package diffusR [Dirmeier, 2018], starts from an initial distribution of node P-values, then computing their stationary distribution. They spread information in the form of node weights along the edges of a graph to other nodes. The information (i.e., node weights) is iteratively propagated to other nodes until an equilibrium state or stop criterion occurs. The RWR starts from an initial distribution p 0 , then computing their stationary distribution. The process depends on a restart probability parameter that allows to regulate how often the RWR returns back to the initial values. The HDI starts from an amount of heat h 0 , and gets stationary distribution using the Laplacian matrix of a graph. Every iteration (or time interval) t heat streams from the starting nodes into surrounding nodes. The reduce graph G 0 for both RWR and HDI is the induced subgraph of the input graph G defined by the S nodes in the top-rank scoring (q-th quantile of the stationary distribution). The ST and USPG algorithms minimize paths between seed nodes S with minimum weights (i.e., maximum perturbation), we will refer to these paths as the maximum perturbation paths (MPPs), and a perturbation route being an MPP subset. Edge weights are defined as inverse of negative logarithm of the P-values. In this way, edge weights are in a positive continuous range [0, ∞). We refer to this scale as perturbance [Palluzzi et al., 2017]. The lower the P-value (or the w-value), the higher the perturbation. The ST problem [Kou et al., 1981] is to find a connected subgraph G 0 of G such that the additional nodes C (called the Steiner or connector nodes) connecting a priori (e.g., disease) seed nodes S (called the terminal nodes) minimize the sum of the weight of every edge in the subgraph G 0 . Various heuristic algorithms are available for solving ST problem. The SEMgraph function SteinerTree() applies a fast algorithm approximation, first proposed by Kou et al. [1981]. USPG() generates a subgraph G 0 of G as the union of the significantly pertubed shortest paths between a priori seed nodes S. The USPG problem is focused in computing the minimum shortest path between each pair of nodes, selecting MPPs. Thus, each shortest path considers not only the number of links needed to reach the disease-associated node, but also the number of disease-associated edges that are included in the path. Active module finding methods are implemented in the wrapper activeModule(): R > R <-activeModule (W , type = " rwr " , seed = " pvlm " , eweight = " pvalue " ) Function activeModule() takes two main inputs: a weighted network W and a reduction method specified by the argument type, including algorithms: RWR ("rwr"), HDI ("hdi"), the Kou version of the Steiner tree algorithm ("kou"), and the USPG method ("usp"). The optional argument seed takes either a binary vector of custom seeds or one among "qi", "proto", or "pvlm", corresponding to the seed types generated through the weightGraph() function. Finally, the optional argument eweight allow the user to specify either a custom vector of distances or one between "pvalue" and "zsign", generated using the weightGraph() function.

Graph conversion utilities
SEMgraph uses two standard graph and model formats, being respectively igraph and lavaan. Different SEMgraph functions may require DAG conversion, and a common way of generating undirected graphs from data is to convert a correlation matrix to an adjacency matrix, using a threshold over the correlation coefficient. These conversion types can be obtained with the following functions: R > # Extract an undirected network from a correlation matrix R R > R <-cor ( model $ data ) R > U <-corr 2 graph (R , n = nrow ( model $ data ) , type = " marg " , method = " BH " , + alpha = 0 . 0 5 ) Function graph2lavaan() simply generates a SEM (lavaan syntax) from the input igraph object, while lavaan2graph() does the opposite operation. Natively, path diagrams are directed and may contain covariances (i.e., bidirected edges). However, lavaan2graph() may generate either an undirected graph (directed = FALSE) or a directed graph without covariances (psi = FALSE). By default, both these arguments are set to TRUE. Function graph2dag() extract a DAG from an input graph through a two-steps pruning strategy. Firstly, bidirected edges are removed from the input graph. Secondly, edge are weighted by P-values, through marginal correlation testing. When a cycle is detected, the edge with highest P-value is removed, breaking the cycle. If bap=TRUE, a BAP is then generated merging the output DAG and the bidirected edges from the input graph. The function corr2graph() offers the possibility to apply a threshold to the input correlation matrix (R), based on either marginal (type = "marg") or conditional (type = "cond") correlation testing. The arguments alpha and method set the significance level over the adjusted P-value. In addition, the Minimum Spanning Tree (type = "mst") or the Triangulated Maximally Filtered Graph (type = "tmfg") are implemented for filtering the amount of meaningful correlation structure. A MST is a subset G = (V = p, E = p − 1) of a edge-weighted graph that connecting all the p nodes (variables) together, without cycles and with the minimum possible total edge weight. The TMFG method [Massara et al., 2016] uses a structural constraint that limits the number of zero-order correlations included in the network, yielding the subgraph G = (V = p, E = 3p − 6).

Disease modules detection
In this case study, we want to build a causal model for the Frontotemporal Dementia, a neurodegenerative disorder characterized by cognitive and behavioural impairments [Palluzzi et al., 2017]. The aim is to produce a map of the DNA methylation (DNAme) alterations caused by FTD, without an initial disease model. For this example, we will use DNAme data from Li et al. [2015] (GEO accession: GSE53740), stored in the SEMdata package. Although not necessary, having a collection of known disease-associated networks is an advantageous starting point. For instance, the KEGG BRITE database allows to search for terms, including human disorders, that could be associated to one or more pathways. The term Frontotemporal lobar degeneration (an alias for FTD; KEGG ID: H00078) is associated to 6 KEGG pathways: MAPK signaling pathway (hsa04010), Protein processing in endoplasmic reticulum (hsa04141), Endocytosis (hsa4144), Wnt signaling pathway (hsa04310), Notch signaling pathway (hsa04330), and Neurotrophin signaling pathway (hsa04722). Starting from database queries is not a requirement. Initial network models may also derive from exploratory analyses, such as overrepresentation analysis (ORA) or gene set enrichment analysis (GSEA) [Reimand et al., 2019]. In SEMgraph, we can use the SEMgsa() utility to apply gene set analysis (GSA) on a collection of networks.
R > # Seed extraction R > seed <-V ( graph )$ name [ V ( graph )$ name % in % unique ( unlist ( ftd . gsa $ DRN ))] This leads to a graph of 581 nodes and 3817 edges, and a seed list of 29 DRNs. With large networks, a heuristic solution to maximize model perturbation within a relatively less complex model is graph weighting and filtering. Fisher s r-to-z is the fastest weighting solution for very large networks: R > W <-weightGraph ( graph , pc 1 . npn , ftdDNAme $ group , method = " r 2 z " ) We may then apply activeModule() to generate our reduced perturbed model. A very fast and efficient solution to estimate the perturbed backbone of the input graph is the Steiner tree (type = "kou"), traversing all our seeds, while minimizing the total weight of the network (i.e., maximizing edge perturbation): R > # Steiner tree extraction R > R <-activeModule (W , type = " kou " , seed = seed , eweight = " pvalue " ) R > # Entrez ID conversion R > V ( R )$ label <-mapIds ( org . Hs . eg . db , V ( R )$ name , ' SYMBOL ' , ' ENTREZID ') R > # Perturbation evaluation and plotting R > pert <-SEMrun ( graph = R , data = pc 1 . npn , group = ftdDNAme $ group ) R > gplot ( pert $ graph ) The kou algorithm yielded a Steiner tree R of 59 nodes and 58 edges. In case of very large and dense network, Steiner trees are fast and accurate solutions for finding the essential backbone of the network, providing a valuable insight of the key mediators supporting the information flow of the system. The inferred FTD perturbed backbone, given DNAme data, is shown in Figure 5. The perturbed backbone (i.e., the tree connecting seeds, maximizing edge perturbation, with minimum possible cost), can be exploited to build an improved causal model with the modelSearch() function (see Section 4.4). Code and output of the backbone improvement pipeline can be found in the supplementary file available at: https://github.com/ fernandoPalluzzi/SEMgraph/blob/master/SEMgraph-replicationCode.R.

Summary and discussion
SEMgraph is a fast and user-friendly, yet powerful package for causal network analysis. Bridging graph theory and structural equation modeling (SEM), it conveys causal structure learning within the framework of multivariate linear networks, combining accurate data-driven discovery and confounding adjustment to model interpretability. The SEMgraph philosophy includes two main aspects: the technical aspect of usability and the concept of contextual analysis. The former is achieved by introducing automated and data-driven settings for both optimal algorithm tuning and scalability, to relieve the user from time-consuming and/or arbitrary choices. The latter is founded on the notion of model fitting/perturbation tradeoff, implying that destabilizing or pathological processes arise within a stable or phyisiological system context, under the action of a perturbing signal. Given the advance in causal structure learning, our direction is to incorporate the most recent proposals in DAG search [Heinze-Deml et al., 2018], cyclic SEM fitting , and confounding adjustment [Buhlmann and Cevid, 2020]. In addition, new examples, pathways and interactome for graph usability can be easily added in SEMdata, providing an extensible platform.