Illustration of the proposed pipeline
The pipeline, similar to that described by [12], is illustrated in Figure 1. In step1, DEGs were obtained by Significance Analysis of Microarray (SAM) [16] and the perturbed pathways by Signaling Pathway Impact Analysis (SPIA) [17] using KEGG database [18]; in step2, the pathway models were generated by network analysis and evaluated with SEM in step3 for: 1) improving the models generated by the biological pathways found; 2) testing if the pathway models differ across groups by multiple group analysis; 3) screening of single differences in expression (gene nodes) and in regulation (genegene edges) across groups. We used the implementation provided by the R packages samr [19] and SPIA [20] for SAM/SPIA procedures.
Structural equations models (SEM)
SEM is a statistical procedure for confirmatory causal inference originated from path analysis proposed in 1921 by the American geneticist Sewall Wright [21]. It is based on multivariate linear regression equations, where the response variable in one regression equation may appear as a predictor in another equation. Indeed, variables may influence oneanother reciprocally, either directly or through other variables as intermediaries. Additionally, correlated or uncorrelated unmeasured variables may indicate the presence of unobserved factors that influences observed variables.
In general, a SEM consists of a structural model describing (causal) relationships among latent (hidden) variables and a measurement model describing the relationships between the observed measurements and the underlying latent variables [22]. Here we consider SEM with observed variables only, and therefore no measurement models have been used [23]. Specifically, let V to be the index set of the Y observed variables, represented as the “parent” set {pa(i)i ∈ V}, i.e. the explanatory variables of Y_{
i
}, or as the “siblings” set {sib(i)i ∈ V}, i.e. the unmeasured linked variables with Y_{
i
}, respectively. These sets determine a system of linear equations:
{Y}_{i}={\sum}_{j\in \mathit{\text{pa}}\left(i\right)}{\beta}_{\mathit{\text{ij}}}{Y}_{j}+{U}_{i}\phantom{\rule{1em}{0ex}}i\in V
and a covariance structure:
cov\left({U}_{i};{U}_{j}\right)=\left\{\begin{array}{ccc}\hfill {\psi}_{\mathit{\text{ij}}}\hfill & \mathrm{if}\phantom{\rule{0.75em}{0ex}}i=j& \text{or}\phantom{\rule{1em}{0ex}}j\in \mathit{\text{sib}}\left(i\right)\\ \hfill 0\hfill & \text{otherwise}\end{array}\right.
The system of linear equations affirms that every node is characterized by the relationships with his parents, while the covariance structure describes the relationships between unobserved nodes.
They encode two distinct causal assumptions: (1) a “weak” assumption on the possible existence of (direct) casual influences of explanatory variables on Y_{
i
}, and (bidirected) correlated unmeasured variables U_{
i
}, quantified by the regression (path) coefficients β_{
ij
}, and the covariances ψ_{
ij
}, respectively; and 2) a “strong” assumption based on the absence of (direct) causal influences or (bidirected) correlations of any observed/unobserved variables neither in the “parents” set pa(i) nor in the “siblings” set sib(i). In other terms, a weak assumption excludes some values for a parameter (the null value zero), but permits a range of other values; while, strong assumptions assume that parameters take specific values (null value zero or a fixed a priori value). The linear equations and the covariance structure can be encoded and visualized in a “path diagram”, that is, a mixed graph G = (V,E) featuring both directed (→) and bidirected (↔) edges. The vertex set V includes the genes and the edge set E represents relations or reactions among vertexes. The “activity” of a given gene is embedded in a path diagram: the actions performed by a gene on downstream molecules, and the signals that it receives from upstream regulators. “Directed edges” between two genes (j→i, if and only if j ∈ pa(i)), measured by path coefficients (ranging usually from 1 to 1, if genes are standardized), represent expected change in the activity of the downstream gene, given a unit of change in the upstream gene while the values of the other genes remain constant. Considering that paths reflect a direct influence of one gene on another, negative path coefficients indicate ensemble inhibition (negative control) and positive paths measure net activation (positive control). “Bidirected edges” between two genes (j↔i if and only if j ∈ sib(i), or equivalently, if and only if i ∈ sib(j)) encode a hidden common cause that may be interpreted as latent or unobserved measurement of upstream regulators that could account for the observed covariances (correlations) between the two genes.
One important feature of SEM is that direct and indirect effects can be computed and compared. “Directed paths” between two genes are the sequence of all the directed edges (j → k_{1},… → …, k_{m} → i) from genes Y_{j} to Y_{i}. Each directed path is a channel along which information (gene’s activities) can flow, and so a “total effect” (TE) of gene Y_{j} on gene Y_{i} is defined as the total sum of the products of the sequence of arrows (edges) along all directed paths from Y_{j} to Y_{i}. Accordingly, a “indirect effect” (IE) of gene Y_{j} on gene Y_{i} represents the portion of the total effects not considering the directed edge effect (DE), i.e. TE = DE + IE.
The wellknown SEM analysis consists of four steps [22]: a) definition and identification of an initial path model, b) estimation of parameters, c) evaluation of the fitting, and d) model modification.
Initial model building
Specification of initial pathway models (step a) was obtained taking the perturbed pathways and converting them in directed graphs or gene networks. Generalizing, each pathway can be seen as a mixed graph. The idea is to understand how DEGs are connected in the perturbed pathways by other microarray genes. A natural way to solve this problem is to identify the shortest paths (geodesic distance) between DEGs. The geodesic distance d_{geo}(y_{i}, y_{j}) between two DEGs, y_{i} and y_{j}, is defined as the minimum distance between these two genes. The function get.shortest.paths( ) of the R package igraph was used to compute all the shortest paths [24]. Define the microarray genes, DEGs, and not DEGs in the following way: MG = {mg_{
1
}, mg_{
2
}, …, mg_{
m
}}; DEG = {deg_{
1
}, deg_{
2
},…, deg_{
n
}} and NDEG = {ndeg_{
1
}, ndeg_{
2
},…,ndeg_{
mn
}}, where MG = DEG ∪ NDEG and DEG ∩ NDEG = {∅}. Each shortest path could be represented as a list of nodes Y_{
k
}= (y_{
i
}, y_{
i+1
},..., y_{
j1,
}y_{
j
}) and a list of the corresponding edges E_{k} = (e_{i(i+1)}, ..., e_{(j1)j}) where (y_{
i
}, y_{
j
}) ∈ DEG; (y_{
i+1
},…, y_{
j1
}) ∈ (DEG ∨ NDEG); Y_{k} ⊆ Y and E_{k} ⊆ E. The shortest paths for each pathway constitute k (k = 1, …, K) subgraphs G_{k} = {Y_{k},E_{k}} of the original pathway, G = {Y, E}. Not all DEGs and NDEGs will be included in the shortest paths. Therefore, we define two new sets: DEG(s) and NDEG(s) respectively, the sets of DEGs and the set of not DEGs that include all genes in shortest paths, where DEG(s) ⊆ DEG and NDEG(s) ⊆ NDEG.
To reach the final model the NDEG(s) that connect DEG(s) are grouped in basis to their PIR superfamily (PIRSF). Based on the evolutionary relationships of whole proteins, this classification system allows annotation of both specific biological and generic biochemical functions. The PIRSF can be represented as SUPF = {supf_{
1
}, supf_{
2
},…, supf_{
g
}}, where ∀ supf_{
i
} ⊆ NDEG(s) and ∀_{i≠j} supf_{
i
} ∩ supf_{
j
} = {∅}. Using this information, each original shorthest path G_{k} = {Y_{k},E_{k}} is transformed in a new shortest paths, G*_{k} = {Y*_{k} = (y*_{i}, y*_{i+1},…, y*_{j1,} y*_{j}), E*_{k} = (e*_{i(i+1)},…, e*_{(j1)j}) }, where (y*_{i+1},…, y*_{j1}) ∈ (DEG(s) ∨ SUPF ∨ NDEG(s)) and E* ⊆ E. The function to obtain the final graph, G* = (Y*,E*), is described in the following pseudocode:
The graph G* = (Y*,E*) is the fusion of all shortest paths found, where each node and each edge cannot be present more than once, the selfloops are not considered but the feedbacks and cycles were preserved. To ensure the identification of the initial models, the “blockrecursive” criterion of Rigdon [25] and the “bow free” criterion of Brito and Pearl [26] were applied. The first affirms that reciprocal relationships, feedback loops, or covariances are segregated into groups, or blocks, with no more than two equations per block. The second affirms that a model is ensured if variables standing in direct causal relationships (directed edges) do not have correlated errors (bidirected edges). So a new graph is attained in which the DEGs are connected by other DEGs, PIRSFs or NDEGs. In this way a model was created for each significant pathway found.
Successively, PIRSFs composite variables are defined considering not DEGs, present in shortest paths, as causal indicators of latent (hidden) constructs [27]. To generate the PIRSFs, a principal component analysis (PCA) was performed on genes belonging to a PIRSF and the principal component scores of the first principal component (PC1) were considered as the values that characterize the PIRSF. Only PIRSFs for which the PC1 represents 50% or more of the total variance are considered. At the end of process we have the initial SEM model.
The pathway graph conversion, the graph analysis, and the PC1 scores are obtained by graphite [28], igraph [24] and stats [29] R packages, respectively, while R functions for network analysis are implemented ad hoc, and are available Additional file 1.
SEM fitting
For parameter estimation (stepb), the classic derivation of the Maximum Likelihood estimation (MLE) is used, that assumes all observed variables are jointly Gaussian. The system of structural equations and covariance structure of unmeasured variables can be written compactly in a matrix form as: Y = BY + U, and Cov(U) = Ψ. This specification induces a structure on the covariance matrix of the joint distribution of the genes Y as:
\Sigma \left(\theta \right)={\left(IB\right)}^{1}\Psi {\left(IB\right)}^{T}
where θ=(β; ψ) is the list of the free parameters in the model of dimension t. The unknown parameters are estimated so that the implied covariance matrix Σ(θ) is close to the observed sample covariance matrix S.
The assessment of the model (stepc) involves the Likelihood Ratio test (LRT) converted to a Chisquare test of the fitted model. Specifically, let Σ_{0} = E(S) to be the true population covariance matrix, and Σ(θ) the modelimplied covariance matrix. The hypothesis to be tested is:
{H}_{0}:{\Sigma}_{0}=\Sigma \left(\theta \right)\phantom{\rule{1em}{0ex}}\mathit{\text{vs}}.\phantom{\rule{1em}{0ex}}{H}_{1}:{\Sigma}_{0}\ne \Sigma \left(\theta \right)
The chisquare test is then χ^{2} = 2logLRT = 2[logL(Σ(θ))  logL(Σ_{0})] with d = p(p + 1)/2t degree of freedom (d.f.). logL( ) represents the loglikelihood of the model, p the number of genes, t the number of parameters of the fitted model. Notsignificant Pvalues (P >0.05) indicate that the model provides a good fit to the data. The Pvalues are derived by using the χ^{2}(d) distribution or a resampling bootstrap distribution [30].
An alternative procedure [31] assumes that in the population, a modelimplied covariance matrix Σ(θ_{0}), which is approximately correct, is in the neighborhood of Σ_{0}. So the null hypothesis of “exact fit” is replaced by the null hypothesis of “close fit”:
{H}_{0}:\mathit{\Sigma}\left({\mathit{\theta}}_{0}\right)\mathit{\Sigma}\left(\mathit{\theta}\right)<\mathit{\epsilon}=0.05\phantom{\rule{1em}{0ex}}\mathit{vs}.\phantom{\rule{1em}{0ex}}{\mathit{H}}_{1}:\mathit{\epsilon}>0.05
and the Root Mean Square Error of Approximation (RMSEA) measures the discrepancy ε for the fitted model:
\mathit{\text{RMSEA}}=\sqrt{max\left(0;{\mathit{\chi}}^{2}\mathit{d}\right)/\mathit{d}\left(\mathit{n}1\right)}
Pvalues for RMSEA are set up from the noncentral χ^{2}(λ,d) distribution with noncentrality parameter, λ = (n1) × d × 0.05^{2} or from a resampling bootstrap distribution. The null hypothesis of close fit is not rejected if P > 0.05.
We also consider the Standardized RootMeansquare Residual (SRMR), one of the most used SEM fit indices. SRMR is a measure based on the differences between observed (s) values and the ones obtained from the model (σ) of the covariance matrix:
\mathit{\text{SRMR}}=\frac{{\sum}_{\mathit{j}=1}^{\mathit{p}}{\sum}_{\mathit{k}=\mathit{j}+1}^{\mathit{p}}{\left({\mathit{s}}_{\mathit{\text{jk}}}{\mathit{\sigma}}_{\mathit{\text{jk}}}\right)}^{2}/{\mathit{s}}_{\mathit{\text{jj}}}{\mathit{s}}_{\mathit{\text{kk}}}}{\mathit{p}\left(\mathit{p}+1\right)/2}
SRMR values <0.10 are assumed as an adequate fitting measure, whereas values <0.05 may be considered as a good fit [32].
Finally, the model refinement (stepd) is obtained adding new directed or bidirected edges to the initial model. This modification was needed considering that the initial model is only a simplified representation of the whole pathway. The criteria used for the refinement are based on the combination of three elements. First, the modification indexes (MI), that is an estimate of the decrease in the χ^{2}score statistic that would result by freeing each fixed (=0) parameter in the model; second, ztests (=parameter estimate/standard error) of the MLE; and finally, biological evidences obtained by STRING database [33] and by the existence of a direct path between the nodes that MI proposes to connect. The following heuristic stepwise strategy was used:
Heuristic stepwise procedure:
Input: list of the fixed (=0) parameters (paths and/or covariances) in the model.
Output: new free parameters (paths and/or covariances) in the model.
Steps:

1.
freeing just a single parameter (path coefficient or covariance) at a time, and these in turn are sorted in descending order of magnitude using MI;

2.
verify if the edge (path coefficient or covariance) to add is present in STRING or when the edge is a path coefficient, if it represents a direct path that connects the nodes in the pathway selected, and then add this new edge in the model;

3.
fit the model and if the new edge is statistically not significant (P > 0.05, onesided), using a z value (z<1.64), remove it and repeat step 12;

4.
STOP the selection procedure if the model achieves a non significant LRT (P > 0.05) or RMSEA (P of “close” fit > 0.05) or SRMR < 0.1, otherwise repeat step 13
Multiplegroup analysis
When data are observed from multiple subsamples, the representation of groups with “indicator variables”, considered as nodes, allows to recognize DEGs. Instead, “multiplegroup analysis” allows to identify differentially regulated genes (DRGs) across groups.
Specifically, define μ_{1}(θ) and Σ_{1}(θ) as the modelimplied mean vector and covariance matrix of group 1 (experimental group) respectively and μ_{2}(θ) and Σ_{2}(θ) as the corresponding moments of group 2 (control group). For each models, two omnibus tests are performed considering the two experimental conditions (groups), one for the differential expression genes (nodes) and the other for the strength of the edges. In the first case, the hypothesis to be tested is:
{H}_{0}:{\mathit{\mu}}_{1}\left(\mathit{\theta}\right)={\mathit{\mu}}_{2}\left(\mathit{\theta}\right)\phantom{\rule{1em}{0ex}}\mathit{vs}.\phantom{\rule{1em}{0ex}}{\mathit{H}}_{1}:{\mathit{\mu}}_{1}\left(\mathit{\theta}\right)\ne {\mathit{\mu}}_{2}\left(\mathit{\theta}\right)
while, in the second case, is:
{\mathit{H}}_{0}:{\mathit{\Sigma}}_{1}\left(\mathit{\theta}\right)={\mathit{\Sigma}}_{2}\left(\mathit{\theta}\right)\phantom{\rule{1em}{0ex}}\mathit{\text{vs}}.\phantom{\rule{1em}{0ex}}{\mathit{H}}_{1}:{\mathit{\Sigma}}_{1}\left(\mathit{\theta}\right)\ne {\mathit{\Sigma}}_{2}\left(\mathit{\theta}\right)
In the “null” model (H_{0}), the mean or covariance estimates are constrained to be equal across groups; in the “alternative” model (H_{1}), they are allowed to differ across groups. The statistical significance is determined by comparison of LRT chisquare (χ^{2}diff) values at a given degree of freedom (d.f. diff). If there is a significant difference (P < 0.05) in the chisquared goodnessoffit index, the groups differ significantly for one or more specific gene expression (nodes) and/or generelationships (edges). Finally, three pathcoefficient differences are screened: 1) “up/down” expression (gene nodes), testing the “zero value” for the group indicator variable (C = experimental =1, and C = control = 0) path coefficients; 2) “up/down” regulation (gene edges), testing the “zero value” for the differences of path coefficients across groups; 3) “on/off” regulation (gene edges) with respect to a priori KEGG gene regulation target, testing the “zero value” of the edge coefficients across groups.
Specifically, assume C to be the path coefficient matrix of the group indicator variables and c_{
i
} be an element of the matrix C. Let B_{1} and B_{2} to be the corresponding path coefficient matrices in the experimental and control groups; D = B_{1} – B_{2} and d_{
ij
} be an element of the matrix D. We consider the test statistics:
\begin{array}{l}{\mathit{t}}_{\mathit{C}}={\mathit{c}}_{\mathit{i}}/\mathit{\text{SE}}\left({\mathit{c}}_{\mathit{i}}\right)\phantom{\rule{1.5em}{0ex}}\text{and}\phantom{\rule{1em}{0ex}}{\mathit{t}}_{\mathit{D}}={\mathit{d}}_{\mathit{\text{ij}}}/\mathit{\text{SE}}\left({\mathit{d}}_{\mathit{\text{ij}}}\right)\\ {\mathit{t}}_{1}={\mathit{b}}_{\mathit{\text{ij}}1}/\mathit{\text{SE}}\left({\mathit{b}}_{\mathit{\text{ij}}1}\right)\phantom{\rule{0.75em}{0ex}}\text{and}\phantom{\rule{1em}{0ex}}{\mathit{t}}_{2}={\mathit{b}}_{\mathit{\text{ij}}12}/\mathit{\text{SE}}\left({\mathit{b}}_{\mathit{\text{ij}}2}\right)\end{array}
where SE( ) is the estimated standard error of the parameters. The statistic t_{C} can be used to test the conditional “up/down” expression level difference of one gene between groups, given the parents of the gene in the network. Similarly, t_{
D
} checks the conditional “up/down” expression regulatory differences of one gene on another between groups. Moreover, t_{1} and t_{2} check the “on/off” regulatory differences compared to a priori KEGG pathway. The Pvalues of these statistics (twosided, for t_{C} and t_{D} and onesided, for t_{1} and t_{2}) are derived either asymptotically from the N(0,1)distribution or empirically from the nonparametricbased or using modelbased bootstrap distribution with B bootstrap samples (usually, B = 100, or 1000).
Note that the marginal bivariate test of DEGs with SAM approach can be regarded as the special case of the conditional test with t_{C,} when the pathway graph is G = (Y, ∅), so pa(y) = ∅ for all genes ∈Y.
We use the implementation provided by the lavaan [34] R package for estimation, evaluation, and modification of SEM data analysis, and R codes is available in Additional file 1.