Illustration of the proposed pipeline
The pipeline, similar to that described by [12], is illustrated in Figure 1. In step-1, DEGs were obtained by Significance Analysis of Microarray (SAM) [16] and the perturbed pathways by Signaling Pathway Impact Analysis (SPIA) [17] using KEGG database [18]; in step-2, the pathway models were generated by network analysis and evaluated with SEM in step-3 for: 1) improving the models generated by the biological pathways found; 2) testing if the pathway models differ across groups by multiple group analysis; 3) screening of single differences in expression (gene nodes) and in regulation (gene-gene edges) across groups. We used the implementation provided by the R packages samr [19] and SPIA [20] for SAM/SPIA procedures.
Structural equations models (SEM)
SEM is a statistical procedure for confirmatory causal inference originated from path analysis proposed in 1921 by the American geneticist Sewall Wright [21]. It is based on multivariate linear regression equations, where the response variable in one regression equation may appear as a predictor in another equation. Indeed, variables may influence one-another reciprocally, either directly or through other variables as intermediaries. Additionally, correlated or uncorrelated unmeasured variables may indicate the presence of unobserved factors that influences observed variables.
In general, a SEM consists of a structural model describing (causal) relationships among latent (hidden) variables and a measurement model describing the relationships between the observed measurements and the underlying latent variables [22]. Here we consider SEM with observed variables only, and therefore no measurement models have been used [23]. Specifically, let V to be the index set of the Y observed variables, represented as the “parent” set {pa(i)|i ∈ V}, i.e. the explanatory variables of Y
i
, or as the “siblings” set {sib(i)|i ∈ V}, i.e. the unmeasured linked variables with Y
i
, respectively. These sets determine a system of linear equations:
and a covariance structure:
The system of linear equations affirms that every node is characterized by the relationships with his parents, while the covariance structure describes the relationships between unobserved nodes.
They encode two distinct causal assumptions: (1) a “weak” assumption on the possible existence of (direct) casual influences of explanatory variables on Y
i
, and (bi-directed) correlated unmeasured variables U
i
, quantified by the regression (path) coefficients β
ij
, and the covariances ψ
ij
, respectively; and 2) a “strong” assumption based on the absence of (direct) causal influences or (bi-directed) correlations of any observed/unobserved variables neither in the “parents” set pa(i) nor in the “siblings” set sib(i). In other terms, a weak assumption excludes some values for a parameter (the null value zero), but permits a range of other values; while, strong assumptions assume that parameters take specific values (null value zero or a fixed a priori value). The linear equations and the covariance structure can be encoded and visualized in a “path diagram”, that is, a mixed graph G = (V,E) featuring both directed (→) and bi-directed (↔) edges. The vertex set V includes the genes and the edge set E represents relations or reactions among vertexes. The “activity” of a given gene is embedded in a path diagram: the actions performed by a gene on downstream molecules, and the signals that it receives from upstream regulators. “Directed edges” between two genes (j→i, if and only if j ∈ pa(i)), measured by path coefficients (ranging usually from -1 to 1, if genes are standardized), represent expected change in the activity of the downstream gene, given a unit of change in the upstream gene while the values of the other genes remain constant. Considering that paths reflect a direct influence of one gene on another, negative path coefficients indicate ensemble inhibition (negative control) and positive paths measure net activation (positive control). “Bi-directed edges” between two genes (j↔i if and only if j ∈ sib(i), or equivalently, if and only if i ∈ sib(j)) encode a hidden common cause that may be interpreted as latent or unobserved measurement of upstream regulators that could account for the observed covariances (correlations) between the two genes.
One important feature of SEM is that direct and indirect effects can be computed and compared. “Directed paths” between two genes are the sequence of all the directed edges (j → k1,… → …, km → i) from genes Yj to Yi. Each directed path is a channel along which information (gene’s activities) can flow, and so a “total effect” (TE) of gene Yj on gene Yi is defined as the total sum of the products of the sequence of arrows (edges) along all directed paths from Yj to Yi. Accordingly, a “indirect effect” (IE) of gene Yj on gene Yi represents the portion of the total effects not considering the directed edge effect (DE), i.e. TE = DE + IE.
The well-known SEM analysis consists of four steps [22]: a) definition and identification of an initial path model, b) estimation of parameters, c) evaluation of the fitting, and d) model modification.
Initial model building
Specification of initial pathway models (step a) was obtained taking the perturbed pathways and converting them in directed graphs or gene networks. Generalizing, each pathway can be seen as a mixed graph. The idea is to understand how DEGs are connected in the perturbed pathways by other microarray genes. A natural way to solve this problem is to identify the shortest paths (geodesic distance) between DEGs. The geodesic distance dgeo(yi, yj) between two DEGs, yi and yj, is defined as the minimum distance between these two genes. The function get.shortest.paths( ) of the R package igraph was used to compute all the shortest paths [24]. Define the microarray genes, DEGs, and not DEGs in the following way: MG = {mg
1
, mg
2
, …, mg
m
}; DEG = {deg
1
, deg
2
,…, deg
n
} and NDEG = {ndeg
1
, ndeg
2
,…,ndeg
m-n
}, where MG = DEG ∪ NDEG and DEG ∩ NDEG = {∅}. Each shortest path could be represented as a list of nodes Y
k
= (y
i
, y
i+1
,..., y
j-1,
y
j
) and a list of the corresponding edges Ek = (ei(i+1), ..., e(j-1)j) where (y
i
, y
j
) ∈ DEG; (y
i+1
,…, y
j-1
) ∈ (DEG ∨ NDEG); Yk ⊆ Y and Ek ⊆ E. The shortest paths for each pathway constitute k (k = 1, …, K) subgraphs Gk = {Yk,Ek} of the original pathway, G = {Y, E}. Not all DEGs and NDEGs will be included in the shortest paths. Therefore, we define two new sets: DEG(s) and NDEG(s) respectively, the sets of DEGs and the set of not DEGs that include all genes in shortest paths, where DEG(s) ⊆ DEG and NDEG(s) ⊆ NDEG.
To reach the final model the NDEG(s) that connect DEG(s) are grouped in basis to their PIR superfamily (PIRSF). Based on the evolutionary relationships of whole proteins, this classification system allows annotation of both specific biological and generic biochemical functions. The PIRSF can be represented as SUPF = {supf
1
, supf
2
,…, supf
g
}, where ∀ supf
i
⊆ NDEG(s) and ∀i≠j supf
i
∩ supf
j
= {∅}. Using this information, each original shorthest path Gk = {Yk,Ek} is transformed in a new shortest paths, G*k = {Y*k = (y*i, y*i+1,…, y*j-1, y*j), E*k = (e*i(i+1),…, e*(j-1)j) }, where (y*i+1,…, y*j-1) ∈ (DEG(s) ∨ SUPF ∨ NDEG(s)) and E* ⊆ E. The function to obtain the final graph, G* = (Y*,E*), is described in the following pseudo-code:
The graph G* = (Y*,E*) is the fusion of all shortest paths found, where each node and each edge cannot be present more than once, the self-loops are not considered but the feed-backs and cycles were preserved. To ensure the identification of the initial models, the “block-recursive” criterion of Rigdon [25] and the “bow free” criterion of Brito and Pearl [26] were applied. The first affirms that reciprocal relationships, feedback loops, or covariances are segregated into groups, or blocks, with no more than two equations per block. The second affirms that a model is ensured if variables standing in direct causal relationships (directed edges) do not have correlated errors (bi-directed edges). So a new graph is attained in which the DEGs are connected by other DEGs, PIRSFs or NDEGs. In this way a model was created for each significant pathway found.
Successively, PIRSFs composite variables are defined considering not DEGs, present in shortest paths, as causal indicators of latent (hidden) constructs [27]. To generate the PIRSFs, a principal component analysis (PCA) was performed on genes belonging to a PIRSF and the principal component scores of the first principal component (PC1) were considered as the values that characterize the PIRSF. Only PIRSFs for which the PC1 represents 50% or more of the total variance are considered. At the end of process we have the initial SEM model.
The pathway graph conversion, the graph analysis, and the PC1 scores are obtained by graphite [28], igraph [24] and stats [29] R packages, respectively, while R functions for network analysis are implemented ad hoc, and are available Additional file 1.
SEM fitting
For parameter estimation (step-b), the classic derivation of the Maximum Likelihood estimation (MLE) is used, that assumes all observed variables are jointly Gaussian. The system of structural equations and covariance structure of unmeasured variables can be written compactly in a matrix form as: Y = BY + U, and Cov(U) = Ψ. This specification induces a structure on the covariance matrix of the joint distribution of the genes Y as:
where θ=(β; ψ) is the list of the free parameters in the model of dimension t. The unknown parameters are estimated so that the implied covariance matrix Σ(θ) is close to the observed sample covariance matrix S.
The assessment of the model (step-c) involves the Likelihood Ratio test (LRT) converted to a Chi-square test of the fitted model. Specifically, let Σ0 = E(S) to be the true population covariance matrix, and Σ(θ) the model-implied covariance matrix. The hypothesis to be tested is:
The chi-square test is then χ2 = -2logLRT = -2[logL(Σ(θ)) - logL(Σ0)] with d = p(p + 1)/2-t degree of freedom (d.f.). logL( ) represents the log-likelihood of the model, p the number of genes, t the number of parameters of the fitted model. Not-significant P-values (P >0.05) indicate that the model provides a good fit to the data. The P-values are derived by using the χ2(d) distribution or a resampling bootstrap distribution [30].
An alternative procedure [31] assumes that in the population, a model-implied covariance matrix Σ(θ0), which is approximately correct, is in the neighborhood of Σ0. So the null hypothesis of “exact fit” is replaced by the null hypothesis of “close fit”:
and the Root Mean Square Error of Approximation (RMSEA) measures the discrepancy ε for the fitted model:
P-values for RMSEA are set up from the non-central χ2(λ,d) distribution with non-centrality parameter, λ = (n-1) × d × 0.052 or from a resampling bootstrap distribution. The null hypothesis of close fit is not rejected if P > 0.05.
We also consider the Standardized Root-Mean-square Residual (SRMR), one of the most used SEM fit indices. SRMR is a measure based on the differences between observed (s) values and the ones obtained from the model (σ) of the covariance matrix:
SRMR values <0.10 are assumed as an adequate fitting measure, whereas values <0.05 may be considered as a good fit [32].
Finally, the model refinement (step-d) is obtained adding new directed or bi-directed edges to the initial model. This modification was needed considering that the initial model is only a simplified representation of the whole pathway. The criteria used for the refinement are based on the combination of three elements. First, the modification indexes (MI), that is an estimate of the decrease in the χ2-score statistic that would result by freeing each fixed (=0) parameter in the model; second, z-tests (=parameter estimate/standard error) of the MLE; and finally, biological evidences obtained by STRING database [33] and by the existence of a direct path between the nodes that MI proposes to connect. The following heuristic stepwise strategy was used:
Heuristic stepwise procedure:
Input: list of the fixed (=0) parameters (paths and/or covariances) in the model.
Output: new free parameters (paths and/or covariances) in the model.
Steps:
-
1.
freeing just a single parameter (path coefficient or covariance) at a time, and these in turn are sorted in descending order of magnitude using MI;
-
2.
verify if the edge (path coefficient or covariance) to add is present in STRING or when the edge is a path coefficient, if it represents a direct path that connects the nodes in the pathway selected, and then add this new edge in the model;
-
3.
fit the model and if the new edge is statistically not significant (P > 0.05, one-sided), using a z value (z<|1.64|), remove it and repeat step 1-2;
-
4.
STOP the selection procedure if the model achieves a non significant LRT (P > 0.05) or RMSEA (P of “close” fit > 0.05) or SRMR < 0.1, otherwise repeat step 1-3
Multiple-group analysis
When data are observed from multiple subsamples, the representation of groups with “indicator variables”, considered as nodes, allows to recognize DEGs. Instead, “multiple-group analysis” allows to identify differentially regulated genes (DRGs) across groups.
Specifically, define μ1(θ) and Σ1(θ) as the model-implied mean vector and covariance matrix of group 1 (experimental group) respectively and μ2(θ) and Σ2(θ) as the corresponding moments of group 2 (control group). For each models, two omnibus tests are performed considering the two experimental conditions (groups), one for the differential expression genes (nodes) and the other for the strength of the edges. In the first case, the hypothesis to be tested is:
while, in the second case, is:
In the “null” model (H0), the mean or covariance estimates are constrained to be equal across groups; in the “alternative” model (H1), they are allowed to differ across groups. The statistical significance is determined by comparison of LRT chi-square (χ2diff) values at a given degree of freedom (d.f. diff). If there is a significant difference (P < 0.05) in the chi-squared goodness-of-fit index, the groups differ significantly for one or more specific gene expression (nodes) and/or gene-relationships (edges). Finally, three path-coefficient differences are screened: 1) “up/down” expression (gene nodes), testing the “zero value” for the group indicator variable (C = experimental =1, and C = control = 0) path coefficients; 2) “up/down” regulation (gene edges), testing the “zero value” for the differences of path coefficients across groups; 3) “on/off” regulation (gene edges) with respect to a priori KEGG gene regulation target, testing the “zero value” of the edge coefficients across groups.
Specifically, assume C to be the path coefficient matrix of the group indicator variables and c
i
be an element of the matrix C. Let B1 and B2 to be the corresponding path coefficient matrices in the experimental and control groups; D = B1 – B2 and d
ij
be an element of the matrix D. We consider the test statistics:
where SE( ) is the estimated standard error of the parameters. The statistic tC can be used to test the conditional “up/down” expression level difference of one gene between groups, given the parents of the gene in the network. Similarly, t
D
checks the conditional “up/down” expression regulatory differences of one gene on another between groups. Moreover, t1 and t2 check the “on/off” regulatory differences compared to a priori KEGG pathway. The P-values of these statistics (two-sided, for tC and tD and one-sided, for t1 and t2) are derived either asymptotically from the N(0,1)-distribution or empirically from the nonparametric-based or using model-based bootstrap distribution with B bootstrap samples (usually, B = 100, or 1000).
Note that the marginal bivariate test of DEGs with SAM approach can be regarded as the special case of the conditional test with tC, when the pathway graph is G = (Y, ∅), so pa(y) = ∅ for all genes ∈Y.
We use the implementation provided by the lavaan [34] R package for estimation, evaluation, and modification of SEM data analysis, and R codes is available in Additional file 1.