 Research article
 Open Access
 Published:
Inferring transcriptional compensation interactions in yeast via stepwise structure equation modeling
BMC Bioinformatics volume 9, Article number: 134 (2008)
Abstract
Background
With the abundant information produced by microarray technology, various approaches have been proposed to infer transcriptional regulatory networks. However, few approaches have studied subtle and indirect interaction such as genetic compensation, the existence of which is widely recognized although its mechanism has yet to be clarified. Furthermore, when inferring gene networks most models include only observed variables whereas latent factors, such as proteins and mRNA degradation that are not measured by microarrays, do participate in networks in reality.
Results
Motivated by inferring transcriptional compensation (TC) interactions in yeast, a stepwise structural equation modeling algorithm (SSEM) is developed. In addition to observed variables, SSEM also incorporates hidden variables to capture interactions (or regulations) from latent factors. Simulated gene networks are used to determine with which of six possible model selection criteria (MSC) SSEM works best. SSEM with Bayesian information criterion (BIC) results in the highest true positive rates, the largest percentage of correctly predicted interactions from all existing interactions, and the highest true negative (nonexisting interactions) rates. Next, we apply SSEM using real microarray data to infer TC interactions among (1) small groups of genes that are synthetic sick or lethal (SSL) to SGS1, and (2) a group of SSL pairs of 51 yeast genes involved in DNA synthesis and repair that are of interest. For (1), SSEM with BIC is shown to outperform three Bayesian network algorithms and a multivariate autoregressive model, checked against the results of qRTPCR experiments. The predictions for (2) are shown to coincide with several known pathways of Sgs1 and its partners that are involved in DNA replication, recombination and repair. In addition, experimentally testable interactions of Rad27 are predicted.
Conclusion
SSEM is a useful tool for inferring genetic networks, and the results reinforce the possibility of predicting pathways of protein complexes via genetic interactions.
Background
While the existence of genetic compensation is widely accepted, the mechanism is largely unknown but important [1, 2]. The proposed algorithm (SSEM) was motivated by inferring transcriptional compensation (TC) networks of SGS1 (or RAD27) and its synthetic sick or lethal (SSL) partners [3, 4]. However, SSEM can also be applied to infer other types of networks, such as transcriptional regulatory networks. Following a gene's loss, the expression level of its compensatory gene increases (decreases), this phenomenon is called TC (transcriptional diminishment, abbreviated as TD). Paralogs or redundant genes are called digenic SSL gene pairs if the combination of two mutants, neither by itself lethal, causes the organism to die or malfunction [3, 5, 6]. SSL effects underlie many complex human diseases, such as type II diabetes, schizophrenia, Alzheimer's disease, and others [4]. Since genetic networks derived from model organisms, such as yeast, are likely to be conserved in humans the prediction of TC and TD may shed light on pathways that cause complex human diseases. With the abundant information produced by microarray technology, various approaches have been proposed to infer genetic networks or transcriptional regulatory networks. Most of them may be classified into three classes, namely, graph models, discrete variable models and continuous variable models. Due to space limits, we refer to [7] (in Additional file 1) for a thorough review of the models.
Graph models (for instance, [8]) depict genetic interactions through directed graphs or digraphs instead of characterizing the interactions quantitatively. Some graph models simply reveal structural information, others annotate the directions and signs of the regulations among genes. Because of their simplicity, graph models usually require much less data than models in the other two categories. But they are inherently static and may not capture the dynamics of genetic regulations and the simultaneous regulation of a given gene by multiple genes. Discrete variable models discretize gene expressions into a few states. The dynamics of gene expressions may be perceived as transitions of finite states. Typical discrete variable models proposed are Boolean networks, probabilistic Boolean networks and discrete Bayesian networks (for details, see a classic paper [9]). Continuous variable models characterize the expression of a gene or its change by a linear or nonlinear continuous function of the expression of other genes. The genetic interactions are frequently modeled by a firstorder or a secondorder differential (or difference) equation. Continuous variable models consist of two major types: continuous Bayesian networks [10–12] and deterministic differential systems [13].
Although each class of models has been shown to be informative for understanding genetic interactions, most of the models, except some Bayesian networks, have the estimation bias problem due to model misspecification. The model misspecification arises from the fact that microarrays measure the mRNA expressions only, while genetic interactions may be influenced by enzymes or proteins, for instance transcriptional factors. Furthermore, most genetic networks reconstructed in previous studies considered only a subset of the whole genome. Consequently, those genes that were left out may be regarded as latent factors influencing the genes of interest. Thus, ignoring latent factors in the models may cause bias when inferring the genetic interactions. Although a Bayesian network can also incorporate latent factors [11, 12], the amount of data required may prevent it from being used.
To account for the latent factors effect using a reasonable amount of microarray data, a stepwise structural equation modeling algorithm (SSEM) is proposed in this article. SSEM is based on structural equation modeling (SEM) [14], which unifies factor analysis and path analysis. Assuming linear relations among the observed and latent variables, the basic idea of SEM is to minimize the discrepancy between the fitted covariance matrix and the sample covariance matrix. Zhou et al. [15] used shortest path analysis to identify transitive genes between two given genes in the same biological pathway. Xie and Bentler [16] showed that the latent factors can be identified and their relations may be estimated reasonably by SEM. Note that without identifying the latent factors reasonably, the causal relations among genes can not be estimated correctly.
In this article, we extend the model in [16] to simultaneously infer both latent factorgene and genegene interactions. Both [16] and SSEM extend the methodology of SEM in the sense that the latent factors are formed from data and not chosen a priori from domain knowledge as commonly practiced in the social sciences. SSEM learns genetic interactions by both exploratory factor analysis (EFA) and SEM with various model selection criteria (MSC) in a stepwise fashion. The incorporation of MSC helps SSEM circumvent the overfitting problem. The performance of SSEM with six different MSC is evaluated using two sets of simulated networks to determine which MSC works best. The software SSEM automatically runs through all of the steps of SSEM, and outputs predicted gene interactions. Finally, SSEM is applied to infer TC and TD interactions for (1) small groups of genes that are synthetic sick or lethal (SSL) to SGS1, and (2) SGS1 or RAD27 with their SSL partners from 51 genes involved in yeast DNA synthesis and repair that are of interest. Both predictions are verified by an extensive quantitative RTpolymerase chain reaction experiment (qRTPCR); see Additional file 2 for details.
Results and discussion
The MSC suitable for predicting genetic networks remains unknown, while an adequate MSC can prevent the algorithm from overfitting. Thus we have carried out an extensive simulation to evaluate eight criteria used in commercial SEM software, such as Mplus version 3 [17]. The results of the top six MSC χ^{2}/df, χ^{2}df, Mean square error (MSE), Akaike information criterion (AIC), Bayesian information criterion (BIC), and adjBIC are reported in Additional file 3. SSEM with BIC outperforms all of the others. Since the network topology, latent factors (x(t)), genegene interactions (w), and latent factorgene regulations (Λ) are well defined for the simulated data, exact quantitative performance can be computed.
Results of SSEM with various MSC using simulated data
Time course data from 6gene and 10gene regulatory networks with two latent factors are generated. The simulation consists of various sample sizes and noise levels. Let x_{ i }(t), y_{ i }(t) and ε_{ i }(t) denote the expression level of latent factor i, gene i and noise variable i, respectively. The linear dynamic factor model (LDFM, see Section 4.1 for general model setting) to generate the 6gene network is as follows:
where x_{1}(t) ~ N(0,0.1), x_{2}(t) ~ N(0,0.1), y_{ i }(0) ~ U(0,1), and ε_{ i }(t) ~ N (0, {\sigma}_{i}^{2}), i = 1,...,6. Note that {\sigma}_{i}^{2} is determined by the variance of y_{ i }(t) and a prespecified noise level. The noise level is quantified by a contrasttonoise ratio (CNR), defined as the ratio of the signal standard deviation to the noise standard deviation. CNR = 1.3 or 2.0 corresponds to high or medium noise levels, respectively. For the 10gene network, we refer to Equation (5) of Simulation.pdf of the Supplementary data.
Note that both the 6 and 10gene networks are sparse, which roughly follow the sparse property of cisregulatory networks [18]. For each network, time course data are simulated under various conditions; sample sizes (T =T_{min},50 or 100) and noise levels, where T is the number of time points and T_{min} = 2n + 1. Without incorporating any biological knowledge, for a fully connected ngene network (namely all interactions are nonvanished), T = 2n+1 is the minimum number of time points required (denoted as T_{min}) for proper estimation of \widehat{\Sigma} in (4), and hence for all parameters in the model. However, the latest version of SSEM can be iterated from a nonfully connected network, and hence the restriction T = 2n+1 no longer exists. Table 1 summarizes the performance of SSEM with AIC and BIC for the 6gene network under various settings of (CNR, T). The averages of the true positive rate (TPR), true negative rate (TNR), and false positive rate (FPR) for the top 1 (top 5) networks, in terms of MSC value, in 100 experiments are reported. TPR (also known as sensitivity) is the percentage of correctly predicted links from the total existing links (positives) in the simulated network. Likewise, TNR (specificity) is the percentage of correctly predicted nonexisting links (negatives) out of the total nonexisting links in the simulated network. Clearly, SSEM with BIC outperforms SSEM with AIC, and the results from the 10gene network also confirm this.
We further compared SSEM with BIC to VBEM [12] using the 6 and 10gene networks in Simulation.pdf. The results are in Table 2, and SSEM with BIC outperforms VBEM in terms of TPRs for both networks; details are provided in Simulation.pdf.
Results on real time course microarray data
In this section, SSEM is first applied to infer TC/TD interactions for small groups of genes SSL to SGS1, for example CSM3, MUS81, SIS2, SWE1and TOP1 in [3]. Next, SSEM infers TC/TD interactions from SGS1 or RAD27 SSL gene pairs, formed from 51 genes involved in yeast DNA synthesis and repair. SGS1 encodes a RecQ DNA helicase, of which the homologues in human cells include the WRN, BLM and RECQ4 genes. Mutation of the SGS1 gene results in premature aging in yeast mother cells as well as genome instability. Further, these genes and their processes are highly conserved in eukaryotic cells, and mutations in these genes may lead to cancerpredisposition syndromes and symptoms resembling premature aging [4]. On the other hand, Rad27 encodes a structurespecific (5'flap) endonuclease which has a human homolog, FEN1; Rad27 has a distinct role in processing Okazaki fragments during DNA synthesis in the S phase. Deletion of RAD27 in cells also causes hypersensitivities to various DNA damaging agents [19]. Rad27 was shown to be necessary for maintaining genome stability by restricting DNA recombination between short repeated sequences and processing longpatch base excision repair [20–23].
cDNA microarray data from the alpha, cdc15 and cdc28 experiments in [24] were applied to the four algorithms to infer the gene network of interest. The elu data set was not included because it was synchronized differently from alpha, cdc15 and cdc28. The experiment and control groups were mRNAs extracted from synchronized and nonsynchronized yeast cultures, respectively. The synchronization was conducted by treating yeast cultures with alpha factor arrest and arrests of a temperaturesensitive mutant cdc15 and mutant cdc28. A full description and complete data sets are available at [25]. The red (R) and green (G) fluorescence intensities were measured from the mRNA abundance in the experiment group and control group, respectively. There were 18, 24 and 17 time points in the alpha, cdc15 and cdc28 data sets with no replicates; we first aggregated these three datasets to increase the number of time points to 59. This aggregation was applied in [16], and it resulted in some meaningful gene networks.
Log ratios of the six genes' expression levels were fitted to SSEM with BIC, VBEM [12], MAPEM [26] and LDS [11] algorithms. The results were checked against qRTPCR results (see Figure 1 in Additional file 4). Excluding latent factorgene interactions, the modified true positive rate (mTPR) of the top model selected by SSEM with BIC equals 7/12. While the mTPRs of VBEM, MAPEM and LDS equal 2/12 (at 99% significance level), 6/12 and 0, respectively. Fitting five genes' expression to a multivariate AR(1) model resulted in 0/12 mTPR; see Additional file 5 for details. This shows how latent factors improve the estimation of gene interactions \widehat{W} and thus mediate proper extraction of biological knowledge. We also ran SSEM when the sample size was small (T = 11) for the 6gene network, and the mTPR of the top model predicted by SSEM equaled 7/12. For this application, SSEM took about 19 minutes on PCs with Pentium IV 3.4 GHz and 2.5 GB RAM.
Next, SSEM was applied to infer TC/TD interactions among the 51 genes that are SSL to SGS1 or RAD27. Our collaborator has conducted extensive qRTPCR experiments (in Additional file 6) to verify that among these predictions, SSEM successfully uncovered several TC/TD interactions of SGS1 with genes involved in DNA replication (e.g., SRS2, PLO32, RNR1, SLX1, MUS81 and TOP1), DNA repair (e.g., RAD51 and RAD52), checkpoint arrest (RAD9) and chromosome segregation (CSM3). These genetic interactions are consistent with the following experimental results from published literature. Sgs1 and Srs2 are known redundant pathways in replication [27, 28]; for instance, Srs2 and Sgs1Top3 suppress crossovers during double strand break repair in yeast. Further, defects in RAD51 and other homologous recombination genes suppressed synthetic lethality/sickness of the double mutant sgs1Δ srs2Δ. Slx1Slx4 was found to be a second structure specific endonuclease functionally redundant with Sgs1Top3 in [29]. The Sgs1/Top3/Rmi1 and Mus81/Mms4 complexes are involved in both doublestrand break repair and homologous recombination [30]. This indicates that Sgs1/Top3/Rmi1 and Mus81/Mms4 are alternative pathways to resolve recombination intermediates. [31] identified that Sgs1 participated in a RAD52dependent recombination pathway. [32] found that Rad9 and Sgs1 interacted genetically and possibly physically. Cells lacking Sgs1 frequently arrest as largebudded cells with a single nucleus in the mother cell, or "stuck" between mother and daughter cells, which resulted in missegregation during mitosis [33, 34], whereas Csm3 is required for DNA replication checkpoint and accurate chromosome segregation. Similarly, SSEM was applied to predict the interactions of the fifteen SSL pairs of RAD27, and among them HPR5, SGS1, MUS81, ESC2, HST1, HST3 and CSM3 had TC interactions with RAD27, whereas RAD52, HPR5, SIS2, SOD2, HPC2, LYS7, RAD9, RAD51 and RAD54 had TD interactions with RAD27. For the second application, SSEM took about 3 to 4 hours on PCs with Pentium IV 3.0 GHz and 1 GB RAM.
The results involving SGS1 reinforce the possibility of applying genetic interactions to predict pathways of protein complexes [35]. The predictions of RAD27 are intriguing to biologists since biological experiments to screen all possible interactions have been prohibitive thus far. Note that SSEM can also be applied to infer TC interactions of 872 SSL gene pairs in [3, 4] or other large networks with a similar structure, for instance the other six groups of SSL pairs involving ARC40, ARP2, BBC1, BIM1, BNI1 and NBP2. The large network of 887 SSL pairs can be broken down to subgroups that center on SGS1, RAD27, the above six genes, and other hub genes. Then each subgroup can be inferred individually, similarly to the group involving SGS1.
Conclusion
The novelties and merits of SSEM are as follows. First, SSEM expands the scope of application of most algorithms in the area of gene networks. Specifically, SSEM is shown to predict several TC/TD interactions of SGS1 accurately, verified by qRTPCR experiments, and these interactions coincide with existing pathways. Further, SSEM predicts a few novel TC/TD interactions involving RAD27, and these predictions can be verified by biological experiments. Importantly, SSEM can be further applied to predict genetic interactions of other large networks with a similar structure, while biological experiments to screen all possible interactions may be prohibitive. Second, SSEM extends the approach in [16] such that it can infer both latent factorgene and genegene interactions simultaneously. Third, SSEM incorporates an MSC in a stepwise fashion to circumvent the overfitting problem. Although SSEM was shown to infer genetic networks using time course data with no replicates, it can also be applied to short time course data with replicates by modifying the terms involved in replicates and the sample size. As technology advances, we anticipate more data sets with replicates to become available and a greater demand for algorithms like SSEM to infer gene networks.
Using SSEM with the model in Equation (2) has been shown to outperform fitting a multivariate autoregressive model straightforwardly. This demonstrates the important role of latent factors and the efficiency of SSEM. Further, SSEM outperforms three Bayesian network algorithms that impose linear models on latent factors, while SSEM does not assume any structure on latent factors. However, SSEM shares one drawback with continuous Bayesian networks. Both approaches assume that the vector of log ratios of gene expression y(t) follows a multivariate normal distribution. This assumption may limit its application, although log ratios of gene expression do follow a normal distribution in most cases.
Although SSEM may serve as an exploratory tool for genetic interactions, the model in (2) is an approximation to the true model, and BIC is a largesample result. Further improvements for future research include finding a novel MSC for SSEM when the sample size is small, and developing a nonlinear model with latent factors or a lagk and k > 1 in time to model genetic interactions. The goal of SSEM is to model small to medium networks with precise prediction instead of modeling large or genomewide networks with inaccurate prediction. Some results on incorporating various types of data, e.g. motif information, and ChIPchip data besides microarray data, to predict transcriptional modules have been explored in the literature [36–38]. However, integrating various types of data for reliable prediction of complex genetic networks remains a challenging topic, and we leave this for future research.
Methods
The linear dynamic factor model
We assume that time course microarray data follow an LDFM, which includes both factorgene and genegene regulation in the model. Let {\tilde{y}}_{i}(t) denote the expression of gene i at time t for 1 ≤ i ≤ n, where n is the number of genes in the network. Further, let y_{ i }(t) be the centered {\tilde{y}}_{i}(t), namely {y}_{i}(t)={\tilde{y}}_{i}(t){\overline{\tilde{y}}}_{i}, where {\overline{\tilde{y}}}_{i} is the mean of {\tilde{y}}_{i}(t) over time. We incorporate centered variables to avoid an intercept term in Equation (2) to reduce n parameters that are not of interest. Specifically, LDFM assumes that y_{ i }(t) is regulated by a linear combination of latent factors at time t and centered observed variables (genes) at time (t  1), and the regulation is invariant across time as follows.y(t) = Λx(t) + Wy(t  1) + ε(t),
where y(t) is the vector of the expression levels of the n genes at time t, x(t) is the (k × 1) vector of the latent factors' expression at time t, and ε(t) is the (n × 1) noise vector that assumes N_{ n }(\stackrel{\rightharpoonup}{0}, Q), where Q is a diagonal covariance matrix. Further, Λ is the (n × k) latent interaction matrix, in which λ_{ ij }denotes the influence of latent factor j on gene i at the same time, and w is the genegene interaction matrix, in which W_{ ij }denotes the influence of gene j at time (t  1) on gene i' s expression at time t. Latent factors x(t) are assumed to follow N_{ k }(\stackrel{\rightharpoonup}{0}, Σ_{ k }), and x(t) and ε(t) are uncorrelated such that the model is identifiable. Applying biological knowledge, SSEM can infer sufficiently large networks. For example, when inferring TC interactions from SSL pairs, interactions (W_{ ij }'s) are nonvanished only for SSL pairs. For instance, when predicting TC interactions of SSL gene pairs involving SGS1, fitting one equation SGS1(t)={\Sigma}_{i=1}^{k}{\lambda}_{i}{F}_{i}+{\Sigma}_{i=1}^{23}{W}_{i}{y}_{i}(t1) is sufficient, where y_{ i }'s are the twentythree genes that are SSL to SGS1 [3, 4], and the other W_{ j }'s are vanished for gene j that is not SSL to SGS1. The aforementioned equation can be inputed into the latest version of SSEM as an initial network, and when no links are specified to be deleted in the iteration, SSEM will predict genegene interactions for the nonvanished W_{ j }'s, and infer the factorgene interactions from data. Note that when inferring transcriptional regulatory networks, Equation (2) is also able to model the combination of multiple genes to activate (or repress) a target gene simultaneously. The major difference between LDFM and the state space model (SSM), for example the model in [12], is that the former does not model interactions among latent factors across time.
SEM is adopted since it considers latent factorgene and genegene interactions simultaneously to reveal gene networks using microarray data. An SSEM algorithm is introduced to learn the parameters Λ and W in LDFM. The main idea is to learn the regulation network iteratively. In each iteration, for a generated network, we estimate the parameters by SEM and evaluate its goodnessoffit. The top few networks of each iteration are retained for the next iteration, until the optimal network, in terms of any MSC, emerges. SSEM is available to users upon request from the corresponding author.
SSEM consists of three parts. Specifically, in Part 1, EFA is applied to learn some initial latent structures, which specify latent factorgene interaction. In Part 2, networks consist of any given initial latent structure and (randomly generated) partially connected genegene interactions are considered. SEM is applied to estimate x(t), Λ and w of any network considered, and a specified MSC evaluates the goodnessoffit of the network. In Part 3, plausible networks are generated by systematically and iteratively eliminating insignificant links (interactions) based on the associated tstatistics resulting from SEM. These three parts are described in detail in the learning networks section.
Learning the initial latent structures
Incorporating a correct latent structure is crucial for reconstructing genetic networks. First, EFA is employed to learn potential latent structures to start the iterative process. EFA is a common practice to ascertain the latent factors that influence the observed variables. Fundamentally, factor analysis assumes that there are some latent factors, fewer in number than genes, that are responsible for the covariation among the observed gene expressions. EFA may be expressed as
where \tilde{x}(t), \tilde{\Lambda} and \tilde{u}(t) are all estimated without taking genegene interaction into account. Specifically, \tilde{x}(t) is (m × 1) the common factors at time t, \tilde{\Lambda} is the n × m latent interaction matrix, in which {\tilde{\lambda}}_{ij} denotes the influence of latent factor j on the expression of gene i at time t estimated without explicitly taking account of genegene interaction, and \tilde{u}(t) is (m × 1) the unique factors at time t that can not be explained by the common factors \tilde{x}(t). Comparing Equations (2) and (3), the latent structure embedded in \tilde{\Lambda}\tilde{x}(t) would deviate from the true one except when the factor \tilde{u}(t) accounts for the effect of genegene interaction, that is, equal to w_{ y }(t  1). This shows that fitting a structural equation model with the latent factors estimated solely by EFA to the gene expressions [16] may not result in the correct latent structure. Therefore, given k latent factors suggested by EFA, we consider three possible numbers of latent factors (k1, k or k+1), along with the associated latent structure in Part 1 of SSEM. The common factors are extracted by a principal component analysis with promax oblique rotation, which rotates factors in order to fit a hypothesized structure of latent factors.
Determining the number of common factors that best explain the observed variables is one of the practical issues in EFA. Various guidelines have been proposed, for instance, eigenvalue ≥ 1 [39] and the scree test [40]. Different guidelines may lead to different choices. Based on the "weaker lower bound" suggested by [40], SSEM searches through k  1, k, and k + 1 common factors and the associated latent structures, where k is the number of common factors with eigenvalues ≥ 1 resulting from EFA. Then, for each given k, the latent structure is obtained by eliminating the links with factor loading less than a constant, which can be specified by users and the default value is 0.2.
Network (model) selection criterion
In the iterations of SSEM, latent factor x(t), and the parameters Λ and w of a given network are estimated, and the goodnessoffit of the network is computed by SEM. SEM is a statistical method to test the hypothesis for the existence of both latent factorgene and genegene interactions. The principal idea of SEM is to minimize the difference between the covariance matrices of the predicted variables and the observed variables. Let Cov(a, b) be the covariance matrix of two random vectors a and b. The LDFM is lag1 in time, so we consider the joined vector of y(t)^{T}and y(t  1)^{T}.
Let S denote the sample covariance matrix, which is defined as
where S_{t,t}= Cov(y(t), y(t)), S_{t1,t}= Cov(y(t  1), y(t)), S_{t,t1}= Cov(y(t), y(t 1)), and S_{t1,t1}= Cov(y(t  1), y(t  1)). Let \widehat{y}(t) be the column vector of the predicted expressions for the n genes at time t. Plugging in \widehat{y}(t  1) and \widehat{y}(t) for y(t  1) and y(t), respectively into the elements of S, we obtain the estimated covariance matrix \widehat{\Sigma}.
In SSEM, the parameters are estimated by the maximum likelihood method with the fitting function
where \widehat{\Sigma} denotes the estimated covariance matrix, S the sample covariance matrix, A and tr(A) the determinant and the trace of matrix A, respectively, and n the number of genes. When the sample size (T) is small, ridge estimation is applied to avoid the singularity of S and \widehat{\Sigma}. In the application section, small ridge constants are applied such that the condition number of S and \widehat{\Sigma} are not larger than 10^{2}.
Among the six MSC's studied, the χ^{2} statistic is based on the idea of minimizing the discrepancy between the estimated and the sample covariance matrices, and it is defined as (T  1) times the minimized value of F_{ML} in Equation (4), where T is the sample size. When the fitting function is F_{ML}, the χ^{2} statistic is equivalent to the generalized likelihood ratio [41]. Assuming multivariate normality, the χ^{2} statistic has an asymptotic (large sample) χ^{2} distribution with (p*  q) degrees of freedom, where p* = (3n^{2} + n)/2 since only {\widehat{\Sigma}}_{t1,t} and the lower triangle matrix of {\widehat{\Sigma}}_{t,t} form equations to estimate parameters. Further, q is the number of parameters that equal to n^{2} + kn in the LDFM in (2). The condition n > k is required to have degrees of freedom p*  q> 0. However, this condition is satisfied in general since k = [n/c], where c ≥ 3. A large sample size can inflate a small difference between S and \widehat{\Sigma}, and thus can inflate the χ^{2} statistic. Numerous indices were proposed to remedy the bias, among them four have been assessed in our pilot studies, namely, χ^{2}/df, χ^{2}  df [42], TLI [43] and CFI [44], where df denotes degree of freedom. The former two were found to be more effective than the latter two for network (model) selection.
MSE between the observed and the predicted gene expressions is defined as {\displaystyle {\sum}_{i=1}^{n}{\displaystyle {\sum}_{t=1}^{T}{({y}_{i}(t){\widehat{y}}_{i}(t))}^{2}/nT}}, where T is the number of time points in the gene expression data. AIC [45] and BIC [46] are two widely used information criteria for model selection, which take model complexity into account. AIC is a measure based on the KullbackLeibler distance between the fitted and the true model, and AIC = 2log L({\widehat{\theta}}_{j}) + 2q_{ j }, where log L({\widehat{\theta}}_{j}) is the loglikelihood with estimates {\widehat{\theta}}_{j}, and q_{ j }is the number of parameters in model j. To solve the inconsistency problem of AIC, Schwarz [44] proposed BIC based on maximization of the posterior choice probability. BIC = 2log L({\widehat{\theta}}_{j}) + q_{ j }log T, where T is the number of time points. To reduce the penalty imposed in BIC, Sclove [47] suggested samplesize adjusted BIC (adjBIC) by replacing T with T*, where T* = (T + 2)/24.
Learning networks through iterated SEM
A genetic network inferred from LDFM can be built by latent factorgene and genegene interactions. A correct network is essential for estimation of genegene interactions using SEM. However, learning the optimal network from data subject to a goodnessoffit index is NPhard. Although global optimization techniques, such as simulated annealing and genetic algorithms, may be applied, the required computation time is not feasible. To make the learning process practical, we propose a stepwise approach. The key idea is to generate a set of candidate networks and retain plausible links by both using SEM and iteratively filtering with a moving window as follows. For any network generated in the iteration, we apply SEM to estimate x(t), Λ and w. The significance of each link (λ_{ ij }and W_{ kl }) is tested by its associated tstatistic. Let t^{i}window (denoted by [{t}_{l}^{i},{t}_{u}^{i}]) be a window of some given lower and upper bounds in the i th iteration to screen for significance of generated links. A link with a tstatistic value greater than {t}_{u}^{i}, within [{t}_{l}^{i},{t}_{u}^{i}] or less than {t}_{l}^{i} is regarded as a candidate link, a possible link or a futile link (denoted by clink, plink and flink), respectively.
Let S_{ c }, S_{ p }and S_{ f }denote the sets of clinks, plinks and flinks, respectively. Suppose that EFA suggests k factors for a given data set. Given fixed k  1, k or k + 1 factors, EFA is applied again to learn the associated latent (factorgene) structures. SSEM begins with the aforementioned latent structure and a fully connected gene structure, namely, each gene is regulated by all genes and k  1, k or k + 1 latent factors. To start the stepwise search, SEM is applied to the initial networks to estimate x(t), Λ and w. For a given initial network, first let the initial t^{0}window be [{t}_{l}^{0},{t}_{u}^{0}]. Then, a set of networks Φ^{0} can be generated as follows. Checking the tstatistics of all links against the t^{0}window [{t}_{l}^{0},{t}_{u}^{0}], we discard all flinks, and retain all clinks, while considering all 0–1 combinations of plinks. Suppose there are l plinks in an initial model, then there are 2^{l}combinations of each plink being included in a model or not. Models including all clinks and each aforementioned combination are considered, and these 2^{l}models can be viewed as generated by the twindow filtering. That is, the twindow serves as a filter to eliminate insignificant (the lesslikelytoexist) links. Specifically, Φ^{0} = {φφ ∈ S_{ c }∪ L_{ p }, ∀ L_{ p }∈ P(S_{ p })}, where P(S_{ p }) = {L_{ p }L_{ p }⊆ S_{ p }} is the power set of S_{ p }and L_{ p }is a subset of S_{ p }. Furthermore, we apply SEM to each candidate network in Φ^{0} to obtain the prespecified goodnessoffit index. To save computation time and to ensure that the superior networks are kept for the next iteration, only the top m networks (denoted by Ω^{0}) are retained for the next iteration.
Similar to the initial iteration, for each iteration i ≥ 1, SSEM generates a set of candidate networks by t window filtering all networks generated by the top m networks from iteration(i1), i.e., Ω^{i1}, with k  1, k or k + 1 factors, respectively, to form Φ^{i}. So in total, there are 3 m seed models to generate networks. Among the networks in Φ^{i}∪ Ω^{i1}, only the top m networks (Ω^{i}) are retained by the specified goodnessoffit index for iteration (i + 1). First, we let the t^{i}window equal to the t^{i1}window +c. We use c = 0.1, but c can be other small constants. Again, given the t^{i}window, each link in the j th network in Ω^{i1}can be discarded, retained or considered according to its tstatistic value. We denote the collection of these flinks, clinks, and plinks by S_{ jf }, S_{ jc }, and S_{ jp }, respectively. A set of candidate networks is generated by retaining all clinks and considering all 0–1 combinations of plinks with k  1, k or k + 1 factors in the model, and this set is denoted by {\Phi}_{j}^{i}=\left\{\phi \phi \in {S}_{jc}\cup {L}_{p},\forall {L}_{p}\in P({S}_{jp})\right\} = {φφ ∈ S_{ jc }∪ L_{ p }, ∀ L_{ p }∈ P(S_{ jp })}. We combine all the generated sets to result in the i th set of networks {\Phi}^{i}={\displaystyle \underset{\forall j}{\cup}{\Phi}_{j}^{i}}.
Evaluating the specified goodnessoffit index for every network in Φ^{i}, we obtain the top m scored networks from Φ^{i}∪ Ω^{i1}, which form Φ^{i}, to go to iteration (i + 1). The iteration terminates if the specified goodnessoffit index can not be further improved.
The proposed SSEM algorithm
Initialization
Fit EFA to a given data set to determine the number of factors, say k.

Apply EFA to generate three initial networks by estimating the latent structures with k  1, k or k + 1 latent factors, respectively.

For given k factors, obtain the latent structure by eliminating the links with factor loading less than a constant (the default value used is 0.2).

Specify an MSC.
Stepwise search

For each initial network:
Step 1. Set iteration i = 0, run SEM on the data set. Specify the t^{0}window = [{t}_{l}^{0},{t}_{u}^{0}]. Generate a set of networks that consist of all clinks and one of all the 0–1 combinations of plinks. Compute the MSC of all networks and select the top m models to form the candidate set Ω^{0}.
Step 2. Set i = i + 1. Specify the t^{i}window [{t}_{l}^{i},{t}_{u}^{i}]=[{t}_{l}^{i1},{t}_{u}^{i1}]+0.1.
Step 3. Similarly to Step 1, for each network in Ω^{i1}, generate a set of networks, and form Φ^{i}
Step 4. Evaluate the MSC for all networks in Φ^{i}, and choose the best m networks from Φ^{i}∪ Ω^{i1}to form the i th candidate set Ω^{i}.
Step 5. If the i th top 1 MSC = the (i1)th top 1 MSC, stop ; Otherwise, go to Step 2.

Select the best m networks from the union of all networks generated by different initial guesses.
Availability and requirements
Project home page is in [48]. SSEM algorithm is written in Visual C++ 6.0, and it calls SAS 8.2 and Mplus 3.0. Program runs under Windows 2000 or higher version operating system. The zipped code of SSEM is attached in Additional file 7. Visual C++, SAS and Mplus are readily available for purchase through Microsoft, SAS and Mplus, respectively.
References
Lesage G, Sdicu AM, Menard P, Shapiro J, Hussein S, Bussey H: Analysis of β 1, 3glucan assembly in S. cerevisiae using a synthetic interaction network and altered sensitivity to caspofungin. Genetics 2004, 167: 35–49. 10.1534/genetics.167.1.35
Kafri R, BarEven A, Pilpel Y: Transcriptional control reprogramming in genetic backup circuits. Nature Genetics 2005, 37: 295–299. 10.1038/ng1523
Tong AH, Evangelista M, Parsons AB, Xu H, Bader GD, Page N, Robinson M, Raghibizadeh S, Hogue CW, Bussey H, Andrews B, Tyers M, Boone C: Systematic genetic analysis with ordered arrays of Yeast deletion mutants. Science 2001, 294: 2364–2366. 10.1126/science.1065810
Tong AH, Lesage G, Bader GD, Ding H, Xu H, Xin X, Young J, Berriz GF, Brost RL, Chang M, Chen Y, Cheng X, Chua G, Friesen H, Goldberg DS, Haynes J, Humphries C, He G, Hussein S, Ke L, Krogan N, Li Z, Levinson JN, Lu H, Menard P, Munyana C, Parsons AB, Ryan O, Tonikian R, Roberts T, Sdicu AM, Shapiro J, Sheikh B, Suter B, Wong SL, Zhang LV, Zhu H, Burd CG, Munro S, Sander C, Rine J, Greenblatt J, Peter M, Bretscher A, Bell G, Roth FP, Brown GW, Andrews B, Bussey H, Boone C: Global mapping of the Yeast genetic interaction network. Science 2004, 303: 808–813. 10.1126/science.1091317
Hartman JL, Garvik B, Hartwell L: Principles for the buffering of genetic variation. Science 2001, 291: 1001–1004. 10.1126/science.291.5506.1001
Pan X, Ye P, Yuan DS, Wang X, Bader JS, Boeke JD: A DNA integrity network in the yeast Saccharomyces cerevisiae. Cell 2006, 124: 1069–1081. 10.1016/j.cell.2005.12.036
Shieh GS, Chen CM, Yu CY, Huang J, Wang WF: A stepwise structural equation modeling algorithm to reconstruct genetic networks. In Technical Report C2005–04. Institute of Statistical Science, Academia Sinica, Taiwan; 2005.
Schäfer J, Strimmer K: An empirical Bayes approach to inferring largescale gene association networks. Bioinformatics 2005, 21: 754–764. 10.1093/bioinformatics/bti062
Friedman N, Linial M, Nachman I, Pe'er D: Using Bayesian networks to analyze expression data. Journal of Computational Biology 2000, 7: 601–620. 10.1089/106652700750050961
Kim SY, Imoto S, Miyano S: Inferring gene networks from time series microarray data using dynamic Bayesian networks. Briefings in Bioinformatics 2003, 4: 228–235. 10.1093/bib/4.3.228
Perrin BE, Ralaivola L, Mazurie A, Bottani S, Mallet J, d'AlchéBuc F: Gene networks inference using dynamic Bayesian networks. Bioinformatics 2003, 19: ii138ii148. 10.1093/bioinformatics/btg1071
Beal MJ, Falciani F, Ghahramani Z, Rangel C, Wild DL: A Bayesian approach to reconstructing genetic regulatory networks with hidden factors. Bioinformatics 2005, 21: 349–356. 10.1093/bioinformatics/bti014
Kimura S, Ide1 K, Kashihara A, Kano M, Hatakeyama M, Masui R, Nakagawa N, Yokoyama S, Kuramitsu S, Konagaya A: Inference of Ssystem models of genetic networks using a cooperative coevolutionary algorithm. Bioinformatics 2005, 21: 1154–1163. 10.1093/bioinformatics/bti071
Kaplan D: Structural equation modeling: Foundation and extensions. Sage Publication: Thousand Oaks, California, USA; 2000.
Zhou X, Kao MC, Wong WH: Transitive functional annotation by shortestpath analysis of gene expression data. Proc Natl Acad Sci USA 2002, 99: 12783–1278. 10.1073/pnas.192159399
Xie J, Bentler PM: Covariance structure models for gene expression microarray data. Structural Equation Modeling 2003, 10: 566–582. 10.1207/S15328007SEM1004_5
Muthén LK, Muthén BO: Mplus User's Guide. Los Angeles, CA: Muthén & Muthén; 2004.
Van Someren EP, Wessels LFA, Backer E, Reinders MJT: Genetic network modelling. Pharmacogenomics 2002, 3: 507–525. 10.1517/14622416.3.4.507
Hoops LL, Budd M, Choe W, Weitao T, Campbell JL: Mutations in DNA replication genes reduce yeast life span. Mol Cell Biol 2002, 22: 4136–4146. 10.1128/MCB.22.12.41364146.2002
Klungland A, Lindahl T: Second pathway for completion of human DNA base excisionrepair: reconstitution with purified proteins and requirement for DNaseIV (FEN1). EMBO J 1997, 16: 3341–3348. 10.1093/emboj/16.11.3341
Tishkoff DX, Filosi N, Gaida GM, Kolodner RD: A novel mutation avoidance mechanism dependent on S. cerevisiae RAD27 is distinct from DNA mismatch repair. Cell 1997, 88: 253–263. 10.1016/S00928674(00)818462
Negritto MC, Qiu J, Ratay DO, Shen B, Bailis AM: Novel function of Rad27 (FEN1) in restricting shortsequence recombination. Mol Cell Biol 2001, 21: 2349–2358. 10.1128/MCB.21.7.23492358.2001
Xie J, Qian M, Gong G: Reversible algorithm of simulating multivariate densities with multihump. Science in China Series A 2001, 44: 357–364. 10.1007/BF02878717
Spellman PT, Sherlock G, Zhang MQ, Iyer VR, Anders K, Eisen MB, Brown PO, Botstein D, Futcher B: Comprehensive identification of cell cycleregulated genes of the yeast Sarcharomyces cerevisiae by microarray hybridization. Mol Biol Cell 1998, 9: 3273–3297.
Yeast Cell Cycle Analysis project[http://cellcyclewww.stanford.edu]
Rangel C, Angus J, Ghahramani Z, Lioumi M, Sotheran E, Gaiba A, Wild DL, Falciani F: Modelling Tcell activation using gene expression profiling and state space models. Bioinformatics 2004, 20: 1361–1372. 10.1093/bioinformatics/bth093
Ira G, Malkova A, Liberi G, Foiani M, Haber JE: Srs2 and Sgs1Top3 suppress crossovers during doublestrand break repair in yeast. Cell 1999, 115: 401–411. 10.1016/S00928674(03)008869
Lee SK, Johnson RE, Yu SL, Prakash L, Prakash S: Requirement of yeast SGS1 and SRS2 genes for replication and transcription. Science 1999, 286: 2339–2342. 10.1126/science.286.5448.2339
Fricke WM, Brill SJ: Slx1Slx4 is a second structurespecific endonuclease functionally redundant with Sgs1Top3. Genes Dev 2003, 17: 1768–1778. 10.1101/gad.1105203
Fabre F, Chan A, Heyer WD, Gangloff S: Alternate pathways involving Sgs1/Top3, Mus81/Mms4, and Srs2 prevent formation of toxic recombination intermediates from singlestranded gaps created by DNA replication. Proc Natl Acad Sci USA 2002, 99: 16887–16892. 10.1073/pnas.252652399
Onoda F, Seki M, Miyajima A, Enomoto T: Involvement of SGS1 in DNA damageinduced heteroallelic recombination that requires RAD52 in Saccharomyces cerevisiae . Mol Gen Genet 2001, 264: 702–708. 10.1007/s004380000358
Ooi SL, Shoemaker DD, Boeke JD: DNA helicase gene interaction network defined using synthetic lethality analyzed by microarray. Nat Genet 2003, 35: 277–286. 10.1038/ng1258
McVey M, Kaeberlein M, Tissenbaum HA, Guarente L: The short life span of Saccharomyces cerevisiae sgs1 and srs2 mutants is a composite of normal aging processes and mitotic arrest due to defective recombination. Genetics 2001, 157: 1531–1542.
Lo YC, Paffett KS, Amit O, Clikeman JA, Sterk R, Brenneman MA, Nickoloff JA: Sgs1 regulates gene conversion tract lengths and crossovers independently of its helicase activity. Mol Cell Biol 2006, 26: 4086–4094. 10.1128/MCB.0013606
Collins SR, Miller KM, Maas NL, Roguev A, Fillingham J, Chu CS, Schuldiner M, Gebbia M, Recht J, Shales M, Ding H, Xu H, Han J, Ingvarsdottir K, Cheng B, Andrews B, Boone C, Berger SL, Hieter P, Zhang Z, Brown GW, Ingles CJ, Emili A, Allis CD, Toczyski DP, Weissman JS, Greenblatt JF, Krogan NJ: Functional dissection of protein complexes involved in yeast chromosome biology using a genetic interaction map. Nature 2007. 10.1038/nature05649
Lemmens K, Dhollander T, Bie TD, Monsieurs P, Engelen K, Smets B, Winderickx J, Moor BD, Marchal K: Inferring transcriptional modules from ChIPchip, motif and mircoarray data. Genome Biology 2006, 7: R37. 10.1186/gb200675r37
Tsai HK, Lu HHS, Li WH: Statistical methods for identifying yeast cell cycle transcription factors. Proc Natl Acad Sci 2005, 12: 13532–13537. 10.1073/pnas.0505874102
HK Tsai, GT Huang, MY Chou, HH Lu, WH Li: Method for identifying transcription factor binding sites in yeast. Bioinformatics 2006, 22: 1675–1681. 10.1093/bioinformatics/btl160
Guttman L: Some necessary conditions for commonfactor analysis. Psychometrika 1954, 19: 149–161. 10.1007/BF02289162
Cattell RB: The scree test for the number of factors. Multivariate behavioural research 1966, 1: 245–276. 10.1207/s15327906mbr0102_10
Kline RB: Principles and practice of structural equation modeling. The Guilford Press: New York NY, U.S.A; 1998.
Jöreskog KG: A general approach to confirmatory maximum likelihood factor analysis. Psychometrika 1969, 34: 183–202. 10.1007/BF02289343
Tucker LR, Lewic C: A reliability coefficient for maximum likelihood factor analysis. Psychometrika 1973, 38: 1–10. 10.1007/BF02291170
Bentler PM: Comparative fit indices in structural equation models. Psychological Bulletin 1990, 107: 238–246. 10.1037/00332909.107.2.238
Akaike H: Information theory and an extension of the maximum likelihood principle. In 2nd International Symposium on Information Theory. Edited by: Petrov BN, Csaki F. Akademiai Kiado, Budapest; 1973:267–281.
Schwarz G: Estimating the dimension of a model. Annals of Statistics 1978, 6: 461–464. 10.1214/aos/1176344136
Sclove SL: Application of modelselection criteria to some problems in multivariate analysis. Psychometrika 1987, 52: 333–343. 10.1007/BF02294360
Acknowledgements
The authors wish to thank ChiaChang Wang and JyeJung Chang for computational assistance, Drs. TingFang Wang, ChihHung Jen, John Aston and Ivan Chang for constructive discussions in biology (the former two) and in statistics, especially Dr. Wang for kindly providing us with the qRTPCR results. This work was supported in part by NSC grant 922118M001023 and thematic grant ASTP 2333 to G.S.S.; C.Y.Y. was supported by NSC postdoctoral fellowship 922811M001037 and 932811M001071.
Author information
Authors and Affiliations
Corresponding author
Additional information
Authors' contributions
GS and CC conceived the study, devised the method, and supervised methodology and implementation. CY carried out the method and part of simulation, and wrote an early draft of Shieh et al. in [7]. JH wrote and automated the code. WW participated in implementation. CC and YL wrote part of the paper. GS wrote the paper and coordinated the entire work. All of the authors have read and approved the final manuscript.
Grace S Shieh, ChungMing Chen contributed equally to this work.
Electronic supplementary material
12859_2007_2119_MOESM2_ESM.pdf
Additional file 2: qRTPCR. Description of the design of qRTPCR experiments and how the experiments were conducted to confirm the predicted TC and TD interactions. (PDF 38 KB)
12859_2007_2119_MOESM3_ESM.pdf
Additional file 3: Simulation. The description of the 6 and 10gene networks, and the results of EBVM applied to the two networks. (PDF 246 KB)
12859_2007_2119_MOESM4_ESM.pdf
Additional file 4: BayesianNW. The 6gene network predicted by the three Bayesian network algorithms in Beal et al. (2005), Rangel et al. (2004) and Perrin et al. (2003). (PDF 89 KB)
12859_2007_2119_MOESM5_ESM.pdf
Additional file 5: MultiAR(1). The result of fitting multivariate AR(1) straightforwardly to the real data for the 6gene network. (PDF 211 KB)
12859_2007_2119_MOESM6_ESM.pdf
Additional file 6: SSLTCNW. TC networks of SSL gene pairs. Description of how TC and TD interactions of SGS1 and RAD27 SSL gene pairs were predicted by SSEM. (PDF 389 KB)
12859_2007_2119_MOESM7_ESM.zip
Additional file 7: SSEMalgorithm. The zipped file consists of the standalone executable (.exe) file of SSEM. (ZIP 212 KB)
Rights and permissions
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
About this article
Cite this article
Shieh, G.S., Chen, CM., Yu, CY. et al. Inferring transcriptional compensation interactions in yeast via stepwise structure equation modeling. BMC Bioinformatics 9, 134 (2008). https://doi.org/10.1186/147121059134
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/147121059134
Keywords
 Structural Equation Modeling
 Bayesian Network
 Exploratory Factor Analysis
 Bayesian Information Criterion
 Latent Factor