 Methodology Article
 Open access
 Published:
Iterative subnetwork component analysis enables reconstruction of large scale genetic networks
BMC Bioinformatics volumeÂ 16, ArticleÂ number:Â 366 (2015)
Abstract
Background
Network component analysis (NCA) became a popular tool to understand complex regulatory networks. The method uses highthroughput gene expression data and a priori topology to reconstruct transcription factor activity profiles. Current NCA algorithms are constrained by several conditions posed on the network topology, to guarantee unique reconstruction (termed compliancy). However, the restrictions these conditions pose are not necessarily true from biological perspective and they force network size reduction, pruning potentially important components.
Results
To address this, we developed a novel, Iterative SubNetwork Component Analysis (ISNCA) for reconstructing networks at any size. By dividing the initial network into smaller, compliant subnetworks, the algorithm first predicts the reconstruction of each subntework using standard NCA algorithms. It then subtracts from the reconstruction the contribution of the shared components from the other subnetwork. We tested the ISNCA on real, large datasets using various NCA algorithms. The size of the networks we tested and the accuracy of the reconstruction increased significantly. Importantly, FOXA1, ATF2, ATF3 and many other known key regulators in breast cancer could not be incorporated by any NCA algorithm because of the necessary conditions. However, their temporal activities could be reconstructed by our algorithm, and therefore their involvement in breast cancer could be analyzed.
Conclusions
Our framework enables reconstruction of large gene expression data networks, without reducing their size or pruning potentially important components, and at the same time rendering the results more biological plausible. Our ISNCA method is not only suitable for prediction of key regulators in cancer studies, but it can be applied to any highthroughput gene expression data.
Background
Gene expression is a highly regulated process and difficult to understand without computer added tools. The relationship between target genes (TG) and their regulators, the transcription factors (TF), is complex and even simple gene expression studies usually incorporate hundreds of TGs, TFs and the relationship between them. Several statistical methods including principal component analysis (PCA), singular value decomposition (SVD), independent component analysis (ICA), partial least squares regression (PLSR) and their variants were successfully applied on expression data to extract biologically significant knowledge [1â€“4]. However, these methods incorporate statistical assumptions, either of orthogonality and/or statistical independence which are not true for biological data [5]. Network component analysis (NCA) attempts to overcome these limitations [6]. The NCA integrates gene expression and a priori TFTG connectivity data (known relationships obtained from previous experiments) and computes the activities of the TFs and the connectivity strength of each TF to their TGs. The decomposition of the gene expression matrix (termed E) into a topology (termed A, relating the observed TF and TG expression covariance patterns) and a temporal score matrix (termed P, describing the TF activity development pattern), according to model:
This is achieved by solving a bilinear least squares optimization problem. In order to guarantee a unique solution up to scale, the matrices A and P are subjected to three conditions, termed as NCA criteria (see â€˜Methodsâ€™) [6]. Briefly speaking, the first condition implies that there cannot be two or more TFs with the same regulatory functionality. This makes little sense, because it is well known that redundancy is very common in living systems, as it contributes to robustness [7]. Another condition implies that there cannot be two or more TFs or TF combinations with the same temporal behavior, but again it is not consistent with our knowledge that TFs often work cooperatively [8, 9]. Therefore, these conditions imply restrictions that do not seem plausible from biological perspective. Moreover, these conditions pose necessary restrictions on the size and structure of the network [6], and the problem with the current solutions is that in order to avoid false discovery (outcome of nonunique solutions), they usually reduce the size of the network significantly, losing in the process potentially important components. Therefore, we seek to avoid these restrictions if possible.
The original NCA algorithm suffered from unstable solutions due to illconditioned matrices and multiple local solutions. Tikhonov regularization method (termed as GNCAr) overcomes these two issues but is computationally expensive for solving larger networks [10]. Fast network component analysis (FastNCA) is a stable and fast approach, up to several hundred times faster than GNCAr but limited to smaller networks [11]. Recently, the robust network component analysis (ROBNCA) was developed that offers a stable, efficient and accurate solution, by explicitly modeling the presence of outliers in the microarray data [12]. Whereas these approaches were focused primarily on improving the accuracy of reconstruction, they were all subjected to the same (limiting) criteria mentioned above, that force reduction of the network size. The issues of limited network size and removal of key TFs from the network to satisfy the NCA conditions were the focus of several research groups [10â€“16]. For instance, the division of large networks into smaller, overlapping NCA compliant ones helped to reconstruct some of the shared components. However, this approach treated the subnetworks independently, as if they were obtained from different datasets. It ignores the interconnections exist between the subnetworks. More specifically, when computing the least square of one subnetwork using this method, the contributions of the shared TGs and TFs from all the other subnetworks are ignored, consequentially loosing valuable information. It is a heuristic approach and works only for specific network configurations, but does not work for the general case [13].
We propose a novel algorithm, termed Iterative SubNetwork Component Analysis (ISNCA), which solves compliant subnetworks, and iterates between them in order to provide a solution to the complete, possible incompliant, network. The ISNCA predicts a solution using a standard NCA algorithm on one subnetwork to update the common components in the expression matrix of the other. Then the ISNCA predicts the solution of the other subnetwork (using the same standard NCA algorithm), in order to update the first one. This is done iteratively until the error reconstruction of the entire network (see â€˜Methodsâ€™) convergences to a minimum. We tested first the performance of the ISNCA algorithm against the common GNCAr [10] for a small synthetic network that is compliant (i.e. satisfying the three necessary conditions). Secondly, we compared the performance of ISNCA iterating on a small, synthetic, incompliant network that was divided into two compliant subnetworks. We applied the ISNCA using GNCAr, FastNCA [11] and ROBNCA [12], to solve the entire network in an iterative manner. We compared also the stability of the iterations and the accuracy of the complete network solution. Finally we tested our proposed algorithm on two, full scale, independent, real biological expression data, each containing hundreds of genes with more than 200 network configurations. We compared the solutions of the ISNCA to standard NCA algorithms, and showed that our proposed method retains many essential components in breast cancer studies, that otherwise were removed by standard NCA.
Methods
Network component analysis algorithms
Network component analysis algorithms decompose gene expression data matrix into a weighted topology TFTG matrix and the temporal profile matrix of the TFs. The model can be represented in the matrix form as follows:
where, \(E\in \mathbb {R}^{n\times m}\) represents an expression matrix, \(A\in \mathbb {R}^{n\times l}\) represents the initial connectivity matrix, defining the sign and size of how each of the n target genes involved in this network are linked to each of the l transcription factors involved, in terms of l regulatory patterns. \(P\in \mathbb {R}^{l\times m}\) represents the TF activity matrix, defining how each of the l regulatory transcription factor pattern develops over time. The index m is the number of time points or measurement conditions. The decomposition of E into A and P is achieved by solving a bilinear alternating least squares optimization problem subjected to three conditions termed as NCA criteria: (i) the connectivity matrix A should be fullcolumn rank; this means that each of the l transcription factor patterns in this network contribute some unique variation, so that the number of independent transcription factor patterns equals the number of TFs included. Otherwise they may be difficult to observe experimentally. (ii) If a column is removed from A as well as TGs connected to it, the resulting matrix still should be fullcolumn rank; (iii) TF activity matrix P should be fullrow rank, which means that the temporal behavior of each of the l regulatory patterns should have different kinetics  otherwise they cannot be distinguished experimentally.
Iterative subnetwork component analysis (ISNCA)
We propose a novel algorithm, the iterative subnetwork component analysis (ISNCA), that iterates between NCA compliant, overlapping subnetworks (Fig. 1). These subnetworks share common TGs in order to solve larger, and most importantly, NCA incompliant networks. In order to apply the ISNCA, we first divide the network into two compliant subnetworks. The expression and connectivity matrices for each subnetwork can be represented by
and
with \(E_{\textit {ui}}\in \mathbb {R}^{nui\times m}\) and \(E_{c}\in \mathbb {R}^{nc\times m}\) denote the expression matrices of subnetworks i=1,2, the index c denotes the common components, ui are the unique components of subnetwork i, and \(A_{\textit {ui}}\in \mathbb {R}^{nui\times lui} A_{\textit {ci}}\in \mathbb {R}^{nc\times lui}\) are the partition matrices of A. Assuming no TFs are shared between the networks, the decomposition of P is simply \(P_{i}=P_{\textit {ui}}\in \mathbb {R}^{lui\times m}\). A graphical representation of the approach is shown in Fig. 1. In all the following, when we write A _{ i }, E _{ i } or P _{ i }, we refer to matrices of the entire subnetwork i, including both its exclusive and common components.
The entire network can be described in the following manner:
The matrices \(\mathbf O_{1}\in \mathbb {R}^{nu2\times lu1}\) and \(\mathbf O_{2}\in \mathbb {R}^{nu1\times lu2}\) denote zero matrices. Assuming that P does not have common components, the corresponding partitions of E and P can be obtained as follows:
where \(P_{\textit {ui}}\in \mathbb {R}^{lui\times m}\) are the activities of the unique TFs of subnetwork i.
Example 1.
Network decomposition: Consider the network presented in Case Study 2 (Fig. 3 a). The connectivity matrix A can be decomposed to the exclusive components and the common components in the following manner:
and partition matrices for subnetworks 1 and 2 respectively are,
To initialize the ISNCA algorithm, we divide the expression matrix, E to E _{ i } using Eq. 3 and connectivity matrix, A to A _{ i } using Eq. 4. At the start of each iteration k, we compute solution to âˆ¥E _{ i }(k)âˆ’A _{ i } P _{ i }âˆ¥, separately for subnetworks 1 and 2 using any standard NCA method, and obtain \(\hat A_{i}(k)\) and \(\hat P_{\textit {ui}}(k)\). We can then proceed to construct \(\hat A(k)\) and \(\hat P(k)\) by combining Eqs. 4 and 5, as
and calculate the error of the entire network,
If the error does not converge (see below), we proceed to update the subnetworks in the following manner. Let T _{ i }(k) be the common TGs contribution from subnetworks i, that is,
We then update the matrices E _{1} and E _{2} for next iteration, from Eq. 11 by subtracting the common TGs contribution from other subnetwork, that is,
Here, Î´âˆˆ [0,1] denotes the attenuation factor (see below for details). Notice that E _{ c } and E _{ ui } do not change from iteration to iteration as they represent the original expression matrices. We then proceed to the next iteration and predict the solution to the expression âˆ¥E _{ i }(k)âˆ’A _{ i } P _{ i }âˆ¥ using standard NCA methods. We keep iterating until the reconstruction error in Eq. 9 for the entire network is sufficiently small, for instance by
In our simulations, we set Îµ to be 1e05 and maximum number of iterations to 100.
Microarray data and preprocessing
The microarray data used in this case study was obtained by treating the MCF7 breast cancer cells with two growth factors (Epidermal growth factor, EGF and Heregulin, HRG) at different time points over a period of 0â€“72 hours [17]. We downloaded the data from GEO data base with array express accession number: GSE13009. We applied loess normalization within time points and quantile normalization across time points. The expression values were averaged over two replicate measurements. We conducted ttests to identify differentially expressed genes (DEGs). The DEGs with pvalue < 0.05 and fold change >1.5 at more than 2 time points were selected. All the computations were performed in the MATLAB bioinformatics toolbox.
We downloaded the experimentally verified TFTG interaction data from TFactS database [18]. This database includes interaction from TRED, TRDD, PAZAR, NFIregulomeDB databases and their own experimental predictions. This database provides â‰ˆ 7000 interactions between 2700 TGs and 330 TFs. To test the algorithmâ€™s performance on an independent topology data acquired elsewhere, we downloaded the TFTG interaction data from HTRIdb database developed by Bovolenta et al. [19].
Generation of synthetic data and network configurations
We created 100 different expression matrices for each case study, by randomly generating A and fixed P matrices according to Eq. 2. We used Gaussian distribution to generate random elements of A (both positive and negative values), while keeping its null space. We used Matlab function ^{â€²} randn ^{â€²} for this purpose. The different network configurations for EGF and HRG systems are generated as follows. First, we identified two NCA compliant subnetworks. Then, a subset of the components of each subnetwork is randomly selected by randomly removing one or more TF, with their corresponding TGs. Then each of the new subnetworks are checked for NCA criteria. In this way, we generated 100 NCA compliant network configurations for each system.
Statistical analysis and calculations
All the calculations were performed using Matlab R12 (Mathworks Inc.). The standard NCA algorithms (GNCAr, FastNCA and ROBNCA) are downloaded from respective websites which are publicly available. The full ISNCA algorithm is available for download at the corresponding authorâ€™s website.
Gene ontology analysis
The significantly enriched gene ontology terms or biological processes are identified using the GOrilla tool developed by Eden et al. [20].
Results
We first tested the algorithm on a small toy network containing four TGs, two TFs and six interactions (Fig. 2 a). The gene expression matrix incorporated three time points, and the TF profiles of the network was reconstructed using Eq. 2 (see â€˜Methodsâ€™). The complete network satisfies the NCA conditions, and can therefore be solved by NCAbased methods. We wanted to examine the accuracy of our iterative approach layered on a standard NCA method (GNCAr) and compare it with the same standard GNCAr method that solves the entire network. We generated 100 random initial E matrices and applied ISNCA and GNCAr to reconstruct A, and P. The mean reconstruction error (see â€˜Methodsâ€™) of the ISNCA method was significantly lower (p < 10^{âˆ’12}, KruskalWallis test; n = 100) compared with the GNCAr (Fig. 2 b). The ISNCA yielded error of less than 0.04 in 91 % of the simulations (91/100), compared to 50 % (50/100) by the GNCAr (Fig. 2 c). The reconstruction errors of 100 simulations converged after 3â€“5 iterations (Fig. 2 d) and stayed stable thereafter, with a sharp drop already after the second iteration.
We constructed a more complex example in which the entire network was incompliant, i.e. the conditions that guarantee a unique solution up to a scale matrix were not satisfied (Fig. 3 a, red shaded). The common procedure of NCA based methods is then to reduce the network size, for instance by removing TF4 and its corresponding genes TG1 and TG7. We divided this complete network into two subnetworks (Fig. 3 a, green shaded), each with two TFs, with six genes in the first subnetwork and 4 genes in the second. Notice that TF4 is also a target gene for TF2. Recall also that it is not possible to guarantee a unique solution to the entire network using standard NCA methods, so comparison of the ISNCA to these is not feasible. We tested the reconstruction of the complete network by the ISNCA, layered with three different NCA methods; GNCAr, Fast NCA and ROBNCA (see â€˜Methodsâ€™). All the three ISNCA layers converged to a stable solution with a sharp drop after 3â€“5 iterations (Fig. 3 b). The mean error of the ISNCA (GNCAr) and the ISNCA (FastNCA) were significantly lower (p < 10^{âˆ’4}, oneway ANOVA; n = 100) than the ISNCA (ROBNCA) (Fig. 3 c). ISNCA (FastNCA) was the most accurate approach for this network, with 68 % the simulations resulted in error of less than 0.001. In contrast, more than 69 % of the simulations by the ISNCA (ROBNCA) produced error larger than 0.1.
To test the ability of the ISNCA algorithms to reconstruct large, real biological networks, we finally used two microarray gene expression matrices for the epidermal growth factor (EGF) and heregulin (HRG) stimuli systems [17]. We generated (see â€˜Methodsâ€™) 100 network configurations for each system, consisting of different sets of TFs and TGs based on the interaction data downloaded from TFactS database [18]. Each of the networks generated was relatively large (See Table 1 for network size comparison reconstructed by ISNCA and any NCA algorithm, GNCAr) and initially (before the network reduction procedure), did not satisfy the conditions for uniqueness of the solution. We tested our iterative algorithm with two layouts, the GNCAr and ROBNCA. The FastNCA algorithm can reconstruct network size with maximum TFs equal to number of experimental time points, and therefore could not be used to reconstruct these large networks. Reconstruction with ISNCA was successful in all the 200 trials. The ISNCA algorithm converged relatively fast, after about 5 iterations in all the 100 simulations tested for each EGF (Fig. 4 a) and HRG (Fig. 4 b) systems. We found that ISNCA (ROBNCA) performed better than the ISNCA (GNCAr), with a lower mean error for both EGF (Fig. 4 c) and HRG (Fig. 4 d) systems.
Selection of delta and convergence properties
We also tested the convergence properties of the ISNCA algorithm as a function of the attenuation factor, Î´ (see â€˜Methodsâ€™). We began Î´ with a fixed value in the first iteration (k = 1), and changed this value at the second iteration and onwards (Fig. 5 a upper panel). We found that the response to the values Î´ (k = 1) = 0.5 and Î´(k >1) = 1 was a stable and fast convergence compared with the other functions (Fig. 5 a lower panel; see also Figure S1 in the Additional file 1). Of the functions we tested, only values of Î´ larger than 1 resulted in divergence. We found that the value of Î´ at the first iteration is also important, with 0.5 being the optimal value (Fig. 5 b upper panel, red) yielding best convergence (Fig. 5 b lower panel, red). Other increasing or decreasing values Î´ of with iteration were found to yield nonoptimal convergence properties (see Additional file 1: Figure S1).
Discussion
Reconstruction of complex transcriptional networks from expression data is a common approach that helps to understand cellular signaling and gene regulation. These reconstructions by existing NCA algorithms are limited to relatively modest network sizes because of the necessary criteria (See â€˜Methodsâ€™). The most common reduction procedure to satisfy the criteria is the pruning approach [10]. In our case, it removed up to 34 % (224/653) of the network connections, and almost 40 % (25/64) of the initial TFs (Table 1). The ISNCA approach reconstructed relatively the larger networks in terms of number of TFs, TGs and interactions between them compared to standard NCA algorithms (Table 1). We repeated the same analysis, comparing network sizes reconstructed by ISNCA approach with standard NCA algorithms, but this time using an independent TFTG topology, acquired from HTRIdb (see â€˜Methodsâ€™) [19]. The ISNCA algorithm demonstrated again superior performance to the current NCA algorithm (see Table S1 in Additional file 1), indicating that ISNCA performance is not limited by the quality of the TFTG interactions.
There is no direct manner to control which TFs are pruned, and potentially removing TFs that may be important for a specific study. To demonstrate the consequences of this, we tested and compared the standard NCA algorithm with the ISNCA using 100 network configurations and microarray data obtained from breast cancer cells treated with EGF (Fig. 4). Firstly, the transcription factor FOXA1 (forkhead box protein A1) that is known to be strongly involved in breast cancer [21â€“23] was removed by the NCA algorithm in 84 % (84/100) of the configurations (it was retained 100 % by the ISNCA), loosing the ability to study its effect on the network. This occured despite the importance of FOXA1 in process involved in cancer development: it forms a strong network with ER Î± (estrogen receptoralpha) and GATA3 (GATA binding protein 3) and controls the gene expression pattern in luminal subtype A breast cancers [21]. In addition, it is shown to be a potential prognosis marker and a significant predictor of good outcome in breast cancer [23]. Secondly, the activating transcription factor 2 (ATF2) is also strongly involved in breast cancer studies [24â€“26] and was removed by the NCA prior to the analysis. The active ATF2 regulates the genes MMP2 and MMP9 in the transforming growth factor (TGF Î²) induced MCF10A human breast epithelial cells, and induces migration and invasion of MCF10A cells [26]. Additionally, ATF2 regulates the transcription of maspin and GADD45 Î± genes in mammary tumors [25]. Thirdly, ATF3 is known to be strongly involved as both tumor suppressor and an oncogene in breast cancer cells, and was proposed as potential therapeutic target in breast cancer treatment [27â€“29]. ATF3 upregulates the genes TWIST1, fibronectin (FN)1, SNAIL and SLUG in MCF10A cancer cells [27]. Together with FOXA1 and ATF2, ATF3 was completely removed from the network, reducing the possibility that these regulators could be analyzed and targeted. Similarly, many other pivotal TFs in breast cancer were removed by NCA (Table 2) but retained by ISNCA, which not only reconstructed large gene regulatory networks, but also retained their key components.
We repeated the same analysis on an independent microarray data set in order to demonstrate the biological importance of the ISNCA and its implications on cancer studies. Here we analyzed the data set obtained from breast cancer cells treated with HRG (see â€˜Methodsâ€™ and Table 3). The ISNCA persistently retained several key TF that we suspected were relevant to the breast cancer studies, whereas these TF were removed by other standard NCA algorithms. By closer examination of the TGs which are regulated by those TFs, we found (Table 4 and Additional file 1: Table S2) that they are strongly involved in biological processes relevant to breast cancer studies. What appears to be a simple pruning of several TFs by the standard NCA algorithms (consequentially eliminates their corresponding TGs) may impair our analysis of the data, and weaken our understanding of the processes involved in cancer.
In addition to the downsizing the network, the original NCA criteria seem very harsh from biological perspective. Condition I of fullcolumn rank on connectivity matrix A, means that there cannot be two or more TFs or TF combinations with the same regulatory functionality (see â€˜Methodsâ€™). Condition III of fullrow rank on TF activity matrix, P implies that there cannot be two or more TFs or TF combinations with the same temporal behavior. Both restrictions produce conservative solutions that are not always acceptable in biological processes. Our approach avoids these restrictions. Contrary to solving overlapping subnetworks independently [13], our ISNCA algorithm links together the subnetworks by predicting and updating the contribution of the common components at each iteration, and minimizes the error reconstruction of the entire network. We tested both approaches using a large number (>200) of network configurations using several independent systems (see â€˜Methodsâ€™). The advantage of predictupdate process became apparent from the analysis of both iterating and noniterating, overlapping subnetworks. Firstly, we studied small synthetic network, where the reconstructed profiles could be compared to the original profiles (see â€˜Methodsâ€™). The accuracy (Pearsonâ€™s correlation) of the ISNCA was significantly (p < 10^{âˆ’4}, twosamples ttest) higher than the one computed from independent networks (Additional file 1: Figure S2), for all the four TFs. Secondly, we analyzed two large biological networks (EGF and HRG systems), each subdivided to two subnetworks. We repeated the procedure with 100 different network structures for each system. For all large systems we tested, the mean of the reconstruction error was significantly (p < 10^{âˆ’10}, two sample ttest, n = 100) lower (about 25 % reduction) for ISNCA algorithm than independent networks approach (Additional file 1: Figure S3). Thirdly, we compared the correlation (Pearson) of the reconstructed profiles of the TF that were shared between the two subnetworks, to evaluate the consistency of the reconstruction. This analysis also confirmed that the TF profiles calculated by the ISNCA were more accurate than the ones calculated from independent networks (Additional file 1: Figure S4). We stress that similarly to other NCA methods, the quality of the TF profile reconstruction depends on the noise and quality of the expression data. Together, the analysis confirms that the link between subnetworks is necessary to obtain more accurate (in terms of low reconstruction error and consistent TF profiles) and feasible network reconstruction. The predict and update feature of ISNCA algorithm is comparable to nonlinear iterative partial least squares (NIPALS) algorithm used for PCA and PLSR modeling [30]. In the NIPALS, the score matrix (equivelant to our A matrix) is predicted and updated until it reaches a desired convergence criteria.
We did not focus here on the optimal division of the complete network into NCA compliant ones. We initially assumed a certain set of TGs are shared between the subnetworks. In this work, we divided the main network heuristically, with the only requirement that both sub networks are compliant, so that they can be solved by a standard NCA method. The number of common components, and their interconnections, will ultimately affect the solution. In practical terms, it is possible to apply the algorithms that choose the common components and predict the optimal configuration. One heuristic approach [13, 31] generated overlapping subnetworks which satisfy NCA compliancy criteria. This approach starts with a randomly chosen subnetwork composed of several TFs. If this subnetwork is not compliant, it removes a set of TFs that did not satisfy the NCA criteria, and substitute with the new set of TFs. This process is repeated until it finds an NCA compliant sub network. Another approach proposed [32] finds the best network structure, A, which satisfies the NCA conditions. Here, several NCA compliant network structures are generated in an intelligent manner, based on mixed integer, nonlinear programming optimization formulation. It then checks the reconstruction error of all generated networks and chooses the network with the minimum error. The Branch and Bound algorithms that are implemented to solve NPhard discrete optimization problems can also be used to identify the best network configurations [33]. It can do so by either minimizing or maximizing the number of common components (TGs, TFs, interactions), or focus on a search to minimize the error of the entire network. The former case does not require running ISNCA at every iteration, only to test the network configurations for NCA compliancy, and is a faster and easier problem to solve than the latter. Several modeling approaches are developed for network divisions, finding modularizations, with specific constraints based on Branch and Bound formulations. For instance, a branch and bound based approach to divide a given cellular network into several, smaller subnetworks or modules [34] or the partitioning the acyclic networks into disjoint subnetworks [35]. It is possible to combine these approaches with NCA criteria as constraints for finding the optimum NCA compliant subnetworks and ISNCA for the best feasible reconstruction. Additionally, we proposed here network configurations sharing only TGs, but it can be easily extended to include also TFs as common components. We provide solutions to this problem formulation in the Additional file 1: Supplementary information and Additional file 1: Figure S5. However, since the ISNCA converged in all the network configurations we tested, it was not necessary to include TFs as common subnetworks components.
The ISNCA can be further expanded to reconstruct extensively large network in a recursive manner. With the recursive approach, we divide the network into compliant or non compliant subnetworks. These can be further divided until each subnetwork is compliant (see Additional file 1: Figure S6). The algorithm works recursively at each iteration between the parents sub networks, the recursive ISNCA iterates between the children subnetworks until convergence to a local solution. This solution is sent to the next iteration in the parents sub networks. We can subdivide the entire network to 2n subnetworks with n is the number of generations, all the generations but the last may be incompliant. In this manner, we are able to find a local solution to any large network. Theoretically, n can be arbitrarily large, but the computation complexity is also increased exponentially. The iterative ISNCA is subjected to our future work.
All the networks we tested (>400) demonstrated rapid convergence. We found that the convergence was also dependent on the parameter Î´ the attenuation factor that weights the update of the common expression matrix in the next iteration (see Eq. 10 in â€˜Methodsâ€™). We tested different variations of Î´ and found (heuristically) that the convergence of the ISNCA was optimal when the algorithm applies the weight of Î´= 0.5 at the first update, followed by consequent updates of Î´= 1 (Fig. 5). We stress that Î´= 0 transforms the problem to the simple network division with no updates (independent networks), discussed above. Similarly to the known NIPALS algoritms, convergence cannot be proven in general [36], and is dependent on the topology and the network division. However, similarly to convergence of NIPALS [30], ISNCA was found to converged in parctice (it converged in the houndreds of simulations and network configurations we tested).
Conclusions
Taken together, we developed an iterative approach, which employs existing NCA algorithms to solve iteratively networks without reducing their size. The ISNCA is able to i) incorporate these known properties of redundancy and cooperative behavior of TFs, making the solution more biological plausible, and ii) prevent undesired elimination of potentially essential components, and iii) increase the size of the solution, incorporating more information into the network. We propose to apply our algorithm to study data obtained from any biological system.
Abbreviations
 ATF:

Activating transcription factor
 NIPALS:

Nonlinear iterative partial least squares
 DEG:

Differentially expressed genes
 PCA:

Principle component analysis
 EGF:

Epidermal growth factor
 PLSR:

Partial least squares regression
 GNCAr:

General NCA
 ROBNCA:

Robust NCA
 HRG:

Heregulin
 SVD:

Singular value decomposition
 ICA:

Independent component analysis
 TF:

Transcription Factor
 ISNCA:

Iterative subnetwork component analysis
 TG:

Target gene
 NCA:

Network component analysis
References
Liebermeister W. Linear modes of gene expression determined by independent component analysis. Bioinformatics. 2002; 18(1):51â€“60. http://bioinformatics.oxfordjournals.org/content/18/1/51.abstract.
Raychaudhuri S, Stuart JM, Altman RB, Altman R B. Principal Components Analysis To Summarize Microarray Experiments: Application To Sporulation Time Series. 2000:452â€“63.
Boulesteix AL, Strimmer K. Predicting transcription factor activities from combined analysis of microarray and ChIP data: a partial least squares approach. Theor Biol Med Model. 2005; 2:23.
Holter NS, Maritan A, Cieplak M, Fedoroff NV, Banavar JR. Dynamic modeling of gene expression data. Proc Nat Acad Sci. 2001; 98(4):1693â€“8. http://www.pnas.org/content/98/4/1693.abstract.
Kossenkov AV, Ochs MF. Matrix Factorization Methods Applied in Microarray Data Analysis. Int J Data Mining Bioinformatics. 2010; 4(1):72â€“90. http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2998896/.
Liao JC, Boscolo R, Yang YL, Tran LM, Sabatti C, Roychowdhury VP. Network component analysis: reconstruction of regulatory signals in biological systems. Proc Natl Acad Sci U S A. 2003; 100(26):15522â€“7.
Alon U. An Introduction to Systems Biology: Design Principles of Biological Circuits. Boca Raton: Chapman and Hall/CRC (Taylor and Francis group); 2007.
Ravasi T, Suzuki H, Cannistraci CV, Katayama S, Bajic VB, Tan K, et al.An Atlas of Combinatorial Transcriptional Regulation in Mouse and Man. Cell. 2010; 140(5):744â€“52. http://www.sciencedirect.com/science/article/pii/S0092867410000796.
Karin M, Liu Zg, Zandi E. AP1 function and regulation. Curr Opinion Cell Biol. 1997; 9(2):240â€“46. http://www.sciencedirect.com/science/article/pii/S0955067497800683.
Tran LM, Brynildsen MP, Kao KC, Suen JK, Liao JC. gNCA: a framework for determining transcription factor activity based on transcriptome: identifiability and numerical implementation. Metab Eng. 2005; 7(2):128â€“41.
Chang C, Ding Z, Hung YS, Fung PC. Fast network component analysis (FastNCA) for gene regulatory network reconstruction from microarray data. Bioinformatics. 2008; 24(11):1349â€“58.
Noor A, Ahmad A, Serpedin E, Nounou M, Nounou H. ROBNCA: robust network component analysis for recovering transcription factor activities. Bioinformatics. 2013; 29(19):2410â€“18. [doi:10.1093/bioinformatics/btt433].
Boscolo R, Sabatti C, Liao JC, Roychowdhury VP. A generalized framework for network component analysis. IEEE/ACM Trans Comput Biol Bioinform. 2005; 2(4):289â€“301.
Galbraith SJ, Tran LM, Liao JC. Transcriptome network component analysis with limited microarray data. Bioinformatics. 2006; 22(15):1886â€“94.
Wang C, Xuan J, Shih IM, Clarke R, Wang Y. Regulatory component analysis: A semiblind extraction approach to infer gene regulatory networks with imperfect biological knowledge. Signal Process. 2012; 92(8):1902â€“15.
Neil J. Noniterative convex optimization methods for network component analysis. IEEE/ACM Trans Comput Biol Bioinformatics. 2012; 9(5):1472â€“81.
Saeki Y, Endo T, Ide K, Nagashima T, Yumoto N, Toyoda T, et al.Ligandspecific sequential regulation of transcription factors for differentiation of MCF7 cells. BMC Genomics. 2009; 10(545):1â€“16.
Essaghir A, Toffalini F, Knoops L, Kallin A, Helden J, Demoulin JB. Transcription factor regulation can be accurately predicted from the presence of target gene signatures in micro array gene expression data. Nucleic Acids Res. 2010; 38(11):e120.
Bovolenta L, Acencio M, Lemke N. HTRIdb: an openaccess database for experimentally verified human transcriptional regulation interactions. BMC Genomics. 2012; 13(1):405.
Eden E, Navon R, Steinfeld I, Lipson D, Yakhini Z. GOrilla: a tool for discovery and visualization of enriched GO terms in ranked gene lists. BMC Bioinformatics. 2009; 10(1):48.
Nakshatri H, Badve S. FOXA1 in breast cancer. Expert Rev Mol Med. 2009; 11:nullâ€“null. M3 doi:10.1017/S1462399409001008.
Nakshatri H, Badve S. FOXA1 as a therapeutic target for breast cancer. Expert Opin Ther Targets. 2007; 11(4):507â€“14. http://dx.doi.org/10.1517/14728222.11.4.507.
Albergaria A, Paredes J, Sousa B, Milanezi F, Carneiro V, Bastos J, et al. Expression of FOXA1 and GATA3 in breast cancer: the prognostic significance in hormone receptornegative tumours. Breast Cancer Res. 2009; 11(3):R40â€“. http://breastcancerresearch.com/content/11/3/R40.
Song H, Ki SH, Kim SG, Moon A. Activating transcription factor 2 mediates matrix metalloproteinase2 transcriptional activation induced by p38 in breast epithelial cells. Cancer Res. 2006; 66(21):10487â€“96. http://cancerres.aacrjournals.org/content/66/21/10487.abstract.
Maekawa T, Sano Y, Shinagawa T, Rahman Z, Sakuma T, Nomura S, et al.ATF2 controls transcription of Maspin and GADD45[alpha] genes independently from p53 to suppress mammary tumors. Oncogene. 2007; 27(8):1045â€“54. http://dx.doi.org/10.1038/sj.onc.1210727.
Kim ES, Sohn YW, Moon A. TGFbetainduced transcriptional activation of MMP2 is mediated by activating transcription factor (ATF)2 in human breast epithelial cells. Cancer Lett. 2007; 252(1):147â€“56. http://www.cancerletters.info/article/S03043835(06)006902/abstract.
Yin X, DeWille JW, Hai T. A potential dichotomous role of ATF3, an adaptiveresponse gene, in cancer development. Oncogene. 2007; 27(15):2118â€“27. http://dx.doi.org/10.1038/sj.onc.1210861.
Yin X, Wolford CC, Chang YS, McConoughey SJ, Ramsey SA, Aderem A, et al.ATF3, an adaptiveresponse gene, enhances TGFbetainduced signaling and cancerinitiating cell features in breast cancer cells. J Cell Sci. 2010; 123(20):3558â€“65. http://jcs.biologists.org/content/123/20/3558.abstract.
Wolford CC, McConoughey SJ, Jalgaonkar SP, Leon M, Merchant AS, Dominick JL, et al.Transcription factor ATF3 links host adaptive response to breast cancer metastasis. J Clin Invest. 2013; 123(7):2893â€“906. http://www.jci.org/articles/view/64410.
Geladi P, Kowalski BR. Partial leastsquares regression: a tutorial. Anal Chimica Acta. 1986; 185(0):1â€“17. http://www.sciencedirect.com/science/article/pii/0003267086800289.
Yang YL, Suen J, Brynildsen MP, Galbraith SJ, Liao JC. Inferring yeast cell cycle regulators and interactions using transcription factor activities. BMC Genomics. 2005; 6(90):1â€“15.
Yang E, Maguire T, Yarmush ML, Androulakis IP. Informative gene selection and design of regulatory networks using integer optimization. Comput Chem Eng. 2008; 32(4â€“5):633â€“49.
Clausen J. Branch and Bound Algorithms â€“ Principles and Examples. University of Copenhagen. 1999.
Lang M, Summers S, Stelling J. Cutting the wires: modularization of cellular networks for experimental design. Biophys J. 2014; 106(1):321â€“31.
Nossack J, Pesch E. A branchandbound algorithm for the acyclic partitioning problem. Comput Oper Res. 2014; 41:174â€“84.
Henseler J. On the convergence of the partial least squares path modeling algorithm. 2010; 25(1):107â€“120. http://dx.doi.org/10.1007/s001800090164x.
Acknowledgements
Authors would like to thank P. Bouza and Professors H. Preisig, H. Martens and S. Skogestad for their discussions on this manuscript.
Author information
Authors and Affiliations
Corresponding author
Additional information
Competing interests
The authors declare that they have no competing interests.
Authorsâ€™ contributions
LA and NB conceived and developed the method and algorithm. NDJ and NB analyzed, tested and improved it. NB and NDJ wrote the manuscript. All authors read and approved the final version of the manuscript.
Additional file
Additional file 1
The supplementary files include an ISNCA formulation for the case where TFs are also shared between the sebnetworks. We also included Figures S1S6 that evaluate the performance of the ISNCA and add extended description. (PDF 394 kb)
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License(http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Jayavelu, N.D., Aasgaard, L.S. & Bar, N. Iterative subnetwork component analysis enables reconstruction of large scale genetic networks. BMC Bioinformatics 16, 366 (2015). https://doi.org/10.1186/s1285901507689
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s1285901507689