 Methodology article
 Open Access
 Published:
Bottomup GGM algorithm for constructing multilayered hierarchical gene regulatory networks that govern biological pathways or processes
BMC Bioinformatics volume 17, Article number: 132 (2016)
Abstract
Background
Multilayered hierarchical gene regulatory networks (MLhGRNs) are very important for understanding genetics regulation of biological pathways. However, there are currently no computational algorithms available for directly building MLhGRNs that regulate biological pathways.
Results
A bottomup graphic Gaussian model (GGM) algorithm was developed for constructing MLhGRN operating above a biological pathway using small to mediumsized microarray or RNAseq data sets. The algorithm first placed genes of a pathway at the bottom layer and began to construct a MLhGRN by evaluating all combined triple genes: two pathway genes and one regulatory gene. The algorithm retained all triple genes where a regulatory gene significantly interfered two paired pathway genes. The regulatory genes with highest interference frequency were kept as the second layer and the number kept is based on an optimization function. Thereafter, the algorithm was used recursively to build a MLhGRN in layerbylayer fashion until the defined number of layers was obtained or terminated automatically.
Conclusions
We validated the algorithm and demonstrated its high efficiency in constructing MLhGRNs governing biological pathways. The algorithm is instrumental for biologists to learn the hierarchical regulators associated with a given biological pathway from even smallsized microarray or RNAseq data sets.
Background
Present knowledge indicates that genes in genomes operate in multilayered hierarchical gene regulatory networks (MLhGRNs) to control biological processes and pathways [1–6]. A typical hGRN contains a few high hierarchical regulators, some middlelevel regulators and many terminal/structural genes at bottom layer. Studies have shown high hierarchical regulators seem to be global modulators that respond to various cellular signals [7, 8] and environmental cues [3, 9]. The middlelevel genes play the managerlike roles, through which the commands from high hierarchical regulators at upper layers are synthesized and then passed down to terminal genes at bottom layer for execution [3, 10]. In general, the high hierarchical regulators at the top levels have more pleiotropic effect while terminal or structural genes have more specific functions. Genes involved in various metabolic or canonical pathways are regulated by MLhGRNs [11, 12]. To understand how pathways are regulated, we should develop methods for constructing MLhGRNs that operate above biological pathways and processes. This kind of MLhGRNs can provide a hierarchy in addition to connectivity of regulators, which are essential in understanding wired complex regulation on metabolic or canonical pathways through multiple chainsofcommand [13].
Despite the critical importance of MLhGRNs, there is a lack of computational algorithms for directly building MLhGRNs from gene expression data. Reverse engineering of MLhGRN governing a biological pathway or process remains challenging, and the algorithms that are well suited for building MLhGRNs have not yet been established. Although several algorithms have been developed for reverseengineering GRNs, they are not specifically tailored for constructing the MLhGRNs that mimic the hierarchical regulation [10]. Currently, the majority of gene expression data in public repositories are static data, namely, nontimecourse data or time course data with large time intervals that vary from a few hours to even several days [14]. These types of data often miss some regulatory events and interactions between two adjacent time intervals, nullifying dynamic methods that include differential equation [15], finite state [16], dynamic bayesian network [17], control logic [18], boolean network [19], and stochastic networks [20]. In general, the methods that are of more useful to static data include various static methods that comprise GGM [21], mutual information basedRN [22], Algorithm for the Reconstruction of Accurate Cellular Networks (ARACNE) [23], Context Likelihood of Relatedness (CLR) [24], C3NET [25], MI3 [26], and probabilistic basedBayesian network [27]. Bayesian network, a probabilistic graphic model that represents a set of variables and their conditional dependencies via a directed acyclic graph, has been used to infer the “optimal” structure of a set of genes. However, due to astronomical number of possible structures and the computational complexity, approximate inference methods, such as Gibbs sampling, Metropolis–Hastings algorithms and other variant methods, instead of the original probability method of Bayesian network, were used to approximate the inference of the possible structures of GRNs [27], and these methods are capable of producing a local interacting dependence variables containing causality relationships. ARACNE is the other widely used algorithm for building graphic dependency network using mutual informationbased method. It can identify both linear and nonlinear dependence relationships between genes while eliminating potential indirect gene–gene interactions through implementing data processing inequality (DPI). Another informationtheorybased algorithm is the Context Likelihood of Relatedness (CLR), which scores all possible pairwise interactions based on mutual information and compared that to an interaction specific background distribution. Although computationally less complicated, most of the methods that evaluate pairwise relationships among genes can easily lead to spurious relationships when the number of RNAseq or microarray data sets is small. Finally, GGM uses the partial correlation as a measure of dependency and interaction between any pair of genes by removing effect of third one [28], allowing to distinguish direct from indirect interactions [4]. Recently, GGM with a limitedorder partial correlation function, which estimates correlations conditional on one or two, but not all other genes, and has been used to infer gene networks from Arabidopsis [29] and yeast [30] transcript profiles. However, as aforementioned, these methods are not specifically designed and tuned for inferring MLhGRNs.
Although mathematical models are critically important in capturing the causality relationships for reconstructing gene networks, what is also very important is the biological regulatory models, which, when integrated into mathematical models, can empower efficiency of the network construction algorithms substantially. Biological regulatory model, in this study, refers to the defined regulatory structure to which the input genes can be functionally fitted in and then evaluated as a building block during network construction process. The integration of a biologically valid regulatory model into an algorithm demands seamless design that can enhance the recognition of the causal regulatory relationships of input genes. In this study, we developed a novel algorithm, named as bottomup GGM algorithm, specifically for reverse engineering of MLhGRNs that govern a biological pathway or process through integration of GGM algorithm with an authentic biological regulatory model. The input files for bottomup GGM algorithm include: (1) the transcriptomic profiles of differentially expressed genes (DEGs) involved in a known metabolic pathway, or a canonical pathway defined by a gene ontology (2) the transcriptomic profiles of differentially expressed transcription factors (TFs) or all TFs under experimental condition. We evaluated bottomup GGM algorithm for several pathways or biological processes and found it in general performs well for constructing MLhGRNs. We believe the algorithm can meet the great needs for constructing MLhGRNs using small to mediumsized gene expression data sets, and the MLhGRNs built will be instrumental for us to understand the hierarchical regulation of many biological processes and pathways.
Results
Selection of genes for bottomlayer and top regulatory layers
We applied our bottomup GGM algorithm to multiple pathways for reverseengineering MLhGRNs, each is considered to govern a given pathway or biological process. For each pathway, there were two files: one contained the gene expression profiles of pathway genes of interest, and the other file contains the expression profiles of all TF genes. The genes involved in a pathway are generally nonregulatory genes,, which can be obtained from existing annotation of metabolic pathways. Certainly, genes involved in a biological process, for example, as defined by a gene ontology that is enriched in differentially expressed genes, can be treated as a canonical pathway, and used to replace the pathway genes.
Reverseengineering of MLhGRN governing lignocellulosic pathway
The plant lignocellulosic pathway controls the biosynthesis of wide variety of secondary metabolic compounds including cellulose and lignin [31]. In addition to their roles in the structure and protection of the plants, cellulose and lignin have important roles in the structural integrity of plant cell walls, and the stiffness and strength of stems [32, 33].
The inputs for our bottomup GGM algorithm include the profiles of 25 pathway genes, and 1622 TFs extracted from the 128 pooled Arabidopsis microarray data sets under shortday condition that is known to induce secondary wall biosynthesis. We constructed a MLhGRN using these 25 pathway genes used as bottom layer and these TFs as candidates for top layers. The construction of MLhGRN was a dynamic process in our pipeline software with the parameters predetermined by users. These parameters include the number of layers, significant levels of p values for correlation and partial correlation and their differences, and a percentage of genes to be kept for each layer above the bottom one. The pipeline first built the second layer immediately above the bottom (or first) layer, and then used the second layer as the bottom layer, and repeated the above procedure to obtain the third layer and so on. For this MLhGRN (Fig. 1a), we obtained 14, 16, and 18 TFs for second, third and fourth layer respectively above the bottom layer. The detailed results that include correlations and pvalues for all the interfering TFs are given in (Additional file 1: Table S1). There are 14 TFs in the second layer (Fig. 1), out of which 10 TFs (GATA12, SND1, 2, and 3, MYB85, NST1 and 2, MYB103, MYB46 and MYB58) were positive TFs known to regulate lignocellulosic biosynthesis [34]. NAC domain proteins: NST1 [35], NST2 [36], and SND1 (also called NST3) [35, 37] are key regulators of secondary wall biosynthesis. NST1, NST2 and NST3 are key regulators involved in wall thickenings in various tissues when expressed ectopically [36, 38]. The expressions of SND2, SND3, MYB103, and MYB46 are regulated by SND1 and all are developmentally associated with cells undergoing secondary wall thickening [39–41]. The SND2 regulates genes involved in secondary cell wall development in Arabidopsis fibres, and increases fibre cell area in Eucalyptus [42]. MYB46 regulates the biosynthesis pathways of cellulose, xylan, and lignin [40] and GATA12 [43] controls xylem vessel differentiation. In the third layer, there are 6 positive TFs (MYB43, MYB92, MYB61, MYB63, MYB86, GRF3). The MYB63 is known to be involved in the activation of lignin biosynthetic pathway during secondary wall formation in Arabidopsis. The TF MYB61 controls stomatal aperture in Arabidopsis [44] and is required for mucilage deposition and extrusion in the seed coat [45]. MYB84 regulates the accumulation of the UVprotectant compound sinapoylmalate by repressing the transcription of the gene encoding the lignocellulosic enzyme cinnamate 4hydroxylase [46], and the MYB43 regulates the thickening of secondary wall of cells [47]. In fourth layer, there are 18 TFs. Most appear to be the high hierarchical regulators. Some of them are known to be responsive to various environmental cues and intercellular cues, for example, SLR [48] and HB53 [49] are auxininducible whereas ERF38 [50] and OBP3 [51] are responsive to ethylene and salicylic acid, respectively. In this layer, the SHP1, also referred as to be SHATTERPROOF1, is known to control the differentiation of the dehiscence zone where it promotes the lignification of adjacent cells [52] while HB53 boosts vascular development in meristem [53].
Although we have indicated there are currently no computational algorithms for directly building MLhGRNs from gene expression data, ARANCE can take the same inputs as our bottomup GGM algorithm and obtain TFs that have dependency on pathway genes. These TFs can serve as rough controls and allow us to obtain some idea of the performance of our bottomup GGM algorithm. We input these expression profiles of 25 pathway genes and 1622 TFs to ARANCE, and obtained the TFs that have dependency with at least two pathway genes with mutual information > 0.25 (Additional file 1: Table S2). Of the 40 TFs obtained, there are 13 positive TFs (Fig. 1B). Ten of them are common with the MLhGRN built with bottomup GGM algorithm. Although bottomup GGM failed to capture three TFs, VND4, MYB52 and XND1, which were recognized by ARANCE, it identified eight more positive TFs that included GATA12, MYB86, MYB43, MYB61, GRF3, MYB92, SHP1, and HB53.
Construction of MLhGRN that controls human embryonic pluripotency renewal
There are three master transcription factors, NANOG, POU5F1, and SOX2, which are known to govern the pluripotentcy renewal in human embryonic stem cells [54]. Early studies have identified some target genes that can be bound by the above three transcription factors using ChIPseq experiments [54, 55]. We assume that these target genes belong to a canonical pathway that plays key roles in pluripotency renewal. We would like to test if we could infer these three master transcription factors by building a onelayered hGRN using our bottomup GGM algorithm. The 189 microarray data sets [56, 57] for human stem cells was collected from 17 experiments in which hES cells were treated with various differentiation reagents. Therefore, these datasets include states involved in many regulatory events underpinning pluripotency, such as ES maintenance, exiting the pluripotent state, and differentiation. We used 19 target genes as bottomlayer, all TFs in human as inputs, and then used our bottomup GGM algorithm to build one regulatory layer above these 19 pathway genes. The network we obtained is shown in Fig. 2a, with 25 top genes shown in second layer. All above three transcription factors were shown up in top 25 TFs captured (2λ was used). When the same inputs were used for ARANCE, and the network obtained is shown in Fig. 2b. We also kept top 25 TFs based on the mutual information (MI) on the second layer but none of above three TFs was present in these 25 genes. We searched the rankings of above three TFs in the ARANCE output sorted by MI, and found SOX2 and NANOG ranked at 68 and 154, respectively.
Implementation of the bottomup GGM to synthetic yeast data
We generated a set of synthetic compendium gene expression data sets using the SynTReN software [58] and the regulatory network models based upon yeast experimental data as original seeds. There are 200 genes each with 100 expression values. From these 200 genes, we randomly selected 48 nonregulatory genes, which are presumably the pathway genes, and were used as bottom layer. The remaining genes are regulatory genes that contained 26 positive TFs. The results are shown in Additional file 1: Table S5. We constructed only one layer in order to make a better comparison with the result generated with ARCNE method. The list of top 50 TFs yielded from bottomup GGM algorithm contains 25 the positive TFs, whereas, the top 50 TFs yielded from ARANCE contain 15 positive TFs, indicating the evaluation of triplet genes for causal relationships may have some advantages over pairwise evaluation used by ARANCE.
Performance comparison between the networks constructed by bottomup GGM and ARACNE
We ran bottomup GGM and ARANCE algorithm with the 25 wood formation genes as bottom layer, and five groups of 322 TFs as candidates for regulatory layer. Based on the number of positive TFs that were builtin regulatory layer, we calculated the sensitivities and specificities for eight different numbers of TF cutoffs, namely, the number of TFs retained in the second layer. For ARCNE, we just kept those TFs that were directly dependent on wood formation genes and counted their frequencies. After the same TF cutoffs as bottomup GGM algorithm were applied, we calculated the sensitivities and specificities. For both bottomup GGM and ARACNE, we plotted ROC curves (sensitivity vs (1specificity)) (Fig. 3a and b). The dashed curves shown in Fig. 3 (a, and b) correspond to the five groups of TF inputs and the solid curve is the average of the five curves. The cohesion of dashed curves suggests that the performance of our algorithms was persistent and did not change much with different groups of TFs. We also calculated the Fscores from the averaged curves and plotted against the different TF cutoffs for both bottomup GGM algorithm and ARANCE (Fig. 3c). Higher Fscore represents better performance. To see how the performance of the bottomup GGM algorithm changes with the size of the data set (number of genes in the input), we calculated and plotted the True Positive Rates (TPR) against different number of TFs in the input for same TF cutoffs (Fig. 3d). We used 22 TF cutoff to find the TPR in Fig. 3d as there are 22 positive TFs in the input file. It is obvious that the bottomup GGM has significantly higher TPR than ARACNE.
Discussion
Like GGM, we have used the first order partial correlation but adopted a biological and mathematical model integrated approach to infer MLhGRNs. Our approach is based on a biological theory that when a TF exerted its control over a pair of genes, the correlation between these two genes will be either enhanced or impaired. The change in correlation is represented by a significant difference between the correlation of the pair of genes (in the presence of a regulatory TF, z), namely, r _{ xy }, and the correlation of the same pair of genes at the absence of this TF, namely r _{ xy z }. We implemented a multivariate delta method [59] to test the significance of difference d = r _{ xy } − r _{ xy z } in four different circumstances where the null hypothesis of zero difference. The higher accuracy of method can be at least partially ascribed to the integration of biological triplet gene regulatory model, which played significantly roles in the following aspects: (1) gene (variable) reduction. By fitting differentially expressed regulatory genes and a group of pathway genes into the model based on their annotation, the dimensionality of gene space can be significantly reduced. (2) noise reduction. By filtering out the irrelevant genes that do not fit the biological model, we significantly reduced the noise, enabling the true regulatory relationships to be easily emerged. (3) reduction in gene dimensionality and explicitly defined roles of input genes in turn empowered mathematical modeling for capturing true regulatory relationships. In addition to the integration of biological regulatory model that enhanced the efficiency of the approach, the association of two paired genes under same regulatory mechanism was achieved by Spearman method, a nonparametric method that measures the strength of association between two ranked variables, which was demonstrated to be extremely wellsuited for associating functionally relevant genes as compared to seven other gene association methods [57]. Spearman correlation method makes no assumption about the distribution of the data, and Spearman rho measures the strength of linear/nonlinear monotonic relationship between two variables [57]. In addition, the bottomup GGM algorithm performs exhaustive combinations of genes, providing numerous evaluation opportunities for positive genes to be eventually emerged from large number of candidates. Finally, we integrated weighted sparse canonical correlation analysis method (WSCCA) to determine the number of genes being kept for each layer, which in general could produce ideal number of regulators for each layer. To identify the TFs with higher interference frequency to pathway genes, one can double the λ value.
The efficiency of triplet gene model utilized in this study was experimentally validated in our earlier study [4]. In that study, we used a probability model and triple gene model combined approach to build a SND1mediated twolayered hGRN in a topdown fashion with 12 RNAseq libraries as input. Up to 97 % regulatory relationships in the built network were successfully proven by ChIPPCR using SND1 antibodies and again verified by RTPCR in stable transgenic lines where these targets were activated by overexpressed SND1 [4], indicating the efficiency of triplet gene regulatory model. In addition, the early version of this bottomup GGM algorithm was implemented to nine RNAseq libraries generated from nine independent poplar overexpression transgenic lines of microRNA397a [5]. microRNA397a targets mRNAs species of different laccase genes whose proteins catalyze polymerization reactions of S, G, and H monomers during lignin biosynthesis [5]. We constructed a threelayered hGRNs with 14 differentially expressed laccase genes as bottomlayer. The algorithm successfully identified microRNA397a from 1208 regulatory genes and constructed it into the secondary layer, where it directly regulated 12 laccase genes at the bottom layer. We chose seven laccase genes to validate using 5’RACE, and five of them were proven to be downregulated in the microRNA397a transgenic lines. In the network built, several transcription factors that were known to govern lignin biosynthesis were recognized by our triple gene model and built into the regulatory layers. These experimental results implicate that the networks built from our bottomup GGM algorithm are highly accurate and trustworthy.
Finally, the bottomup GGM algorithm we developed can be used to obtain a hierarchy of TFs that are coordinated to pathway genes in expression profiles at the bottom layer. We believe these hierarchical TFs contain significantly enriched positive genes that govern the underlying pathway either directly or indirectly, allowing biologists to initiate the experimental validation. Our method is compliant with the present knowledge that when the gene expression profile of a transcription factor and the profile of target gene are correlated, it is more likely that the target genes are authentic targets [60, 61]. We employed robust Spearmanrank correlation that has been proven to be efficiently in associating loosely and functionally coordinated genes as shown earlier [57] to augment the recognition of real hierarchical TFs that collectively control underlying pathway genes. The computational validation with test data sets has shown the positive regulatory genes can be significantly enriched in the built MLhGRNs. We believe the algorithm is instrumental for constructing the hGRNs that govern pathways or biological processes.
Conclusions
A bottomup graphic Gaussian model (GGM) algorithm was developed for constructing a multilayered hierarchical gene regulatory network that operates above a given metabolic or canonical pathway using small to mediumsized microarray or RNAseq data sets. The algorithm was validated with both synthetic and real gene expression data sets, leading to the networks that were dominated with significantly enriched known positive regulatory genes in most of the cases. We believe the algorithm is in particular instrumental for biologists to identify the hierarchical regulators associated with a given biological pathway of interest for experimental validation.
Methods
Arabidopsis and human microarray data sets
The Arabidopsis gene expression data used in this study were downloaded from public repository. The wood formation compendium data set contains the 128 microarrays pooled from six experiments, which have the accession identifiers of GSE607, GSE6153, GSE18985, GSE2000, GSE24781, and GSE5633, in NCBI Gene Expression Omnibus(GEO) (http://www.ncbi.nlm.nih.gov/geo/ ). These data sets were obtained from hypocotyledonous stems under shortday that can induce secondary wood formation [62]. The salt stress compendium microarray data set contains108 microarrays from 6 experiments with the accession identifiers of GSE5620, GSE6153, GSE24781, GSE5633, GSE6151, GSE18985) in the NCBI GEO database. This salt compendium data were used in our earlier research [56, 63]. All data sets mentioned above were derived from hybridization of Affymetrix 25 k ATH1 microarrays. The original CEL files were downloaded and processed by the robust multiarray analysis (RMA) algorithm using the Bioconductor package. For quality control we used the methods that were previously described [64]. The 189 human microarray data sets were introduced in detail in our previous publication [56].
Biological model for reverseengingeering MLhGRNs
Based on the fact that genes with similar patterns of transcriptomic expression are likely to be regulated via same mechanism [65–67], we proposed that when a regulatory gene in layer i +1 of a hGRN controls a pair of genes in layer i, where i ≥ 0, the presence of this regulatory gene either significantly enhances or impairs the correlation of the paired genes at one level below, we consider this regulatory gene interfere with the paired genes. This model was integrated into the bottomup GGM algorithm to evaluate the building blocks of combined triple genes, namely, a regulatory gene in layer i +1 and a pair of genes in layer i, during the construction of MLhGRN using GGM.
Biological model based graphic Gaussian model (GGM) for construction of MLhGRNs
We developed a bottomup GGM algorithm that contains multistep mathematical procedure to construct MLhGRN operating above a biological process or pathway. The algorithm integrated the GGM with the biological model as described earlier. Initially, the pathway genes were placed at bottom layer, namely layer i =1, while regulatory genes like TFs were used as candidate genes for layer i +1. The algorithm evaluated each combination of triplet genes, namely one regulatory gene, and two pathway genes, to determine if the regulatory gene significantly interfered with the two pathway genes. The interference of regulatory gene on two pathway genes could enhance or impair the coordination of two pathway genes. The significance of interference was tested by examining if the significant difference existed between the correlation of two pathway genes in the presence of the regulatory gene, namely, r _{ xy }, and the correlation of the same pair of genes at the absence of this TF, namely r _{ xy z }. We implemented a multivariate delta method [59] to test the significance of difference d = r _{ xy } − r _{ xy z } in four different circumstances where the null hypothesis of zero difference. The four circumstances shown in Fig. 4 are: (1) The correlation is significant but the partial correlation is not significant, indicating that the presence of the TF make the paired pathway genes more coordinated. Therefore, the TF interfered with the pathway genes pair. (2) Both the correlations are significant. The pathway genes in the pair are correlated before and after removing the effect of the transcription factor. To find if TF has significant effect on the relation of the two pathway genes, we need to determine if the difference between the correlations and partial correlation are significant. (3) The correlation is not significant but partial correlation is significant, which implies the TF is interfering. (4) Both correlation and partial correlation are not significant. In this case, we discarded the triplet and moved to the next triplet genes. Both correlation and partial correlation were calculated using Spearman Rank Correlation method. Spearman method is one of the most effective method to identify functionally associated genes using microarray data [57]. The pseudo code for the bottomup GGMalgorithm is shown below.
Pseudo code for regulatory model based statistical model

1
Input: (G, R) [G → BPP genes, R → TFs], where BPP represents biological pathways or processes

2
For each pair (x, y), [(x, y) ∈ G]

Calculate pairwise correlation of (x, y), r _{ xy } = cor(x, y), and partial correlation r _{ xyz } = pcor(x, yz) (partial correlation of x, y given z∈ Z).


3
Perform the significance test for the correlation and the partial correlation and find the corresponding pvalues.

4
Find the difference of the correlations and perform the significance test:

Difference of correlations: \( D={r}_{xy}\frac{r_{xy}{r}_{zx}{r}_{zy}}{\sqrt{\left(1{r}_{zx}^2\right)\left(1{r}_{zy}^2\right)}} \)

Standard error of D ≜ D _{ se }

Find the test statistics, \( {z}_1=\frac{D}{D_{se}} \), which follows a standard normal distribution approximately.

The difference of correlation, D, is a function of three correlations, r _{ xy }, r _{ yz } and r _{ xz }, and denote D = f(r _{ xy }, r _{ yz }, r _{ xz }). Let
be the covariance matrix of (r _{ xy }, r _{ yz }, r _{ xz }).
Let (pd _{1}, pd _{2}, pd _{3}) denote the partial derivatives \( \left(\frac{\partial D}{\partial {r}_{xy}},\ \frac{\partial D}{\partial {r}_{zy}},\frac{\partial D}{\partial {r}_{xz}}\right) \) of D with respect to (r _{ xy }, r _{ yz }, r _{ xz }), then according the multivariate Delta method [59], the variance of D, is approximated by
The standard error is taken as the square root of var(D). That is, \( {D}_{se}=\sqrt{var(D)} \).
The formulae to calculate the variance and covariance among the correlations were based on asymptotic theory [68].
The partial derivatives of the difference are
Bottomup GGM algorithm for reconstruction of MLhGRNs controlling a biological pathway or biological process
Based on the regulatory model and mathematical model as described, we developed the following integrated procedure to reconstruct MLhGRNs using two inputs: the profiles of pathway genes, and the profiles of TFs. These pathway genes and TFs are in general from differentially expressed gene sets under the experimental condition in which profiles were obtained.

1.
Initially, the algorithm set a group of pathway genes to be the bottom layer, and built regulatory layer immediately above the bottom layer.

2.
Using the aforementioned GGM method as shown in Pseudo code for regulatory model based statistical model, we evaluated the relationships between each paired pathway gene pair from the bottom layer with a TF from TF pool. When a TF interfered (either enhanced or impaired) the coordination of two paired genes with significant pvalues, keep them. Discard this TF otherwise. Continue until all TFs are evaluated.

3.
When a given pathway gene pair against all TF are completed, pick up a new pair of genes from bottom layer. Repeat Step 2 and 3 until all pair of bottomlayer genes against all TFs are completed.

4.
Correct the interfering pvalues for multiple testing and keep the triple genes in which the TFs and two paired bottomlayer genes have corrected pvalues less than the given significance level.

5.
Count the total number of paired genes each TF interferes with.

6.
Sort the TFs by the number of significantly interfered genes from the largest to smallest and determine the number of TFs of interest to be retained in current layer using the method given below.

7.
Remove the TFs kept in current layer from the TF input file, and then set the current layer as bottom layer, and then repeat step 16 with the remaining TFs to obtain a new layer.

8.
We repeat the above steps from 27 until the desired number of layers is obtained.
This framework enhances the discovery of regulatory layers operating over a pathway or a biological process by recursively evaluating and identifying candidate regulatory genes with strongest interference on layer in a layerbylayer fashion.
Determination of TF interference on paired target genes
In the step 3 of abovementioned procedure, we need to determine how TF interferes two paired genes. Let p _{1} and p _{2} be the pvalues of the significance tests of correlation and partial correlation respectively. If r _{ xy } and r _{ xyz }, are not significant i.e., the pvalues p _{1} and p _{2} are greater than the significance level, discard (x, y, z). If r _{ xy } is significant and r _{ xyz } is not significant i.e., p _{1} is less and p _{2} is greater than the given significance level then z interferes with x and y. If both, r _{ xy } and r _{ xyz } are significant then test the significance of the difference between r _{ xy } and r _{ xyz }. Let p _{3} to be the pvalue of the test. If there is significance difference between the correlations i.e., if p _{3} is less than the prespecified significance level, then z interferes with x and y. There is no interference of z otherwise. If p _{1} is greater but p _{2} is less than the given significance level, then, z interferes with x and y.
Determination of number of TFs to be kept in each layer
In order to determine how many TFs should be included in each layer, we designed a weighted sparse canonical correlation analysis method (WSCCA), which is similar to Witten’s sparse canonical correlation analysis method (SCCA) [69]. For the construction of first layer network, let X _{ n × p } be the pathway gene expression matrix, where n denotes the number of samples and p denotes the number of pathway genes. Let Y _{ n × q } be the expression matrix of TFs we get in step 6 of above given algorithm, where q denotes the number of TFs. The matrices, X and Y are centered and scaled. Canonical correlation analysis (CCA), developed by Hetelling [70], involves finding vectors u and v that maximize cor(Xu, Yv), that is
There is a closedform solution for canonical vectors u and v. However, u and v are not sparse, and these vectors are not unique if p or q exceeds n. SCCA use l _{1} penalization for highdimensional problems to get sparse u and v. So the optimization problem now is
In our setting, as the number of pathway genes is not big, there is no need to impose l _{1} penalization on vector u. We used the number of interference of each TF as weight to penalize vector v, our optimization model is
Where the i ^{th} element of weight vector W is maximal number of interference minus the number of interference of TF _{ i }. Then the above model becomes a biconvex problem. We designed the following algorithm to solve it.

1.
Initialize v to have l _{2} norm 1.

2.
Iterate until convergence:

a.
Fix v, solve u ← ar gmx _{ u } u ^{T} X ^{T} Yv subject to  u  ^{2}_{2} = 1.

b.
Fix u, solve v ← ar gmx _{ v } u ^{T} X ^{T} Yv + λ  W ^{T} v _{1} subject to  v  ^{2}_{2} = 1.

a.
The tuning parameter λ is determined by crossvalidation. We keep the TFs selected by this algorithm in the current layer.
Testing efficiency of the methods
We performed sensitivity, specificity and Fscore analyses to assess the efficiency and quality of our algorithm. First, we built a test data set with the 129 microarray data sets, and 108 microarray data sets we used earlier [63]. The 25 pathway genes we used for initiating the MLhGRN construction was the same as those used for building MLhGRN governing lignocellulosic pathway and to construct the upper layers, 1500 TFs including 22 true positive TFs that are known to govern lignocellulosic pathway were used. We classified the 1500 TFs into five groups for testing the efficiency of the bottomup GGM algorithm. Each group contained 278 randomly selected TFs from 1500 TFs. We then added 22 positive TFs to each group. These 300 TFs, together with the 25 wood formation genes, were used as inputs for building one layer of hGRN. We built one layer of hGRN and calculated the sensitivity specificity and Fscore using the formulae as shown below. ARACNE was used for comparison.
Where,
TP, FP, TN, and FN are true positive, false positive, true negative, and false negative respectively.
References
 1.
Jin Y, Guo H, Meng Y. A hierarchical gene regulatory network for adaptive multirobot pattern formation. IEEE Trans Syst Man Cybern B Cybern. 2012;42(3):805–16.
 2.
Wei H et al. Genetic networks involved in poplar root response to low nitrogen. Plant Signal Behav. 2013;8(11):e27211.
 3.
Wei H et al. Nitrogen deprivation promotes Populus root growth through global transcriptome reprogramming and activation of hierarchical genetic networks. New Phytol. 2013;200(2):483–97.
 4.
Lin YC et al. SND1 transcription factordirected quantitative functional hierarchical genetic regulatory network in wood formation in Populus trichocarpa. Plant Cell. 2013;25(11):4324–41.
 5.
Lu S et al. PtrmiR397a is a negative regulator of laccase genes affecting lignin content in Populus trichocarpa. Proc Natl Acad Sci U S A. 2013;110(26):10848–53.
 6.
Ma HW et al. An extended transcriptional regulatory network of Escherichia coli and analysis of its hierarchical structure and network motifs. Nucleic Acids Res. 2004;32(22):6643–9.
 7.
Erwin DH, Davidson EH. The evolution of hierarchical gene regulatory networks. Nat Rev Genet. 2009; 10(2):141–8.
 8.
MartínezAntonio A. Escherichia coli transcriptional regulatory network. Network Biology 1. 2011;1(1):21–33.
 9.
Balazsi G, Barabasi AL, Oltvai ZN. Topological units of environmental signal processing in the transcriptional regulatory network of Escherichia coli. Proc Natl Acad Sci U S A. 2005;102(22):7841–6.
 10.
Yu H, Gerstein M. Genomic analysis of the hierarchical structure of regulatory networks. Proc Natl Acad Sci U S A. 2006;103(40):14724–31.
 11.
Blum A. Drought resistance  is it really a complex trait? Functional & Plant Biology. 2011;38:753–7.
 12.
Hasegawa PM et al. Plant Cellular and Molecular Responses to High Salinity. Annu Rev Plant Physiol Plant Mol Biol. 2000;51:463–99.
 13.
Bhardwaj N, Kim PM, Gerstein MB. Rewiring of transcriptional regulatory networks: hierarchy, rather than connectivity, better reflects the importance of regulators. Sci Signal. 2010;3(146):ra79.
 14.
Yang C and Wei H. Designing Microarray and RNAseq Experiments for Greater Systems Biology Discovery in Modern Plant Genomics. Mol Plant 2014.
 15.
Chen T, He HL, and Church GM. Modeling gene expression with differential equations. Pac Symp Biocomput. 1999: p. 2940.
 16.
Ruklisa D, Brazma A, Viksna J. Reconstruction of gene regulatory networks under the finite state linear model. Genome Inform. 2005;16(2):225–36.
 17.
Dojer N et al. Applying dynamic Bayesian networks to perturbed gene expression data. BMC bioinformatics. 2006;7:249.
 18.
Louis M, Becskei A. Binary and graded response in gene networks. Sci STKE. 2002;43:PE33.
 19.
Kauffman S. Homeostasis and differentiation in random genetic control networks. Nature. 1969;224(5215):177–8.
 20.
Chen BS et al. Robust model matching design methodology for a stochastic synthetic gene network. Math Biosci. 2011;230(1):23–36.
 21.
Schafer J, Strimmer K. An empirical Bayes approach to inferring largescale gene association networks. Bioinformatics. 2005;21(6):754–64.
 22.
Butte A, Kohane I. Mutual information relevance networks: Functional genomic clustering using pairwise entropy measurements. Proc Pac Symp Biocomput. 2000;5:415–26.
 23.
Margolin AA et al. ARACNE: an algorithm for the reconstruction of gene regulatory networks in a mammalian cellular context. BMC Bioinformatics. 2006;7 Suppl 1:S7.
 24.
Faith JJ et al. Largescale mapping and validation of Escherichia coli transcriptional regulation from a compendium of expression profiles. PLoS Biol. 2007;5(1):e8.
 25.
Altay G, EmmertStreib F. Inferring the conservative causal core of gene regulatory networks. BMC Syst Biol. 2010;4:132.
 26.
Luo W, Hankenson KD, Woolf PJ. Learning transcriptional regulatory networks from high throughput gene expression data using continuous threeway mutual information. BMC bioinformatics. 2008;9:467.
 27.
Friedman N et al. Using bayesian networks to analyze expression Data. J Comput Biol. 2000;7(3/4):601–20.
 28.
Whittaker J. Graphical models in applied multivariate statistics. New York: John Wiley; 1990.
 29.
Magwene PM, Kim J. Estimating genomic coexpression networks using firstorder conditional independence. Genome Biol. 2004;5(12):R100.
 30.
Wille A et al. Sparse graphical Gaussian modeling of the isoprenoid gene network in Arabidopsis thaliana. Genome Biol. 2004;5(11):R92.
 31.
Dixon RA, Paiva NL. Stressinduced phenylpropanoid metabolism. Plant Cell. 1995;7(7):1085.
 32.
Chabannes M, Ruel K, Yoshinaga A, Chabbert B, Jauneau A, Joseleau JP, Boudet AM. In situ analysis of lignins in transgenic tobacco reveals a differential impact of individual transformations on the spatial patterns of lignin deposition at the cellular and subcellular levels. The Plant journal for cell and molecular biology. 2001;28:271–282.
 33.
Donaldson LA. Lignification and lignin topochemistry  an ultrastructural view. Phytochemistry. 2001;57:859–873.
 34.
Zhong R et al. A battery of transcription factors involved in the regulation of secondary cell wall biosynthesis in Arabidopsis. Plant Cell. 2008;20(10):2763–82.
 35.
Mitsuda N, OhmeTakagi M. NAC transcription factors NST1 and NST3 regulate pod shattering in a partially redundant manner by promoting secondary wall formation after the establishment of tissue identity. Plant J. 2008;56(5):768–78.
 36.
Mitsuda N et al. The NAC transcription factors NST1 and NST2 of Arabidopsis regulate secondary wall thickenings and are required for anther dehiscence. Plant Cell. 2005;17(11):2993–3006.
 37.
Kim WC et al. Transcription factor MYB46 is an obligate component of the transcriptional regulatory complex for functional expression of secondary wallassociated cellulose synthases in Arabidopsis thaliana. J Plant Physiol. 2013;170(15):1374–8.
 38.
Mitsuda N et al. NAC transcription factors, NST1 and NST3, are key regulators of the formation of secondary walls in woody tissues of Arabidopsis. Plant Cell. 2007;19(1):270–80.
 39.
Zhong R, Demura T, Ye ZH. SND1, a NAC domain transcription factor, is a key regulator of secondary wall synthesis in fibers of Arabidopsis. The Plant Cell Online. 2006;18(11):3158–70.
 40.
Zhong R, Richardson EA, Ye ZH. The MYB46 transcription factor is a direct target of SND1 and regulates secondary wall biosynthesis in Arabidopsis. The Plant Cell Online. 2007;19(9):2776–92.
 41.
Zhong R, Richardson EA, Ye ZH. Two NAC domain transcription factors, SND1 and NST1, function redundantly in regulation of secondary wall synthesis in fibers of Arabidopsis. Planta. 2007;225(6):1603–11.
 42.
Hussey SG, et al. SND2, a NAC transcription factor gene, regulates genes involved in secondary cell wall development in Arabidopsis fibres and increases fibre cell area in Eucalyptus. BMC Plant Biology. 11(1): p. 173.
 43.
Endo H et al. Multiple Classes of Transcription Factors Regulate the Expression of VASCULARRELATED NACDOMAIN7, a Master Switch of Xylem Vessel Differentiation. Plant Cell Physiol. 2015;56(2):242–54.
 44.
Liang Y.K, et al. AtMYB61, an R2R3MYB transcription factor controlling stomatal aperture in Arabidopsis thaliana. Curr Biol. 2005;15(13):1201–6.
 45.
Penfield S et al. MYB61 is required for mucilage deposition and extrusion in the Arabidopsis seed coat. The Plant Cell Online. 2001;13(12):2777–91.
 46.
Hemm MR, Herrmann KM, Chapple C. AtMYB4: a transcription factor general in the battle against UV. Trends Plant Sci. 2001;6(4):135–6.
 47.
Wang HZ, Dixon RA. Onoff switches for secondary cell wall biosynthesis. Mol Plant. 2012;5(2):297–303.
 48.
De Smet I. Multimodular auxin response controls lateral root development in Arabidopsis. Plant Signal Behav. 2010;5(5):580–2.
 49.
Son O et al. Induction of a homeodomainleucine zipper gene by auxin is inhibited by cytokinin in Arabidopsis roots. Biochem Biophys Res Commun. 2005;326(1):203–9.
 50.
Lasserre E et al. AtERF38 (At2g35700), an AP2/ERF family transcription factor gene from Arabidopsis thaliana, is expressed in specific cell types of roots, stems and seeds that undergo suberization. Plant Physiol Biochem. 2008;46(12):1051–61.
 51.
Kang HG, Singh KB. Characterization of salicylic acidresponsive, arabidopsis Dof domain proteins: overexpression of OBP3 leads to growth defects. Plant J. 2000;21(4):329–39.
 52.
Colombo M et al. A new role for the SHATTERPROOF genes during Arabidopsis gynoecium development. Dev Biol. 2010;337(2):294–302.
 53.
Baima S et al. The arabidopsis ATHB8 HDzip protein acts as a differentiationpromoting transcription factor of the vascular meristems. Plant Physiol. 2001;126(2):643–55.
 54.
Boyer LA et al. Core transcriptional regulatory circuitry in human embryonic stem cells. Cell. 2005;122(6):947–56.
 55.
Sharov AA et al. Identification of Pou5f1, Sox2, and Nanog downstream target genes with statistical confidence by applying a novel algorithm to time course microarray and genomewide chromatin immunoprecipitation data. BMC Genomics. 2008;9:269.
 56.
Nie J et al. TFCluster: a pipeline for identifying functionally coordinated transcription factors via network decomposition of the shared coexpression connectivity matrix (SCCM). BMC Syst Biol. 2011;5:53.
 57.
Kumari S et al. Evaluation of gene association methods for coexpression network construction and biological knowledge discovery. PLoS One. 2012;7(11):e50411.
 58.
Van den Bulcke T et al. SynTReN: a generator of synthetic gene expression data for design and analysis of structure learning algorithms. BMC Bioinformatics. 2006;7:43.
 59.
MacKinnon DP et al. A comparison of methods to test mediation and other intervening variable effects. Psychol Methods. 2002;7(1):83.
 60.
Kim H, Hu W, Kluger Y. Unraveling condition specific gene transcriptional regulatory networks in Saccharomyces cerevisiae. BMC Bioinformatics. 2006;7:165.
 61.
Karczewski KJ et al. Coherent functional modules improve transcription factor target identification, cooperativity prediction, and disease association. PLoS Genet. 2014;10(2):e1004122.
 62.
Chaffey N et al. Secondary xylem development in Arabidopsis: a model for wood formation. Physiol Plant. 2002;114(4):594–600.
 63.
Cui X et al. TFfinder: a software package for identifying transcription factors involved in biological processes using microarray data and existing knowledge base. BMC Bioinformatics. 2010;11:425.
 64.
Persson S et al. Identification of genes required for cellulose synthesis by regression analysis of public microarray data sets. Proc Natl Acad Sci U S A. 2005;102(24):8633–8.
 65.
Clements M et al. Integration of known transcription factor binding site information and gene expression data to advance from coexpression to coregulation. Genomics Proteomics Bioinformatics. 2007;5(2):86–101.
 66.
Yeung KY, Medvedovic M, Bumgarner RE. From coexpression to coregulation: how many microarray experiments do we need? Genome Biol. 2004;5(7):R48.
 67.
Allocco DJ, Kohane IS, Butte AJ. Quantifying the relationship between coexpression, coregulation and gene function. BMC Bioinformatics. 2004;5:18.
 68.
Olkin I, Siotani M. Asymptotic distribution of functions of a correlation matrix. In: Ikeda S, editor. Essays in Probability and Statistics Chapter 16, vol. MR0603847. Tokyo: Shinko Tsusho; 1976. p. 235–51.
 69.
Witten DM, Tibshirani R, Hastie T. A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics. 2009;10(3):515–34.
 70.
Hotelling H. Relations between two sets of variates. Biometrika. 1936;28:321–77.
Author information
Affiliations
Corresponding author
Additional information
Competing interests
The authors declare that they have no competing interests.
Authors’ contribution
SK and HW developed bottomup GGM algorithm, SK wrote the methods, WD developed WSCCA, CG performed the data analysis and produced the networks, VC and XD and HM were involved in data collection, preparation and analysis, HSC helped SK in algorithm development and participated data analysis, and HW developed the project and wrote the manuscript. All authors read and approved the final manuscript.
Authors’ information
SK (Research Assistant Professor), WD (Ph.D. candidate), CG (Ph.D. candidate) and HW (Associate Professor), School of Forest Resources and Environmental Science, Michigan Technological University, Houghton, MI 49931, USA. VC (Professor) and XD (Postdoctoral researcher), Forest Biotechnology Group, Department of Forestry and Environmental Resources, North Carolina State University, Raleigh, NC 27695. HSC (Mathematical Statistician), Statistical Methodology and Applications Branch, Division of Cancer Control and Population Sciences, National Cancer Institute, National Institutes of Health, Maryland, United States of America. Rockville, MD 20850, HM (Research Molecular Biologist), NCCWA, USDA ARS, Kearneysville, WV 25430
Additional file
Additional file 1: Table S1.
The output of MLhGRN operating over lignocellulosic pathway built with the bottomup GGM algorithm. Table S2. The output of regulatory layer over lignocellulosic pathway identified by ARACNE algorithm. Table S3. The MLhGRN operating over human pluripotency renewal pathway built with the bottomup GGM algorithm. Table S4. The output of regulatory layer over human stem cell renewal pathway identified by ARACNE algorithm. Table S5. Comparison of bottomup GGM algorithm with ARANCE using yeast synthetic data. (XLSX 232 kb)
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Kumari, S., Deng, W., Gunasekara, C. et al. Bottomup GGM algorithm for constructing multilayered hierarchical gene regulatory networks that govern biological pathways or processes. BMC Bioinformatics 17, 132 (2016). https://doi.org/10.1186/s1285901609811
Received:
Accepted:
Published:
Keywords
 Multilayered gene regulatory network
 Pathway
 Microarray or RNAseq data