SLDR: a computational technique to identify novel genetic regulatory relationships
© Yue et al.; licensee BioMed Central Ltd. 2014
Published: 21 October 2014
We developed a new computational technique called S tep-L evel D ifferential R esponse (SLDR) to identify genetic regulatory relationships. Our technique takes advantages of functional genomics data for the same species under different perturbation conditions, therefore complementary to current popular computational techniques. It can particularly identify "rare" activation/inhibition relationship events that can be difficult to find in experimental results. In SLDR, we model each candidate target gene as being controlled by N binary-state regulators that lead to ≤2N observable states ("step-levels") for the target. We applied SLDR to the study of the GEO microarray data set GSE25644, which consists of 158 different mutant S. cerevisiae gene expressional profiles. For each target gene t, we first clustered ordered samples into various clusters, each approximating an observable step-level of t to screen out the "de-centric" target. Then, we ordered each gene x as a candidate regulator and aligned t to x for the purpose of examining the step-level correlations between low expression set of x (Ro) and high expression set of x (Rh) from the regulator x to t, by finding max f(t, x): |Ro-Rh| over all candidate × in the genome for each t. We therefore obtained activation and inhibitions events from different combinations of Ro and Rh. Furthermore, we developed criteria for filtering out less-confident regulators, estimated the number of regulators for each target t, and evaluated identified top-ranking regulator-target relationship. Our results can be cross-validated with the Yeast Fitness database. SLDR is also computationally efficient with o(N2) complexity. In summary, we believe SLDR can be applied to the mining of functional genomics big data for future network biology and network medicine applications.
Identifying novel gene regulatory relationship from large-scale functional genomics data has been a major theme for the characterization of complex biomolecular systems. Gene regulator identification can be identified from gene expression data using DNA microarrays. With tens of thousands of microarray experiments deposited into public databases for yeast, Drosophila, Arabidopsis, mice, and humans, one may reconstruct molecular interaction or regulation relationships from mining the data without conducting specific experiments to test whether a candidate regulator-target relationship exists. For example, James et al  examined temporal gene expression patterns during chondrogenic differentiation in a mouse micromass culture system. Then, they determined transcriptional regulation by observing the impact of changed expression of molecules onto changed gene functional categories. Lorenz et al  used microarray analysis and scale-free gene networks analysis to identify candidate regulators in drought-stressed roots of loblolly pine. Systematic approaches to reconstruct transcriptional modules and identify their perturbation conditions are under way . Albert et al  summarized recent findings that the disruption of regulatory relationships may lead to human diseases, therefore shedding new light on disease intervention on gene regulatory relationships instead of genes as possible drug targets--hence the new field of "network medicine". These examples show a surging interest among genome biologists to study gene regulatory relationships.
Traditional experimental gene regulator finding methods, e.g., those using gene knockouts, synthetic lethality, or chip-seq in eukaryotes, are too costly to serve as the primary platform with which scientists explore the large combinatorial space between all candidate pairs of genes [5–7]. To overcome the data coverage gap, many computational methods have been proposed, e.g., homologous gene regulator database search, clustering of gene expression profiles and transcription factor binding site pattern matching, physics-based modeling of candidate transcription factors and target binding relationships, and network based methods [4, 8]. For example, Ru-Fang Yeh et al  introduced an accurate and efficient technique that performs homologous gene regulator database search in higher eukaryotes to annotate gene regulators for the human genome. Stephane et al  developed a rigorous statistical test to establish a link between selection threshold of putatively regulators and the identified false positive genes in clusters of candidate gene targets derived from gene expression profiles. Gerhard et al  developed ANREP, a system that can identify exact pattern matches to motifs with spacing constraints and approximate matches recursively. Physics-based methods that characterize protein-ligand relationships, e.g., the MM-based methods, have also been proposed .
In this study, our aim is to develop a new computational method to identify genetic regulatory relationships that are difficult to uncover using previously reported techniques. This type of relationships differ from gene regulatory relationships in that genes in the former type may affect each other indirectly through other genes or molecular regulation mechanisms while genes in the latter type affect each other as direct, observable regulator-target relationships. Current databases often cover reasonably well highly-connected gene regulators, or "hubs" of gene regulations in the gene regulatory network ; however, for low-connectivity regulators, or "de-centric nodes" in the gene regulatory network, the coverage is often poor because the chance for randomly observing their activities is low. S tep-L evel D ifferential R esponse (SLDR) is a new computational method developed to identify de-centric genetic regulatory relationship candidates. The input of SLDR is the functional genomics data under permutated perturbation conditions. In SLDR, we specifically search for all qualifying target genes, each which is controlled by N binary-state regulators that lead to ≤2N observable expression levels--which we call "step-levels"--of the target gene. The output of SLDR is statistically significant findings candidate genetic regulatory relationships. We describe our study in detail next.
An overview of the framework
1. Preparation of functional genomics data
We used raw data GSE25644 from the Gene Expression Omnibus (GEO) database (http://www.ncbi.nlm.nih.gov/geo/) as the input. GSE25644 is a DNA microarray gene expression profile with all 158 viable protein kinase/phosphatase deletions in S. cerevisiae under a single growth condition . Each mutant was profiled four times, from two independent cultures on dual-channel microarrays using a batch of wild-type (WT) RNA as a common reference. The GSE25644 was normalized by averaging each of the two independent cultures' results on microarrays, and before our algorithm was applied, the wild-type groups were filtered off.
2. Selection of the de-centric targets
In order to find out the huge steps, we use average steps as a standard to filter. The average increasing value Δ of each gene was calculated and averaged values were sorted. The average increasing value Δ was calculated by the maximum value of the gene expression (we assume the target gene t) minus the minimum value and then divided by the number of samples n. Δt=(tmax-tmin)/n.
- 2)The average increasing value Δt of a target was regarded as a standard to seek steps which means the number of dj larger than Δt, which we defined as k value. For each target, gene expression was sorted from low to high and then we calculated the difference between adjacent samples, dj. If the difference dj is smaller than corresponding Δt, k will be retained. The formula of step k is shown below:
Iteration was performed for every target to find out each k of targets. k values were sorted from low to high and the largest kmax corresponds to target t'.
- 4)To avoid the situation that fake steps with small change causes high k' in individual, every target was normalized by the new average increasing value Δmax as a standard to seek for new step levels (k') of each target. The formulas of new average increasing value Δmax and step k' are shown below:
- 5)The binary state N was calculated, and N means the number of genetic regulators of each target. For instance, assuming that a target's binary state N is 2, this target would have less than 3 steps within 4 step-levels theoretically. The formula of N binary-state is shown below:
We can use the cluster k' to calculate the theoretical number of genetic regulators of the de-centric targets.
3. Identification of genetic regulatory relationship among genes
Where, cov is the covariance between potential regulator × and target t, ρ x is the standard deviation of x, µx is the mean of x, and E is the expectation.
Representation of Activation/Inhibition genetic relationship pairs identified and the threshold used between R0 and Rh cases.
To examine the step-level correlation between low step levels of the x (R0) and high step levels of the x (Rh) between the x and t, all genetic regulator candidates were ordered to each target using max f(t, x): |Ro-Rh|. Then we choose top 10 high |Ro-Rh| genetic regulatory pairs, since the number of nodes generating the genetic regulatory network should be balanced and moderate to present the de-centric targets. If the pairs are strict, they will not organize a network with enough pairs. If the pairs are relax, they will introduce noise of weak links in the networks.
4. Generation of the de-centric genetic regulatory network and testing of network significance
We generated the de-centric genetic regulatory network by the top 10 high |Ro-Rh| genetic regulator pairs and performed statistical data analysis tests to detect the significance of the connected network. Our hypothesis of this statistical evaluation is that if the prediction model indeed consists of de-centric targets involved in the same process even if complex and broad, then we should expect that the connectivity among the de-centric targets be lower than the connectivity among a set of randomly selected genes.
We defined the index of aggregation of a network  as the ratio of the size of the largest sub-network that exists in this network to the size of this network. Note that the size is calculated as the total number of genes within a given network/sub-network.
Randomly select from the pool of genetic regulators to targets, the same number of predicted targets generated from our method.
Retrieve the top 10 genetic regulators of each random target using |Ro-Rh| criteria.
Compute the index of the aggregation of superset.
Repeat steps 1 through 3 for 500 times to generate a distribution of the index of aggregation under random selection.
Compare the index of aggregation from our method with the distribution obtained in step 4 and calculate the p-value.
5. Significance testing of the de-centric genetic regulatory relationships
Our result is validated in the Yeast Fitness Database (http://chemogenomics.med.utoronto.ca/fitdb/fitdb.cgi). FitDB is a searchable database of quantitative chemical-genetic interactions based on data in Hillenmeyer . A gene search allows viewing of the compounds that are most sensitive to the gene specified in a heterozygous and homozygous yeast deletion strain, including a view of yeast deletion strains that behave similarly to the gene of interest. Compounds can also be searched to identify heterozygous or homozygous deletion strains exhibiting hypersensitivity to compound, including a view of compounds that behave similarly to the compound of interest.
A contingency table showing how to perform the hypergeometric test
n − k
6. Test de-centric genetic regulatory network robustness
In order to detect the robustness, we introduced the noise on the gene expression profiles of the 158 viable protein kinase/phosphatase deletions strains. The noise was designed as increasing 5%, 10%, 15%, 20%, 30%, 50%, 70% of noise by randomly shuffling the expression values of each sample.
7. Cluster the samples that significantly contribute to the de-centric target
The 158 viable protein kinase/phosphatase deletions' profiles was clustered by UPGMA (Unweighted Pair Group Method with Arithmetic Mean). The agglomerative clustering method UPGMA is one of the most popular methods for the classification of sampling units on the basis of their pairwise similarities in relevant descriptor variables. We used UPGMA algorithm to construct a rooted tree that reflects the structure in a pairwise samples' similarity matrix.
Then we find a minimal threshold in the hierarchical tree and pick a representative cluster. Delete it to see the effect of the cluster on finding of de-centric targets.
8. Analysis de-centric genetic regulatory network ontology
In order to explore the function of the de-centric regulatory networks, ClueGo , a Cytoscape plug-in was used. ClueGO performs single cluster analysis and comparison of clusters. From the ontology sources used, the terms are selected by different filter criteria. The related terms which share similar associated genes can be fused to reduce redundancy. The ClueGO network is created with kappa statistics and reflects the relationships between the terms based on the similarity of their associated genes. ClueGO charts are underlying the specificity and the common aspects of the biological role. The significance of the terms and groups is automatically calculated.
9. Locate the de-centric regulators and targets in cell
The sub cellular location of de-centric regulators and targets were retrieved in the Comprehensive Yeast Genome Database , which derived from experiments, (CYGD: http://mips.gsf.de/genre/proj/yeast/) and display by the tool, Cerebral, a cytoscape plug-in. Cerebral uses sub cellular localization attribute to create a layered view of a cell, placing nodes in the region of the screen corresponding to the appropriate localization. 
1. Determination of de-centric genetic regulated targets from the processed data
The 2N = k'+1 is the key point to control the theoretical genetic regulator number of targets. Especially the selection of binary-state N determines the theoretical number of genetic regulators directly. The reason we call N theoretical genetic regulator number is that in the formula 2N = k'+1, we cannot be certain about the number of N when several genetic regulator have same changing DNA expression level on their target giving merged step-levels. Assuming that a target t have n genetic regulators x1, x2 ... xn. And f(x1, x2 ... xn) stands for step levels due to the combined genetic regulatory effect on t. The states of × can be either 1 or 0 to indicate whether × is activated or not. The number of target's step-levels ft(x1, x2 ... xn) is greater than n: ft(x1, x2 ... xn) ≥ n + 1. For instance, if x1, x2, x3 have similar regulation ability on t, the f(1,0,0), f(0,1,0), f(0,0,1) will give a merged level. Similarly the f(0,1,1), f(1,0,1), f(1,1,0) will give another merged level. In this case, the step levels of t is 4 = n+1 that stands for four levels of f(0,0,0), f(1,0,0), f(1,1,0) and f(1,1,1). Hence, the N will be 2 meaning that our predicted targets of 2 genetic regulators will contain some targets of 3 genetic regulators in extreme case. And the mis-clustering will be more severe when applied for prediction of targets with more genetic regulators.
2. Identification of activation/inhibition genetic regulatory relationship details
Summary Statistics for validated de-centric targets and genetic regulatory relationship identified
A distribution of significant genetic regulatory relationship according to rank groups.
3. Construction of de-centric target-regulator network
4. Significance evaluation of de-centric genetic regulatory relationship identified
5. Robustness of the genetic regulatory network
Effect of introducing noise in the results on the network index of aggregation (showing p-value changes)
The result shows that the network has resistance of 5% noise. However, along the noise increase, a significant weakening appears on detecting the de-centric regulators and finding potential targets using SLDR. In the randomly generated networks by small number of targets, the network aggregation test tends to be bias reflected on the increased variance.
6. Decisive cluster of samples
Validated de-centric targets and genetic regulatory relationship
The empirical distribution of the index of aggregation was obtained after 500 random re-samplings. 93 runs out of 500 resulted in an index of aggregation value greater than 81.4% in de-centric genetic activation regulatory network and de-centric genetic inhibition regulatory network and 58 runs out of 500 resulted in an index of aggregation value greater than 88.4% in de-centric genetic activation regulatory network and de-centric genetic inhibition regulatory network. Therefore, the p-value of the index of aggregation in de-centric genetic activation regulatory network is 0.186 and the p-value of the index of aggregation in de-centric genetic inhibition regulatory network is 0.116.
A Comparison between all the samples and the All-5 Samples (5 deleted after hierarchical clustering) to show changes of p-values for the network's index of aggregation.
5 samples deletion
7. The biology process groups of de-centric genetic regulatory networks
The biology process group lists of de-centric genetic regulatory networks.
ascospore wall assembly
tRNA-type intron splice site recognition and cleavage
reciprocal meiotic recombination
histidine biosynthetic process
maltose catabolic process
glycerol metabolic process
regulation of arginine biosynthetic process
antisense RNA metabolic process
cellular zinc ion homeostasis
negative regulation of transposition, RNA-mediated
response to lipid
pyridoxine metabolic process
regulation of fatty acid metabolic process
piecemeal microautophagy of nucleus
regulation of protein stability
breciprocal meiotic recombination
glycogen metabolic process
response to carbohydrate stimulus
8. The sub cellular location of de-centric regulators and targets in cell
In this study, we discovered 112 significant new genetic regulatory targets with functional genomics data set derived from 158 viable protein kinase/phosphatase deletion experiments. For each new target, top 10 high-ranking suspected genetic regulatory pair candidates are also given. This genetic regulatory relationship of regulators to targets can include much more than the direct regulation and control, because genetic regulators could sit in the far upstream of its targets.
We developed SLDR to identify candidate genetic regulators particularly of none-hub gene targets. Unlike a whole-transcriptome profiling based correlation techniques, SLDR can efficiently detect signal levels as "clustered conditions", each of which correspond to a "step-level" of target gene perturbed/regulated by a set of combinatorial states of regulator genes. Because the correlation is performed at the "step levels", SLDR can therefore screen, identify, and rank potential regulation relationship of genetic regulators to targets. The prediction pairs can be validated reasonably well by data from the Yeast Fitness Database, although the overlap is not high due to incomplete coverage of the database. Note that our method of finding genetic regulatory relationship of regulators to targets relies on Pearson Correlation at the moment. Future extension of this work could explore additional correlation techniques such as using Goodman and Kruskal's gamma correlation coefficient or Spearman's rank correlation coefficient to compare their impact on prediction performance.
In summary, our study demonstrated a new direction to identify genetic regulatory relationships. If applied broadly, the technique could yield many new data worth biological investigations. Such relationships, although indirect, may help construct biological signaling network to overcome coverage issues facing network biology models today. Future directions to make SLDR an easy-to-use software package and develop databases to capture the results are under development. By applying SLDR to human gene expression perturbation data, we believe our framework may also be extended to provide complementary insights on human complex biological systems and disease network biology.
This work and publication was supported in part by research funding obtained by Jake Chen at Wenzhou Medical University from China and at Indiana University School of Informatics and Computing. We also thank the generous support of Capital Normal University, China to provide the initial resources that led to the conception of this work.
This article has been published as part of BMC Bioinformatics Volume 15 Supplement 11, 2014: Proceedings of the 11th Annual MCBIOS Conference. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/15/S11.
- James CG, Appleton CT, Ulici V, Underhill TM, Beier F: Microarray analyses of gene expression during chondrocyte differentiation identifies novel regulators of hypertrophy. Molecular biology of the cell. 2005, 16 (11): 5316-5333. 10.1091/mbc.E05-01-0084.PubMed CentralView ArticlePubMedGoogle Scholar
- Lorenz WW, Alba R, Yu YS, Bordeaux JM, Simoes M, Dean JF: Microarray analysis and scale-free gene networks identify candidate regulators in drought-stressed roots of loblolly pine (P. taeda L.). BMC genomics. 2011, 12: 264-10.1186/1471-2164-12-264.PubMed CentralView ArticlePubMedGoogle Scholar
- Wang W, Cherry JM, Botstein D, Li H: A systematic approach to reconstructing transcription networks in Saccharomycescerevisiae. Proceedings of the National Academy of Sciences of the United States of America. 2002, 99 (26): 16893-16898. 10.1073/pnas.252638199.PubMed CentralView ArticlePubMedGoogle Scholar
- Barabasi AL, Gulbahce N, Loscalzo J: Network medicine: a network-based approach to human disease. Nature reviews Genetics. 2011, 12 (1): 56-68. 10.1038/nrg2918.PubMed CentralView ArticlePubMedGoogle Scholar
- Snyder M, Gallagher JE: Systems biology from a yeast omics perspective. FEBS letters. 2009, 583 (24): 3895-3899. 10.1016/j.febslet.2009.11.011.PubMed CentralView ArticlePubMedGoogle Scholar
- Goodson HV, Anderson BL: Synthetic Lethality Screen Identifies a Novel Yeast. The Journal of Cell Biology. 1996, 133 (6): 1277-1291. 10.1083/jcb.133.6.1277.View ArticlePubMedGoogle Scholar
- Landt Stephen G, Marinov Georgi K, Anshul Kundaje, Pouya Kheradpour: ChIP-seq guidelines and practices of the ENCODE and modENCODE consortia. Cold Spring Harbor Laboratory Press. 2012, 1813-1831. 22Google Scholar
- Chen J, Shen C: MINING ALZHEIMER DISEASE RELEVANT PROTEINS FROM INTEGRATED PROTEIN INTERACTOME DATA. Pacific Symposium on Biocomputing. 2006, 11: 367-378.Google Scholar
- Yeh RF, Lim LP, Burge CB: Computational inference of homologous gene structures in the human genome. Genome research. 2001, 11 (5): 803-816. 10.1101/gr.175701.PubMed CentralView ArticlePubMedGoogle Scholar
- Audic S, Claverie J-M: The Significance of Digital Gene Expression Profiles. Genome research. 1997, 7: 986-995.PubMedGoogle Scholar
- Gerhard MGM: A system for pattern matching applications on biosequences. Comput Appl Biosci. 1993, 9 (3): 299-314.Google Scholar
- Huang N, Jacobson MP: Physics-based methods for studying protein-ligand interactions. Current Opinion in Drug Discovery & Development. 2007, 10 (3): 325-331.Google Scholar
- Wan P, Yue Z, Xie Z, Gao Q, Yu M, Yang Z, Huang J: Mechanisms of Radiation Resistance in Deinococcus radiodurans R1 Revealed by the Reconstruction of Gene Regulatory Network Using Bayesian Network Approach. J Proteomics Bioinform. 2013, 6 (7):Google Scholar
- van Wageningen S, Kemmeren P, Lijnzaad P, Margaritis T, Benschop JJ, de Castro IJ, van Leenen D, Groot Koerkamp MJ, Ko CW, Miles AJ: Functional overlap and regulatory links shape genetic interactions between signaling pathways. Cell. 2010, 143 (6): 991-1004. 10.1016/j.cell.2010.11.021.PubMed CentralView ArticlePubMedGoogle Scholar
- Hillenmeyer ME, Fung E, Wildenhain J, Pierce SE, Hoon S, Lee W, Proctor M, St Onge RP, Tyers M, Koller D: The chemical genomic portrait of yeast: uncovering a phenotype for all genes. Science. 2008, 320 (5874): 362-365. 10.1126/science.1150021.PubMed CentralView ArticlePubMedGoogle Scholar
- Bindea G, Mlecnik B, Hackl H, Charoentong P, Tosolini M, Kirilovsky A, Fridman W-H, Pagès F, Trajanoski Z, Galon J: ClueGO: a Cytoscape plug-in to decipher functionally grouped gene ontology and pathway annotation networks. Bioinformatics. 2009, 25 (8): 1091-1093. 10.1093/bioinformatics/btp101.PubMed CentralView ArticlePubMedGoogle Scholar
- ldener UG, tter MMn, ller GK, Strack N, Helden Jv, Lemer C, Richelles J, Wodak SJ, García-Martínez J, Peérez-Ortín3 JE: CYGD: the Comprehensive Yeast Genome Database. Nucleic acids research. 2005, 33 (Database): 10-1093.Google Scholar
- Barsky A, Gardy JL, Hancock REW, Munzner T: Cerebral: a Cytoscape plugin for layout of and interaction with biological networks using subcellular localization annotation. Bioinformatics. 2007, 23 (8): 1040-1042. 10.1093/bioinformatics/btm057.View ArticlePubMedGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.