Volume 11 Supplement 11
Proceedings of the 21st International Conference on Genome Informatics (GIW2010)
A novel parametric approach to mine gene regulatory relationship from microarray datasets
 Wanlin Liu†^{1},
 Dong Li†^{1},
 Qijun Liu^{1, 2},
 Yunping Zhu^{1}Email author and
 Fuchu He^{1}Email author
DOI: 10.1186/1471210511S11S15
© Zhu et al; licensee BioMed Central Ltd. 2010
Published: 14 December 2010
Abstract
Background
Microarray has been widely used to measure the gene expression level on the genome scale in the current decade. Many algorithms have been developed to reconstruct gene regulatory networks based on microarray data. Unfortunately, most of these models and algorithms focus on global properties of the expression of genes in regulatory networks. And few of them are able to offer intuitive parameters. We wonder whether some simple but basic characteristics of microarray datasets can be found to identify the potential gene regulatory relationship.
Results
Based on expression correlation, expression level variation and vectors derived from microarray expression levels, we first introduced several novel parameters to measure the characters of regulating gene pairs. Subsequently, we used the naïve Bayesian network to integrate these features as well as the functional coannotation between transcription factors and their target genes. Then, based on the character of timedelay from the expression profile, we were able to predict the existence and direction of the regulatory relationship respectively.
Conclusions
Several novel parameters have been proposed and integrated to identify the regulatory relationship. This new model is proved to be of higher efficacy than that of individual features. It is believed that our parametric approach can serve as a fast approach for regulatory relationship mining.
Background
Gene regulation, a basic process of organisms, is important for systems biology research. Gene regulatory relationship mining can help identify the complicated regulatory networks, uncover the regulatory patterns in the cell, and expand the systematic view of biological processes.
In the past decade, as a novel highthroughput method, microarray has been widely used in genome wide research. Therefore, many algorithms have also been introduced in this field to construct gene regulatory networks based on microarray data.
A basic hypothesis among these approaches is that the variation of expression levels of transcription factors (TF) will affect expression levels of its target genes (TGs) through the regulatory relationships. In other words, the expression profiles of TF and its TGs are somewhat interrelated. Consequently, measuring the correlation of the expression profiles represented by microarrays, especially timeseries microarrays, has become a natural consideration.
Some of the previous work has contributed to this task [1]. According to the characters of the models, the algorithms can be broadly classified into several different categories.
First, clustering algorithms are basic and simple methods, based on the similarity of the expression levels of TF and its TGs [2, 3]. Meanwhile, some graph models are used, such as classical graphical Gaussian models [4], and the coexpression graphbased approach [5]. Secondly, a series of network models has been widely used, such as Boolean network [6–8], naïve Bayesian network [9], and dynamic Bayesian network [10–12]. These methods are considered to be the mainstream gene regulatory constructing methods. Besides, ODE (ordinary differential equation) [13], NDE (nonlinear differential equation) [14] and pDE (partial differential equations) have also been introduced, which can adjust the parameters of differential equations manually to the biological process. The mutual information method is gaining popularity [15, 16] and is used to measure the entropy of the whole system.
However, most of these complex models and algorithms focus on global properties, and some of them could not offer parameters. This leaves us wondering whether some simple but potentially probable basic characteristics could be uncovered.
As the first step, we tested the basic Pearson correlation coefficient, PCC [17] to outline the relationship between gene regulatory relationships and the expression level represented by microarray. Generally, the existence of regulatory relationship is more likely while the absolute value of the PCC is larger. There are, however, some exceptions (see Additional file 1). In order to handle this problem, several indicators or parameters should be introduced. During correlation analysis, we found it necessary to take into account the variation of expression levels. Therefore, we measured the variation characters using both expression level differences (ELD) and differences of average  standard deviation (ΔmeanΔδ) parameter groups. These parameters, when combined with PCC, can effectively improve the accuracy of prediction.
Moreover, we considered expression level vectors mapped on TFTG expression level space. Given the property of timeseries gene expression relationships, it is naturally assumed that different expression relationships might relate to the vectors in different quadrants. Therefore we calculated the sum vectors in different quadrants respectively and chose the representative vector as the main vector of one TFTG pair. Afterwards, the modulus  argument (xθ) parameter group was fixed.
Results
Construction of multiple features to mine regulatory relationships
Variation of the expression levels
To deal with exceptions during the measurement by Pearson correlation coefficient, other elements besides PCC should be considered. Taking into account the levels of expression strength, we supposed that the similarity may relate to the distribution range of the variation of expression levels. e.g., the smaller the variation range of the expression levels of tested TFTG is, the more the pair is likely to be strongly correlated (See Additional file 1).
To further study the statistic feature of expression levels, the standard deviation and their mean value were both used. The differences in the mean value and the standard deviation between TF and its TG were calculated, as in the case of another parameter group ΔmeanΔδ.
Modulus and angle of expression level vector by vector analysis
Vector analysis was used to compare gene expression responses between different experimental backgrounds[18] according to a simple principle. The change of the gene expression against the two experimental backgrounds is represented by a vector. Both up or down regulation and the regulatory intensity can be showed. The various sectors of the Cartesian plane will correspond to various prototypical behaviours of genes.
Here we considered an extension of vector analysis, mapping expression levels under each condition or time point in the coordination system of TFTG’s expression levels. Moreover, we attempted to infer the expression patterns such as correlation or inversion with or without timeshift.
GO coannotation score
If more than two leaves are derived from one divergent node, a weight score will be subtracted.
Timedelay character for predicting of the direction of regulatory relationships
During the transmission of expression perturbation from the TF to TG, a time delay might occur. Therefore, a suitable parameter that can describe the existence of the time delay might help to fix the direction of the regulatory relationship.
The point which has the extremum amplitude is a character that is to be considered. If the expression profile can be regarded as the continuous function of time, i.e., y = f(t), the extremum exists at y’ = Δy = dy/dt = 0. For discrete points such as the microarray timeseries expression profile, this problem can be solved by difference, with the popularization of the differentiation. Boundary condition Δy = 0 is often not available, so condition Δy(t)·Δy(t  1) < 0 might be more appropriate.
Multiple features can be used to predict regulatory relations
Variation of expression levels
As shown in Figure 5, a significant classification could be found. The green points represent the nonexisting pairs relatively concentrated in the region while the blue ones represent the existing regulatory pairs in this region. Judging by PCC alone without considering the parameter ELD would lead to more false negative judgments. Now the typical nonideal cases described in the Additional file 1 C and D might refuse to be classified falsely by PCC alone.
Furthermore, differences in standard deviation and mean value have a widely accepted definition and the accuracy of this parameter group is acceptable. So we also take the parameters Δmean  Δδ as indicators. Generally, Δmean – Δδ correspond to the likelihood ratio distribution represented by PCC. The accuracy of this parameter group is as high as that of ELD verified by J48 classification tree. However, it also has some cons, i.e. the parameters are pairwise, and these two parameters should be used at the same time.
Modulus and angle of expression levels vector by vector analysis
Given the properties of timeseries gene expression relationships, it is natural to calculate the sum vectors in different quadrants respectively before choosing the vector which has the largest modulus as the main representative vector of the TFTG pair. And the sumvector should be mapped on the modulusargument spaces (See Figure 6b).
Compared with nonquadrant vector analysis, quadrant vector analysis can yield a much more significant classification result. Additional file 2 shows the sum vectors of regulation pairs in the experiment group coloured by correlation of the expression level (PCC). In line with the meaning of PCC, generally, main vectors in different quadrants indeed represent the different expression patterns of regulatory pairs.
GO Coannotation
As we expected, the frequency of the nonexisting control cases declined while GO score increased. Meanwhile, the existing gene regulatory pairs showed increasing frequencies, suggesting that gene pairs with true regulatory relationship are more likely to emerge in the same biological process.
Likelihood ratio
The distribution of experimental and control groups according to the GO scores are perceptibly different, and it is easy to prove the existence of differences in distribution by statistical methods. However, the underlying meaning of the distribution has been left undisclosed. The positive likelihood ratio is a good option, for it could indicate the probability of the existence of the existing regulatory relationship and be used for integration of the parameters with Bayesian model.
As shown in Figure 8a, corresponding to the GO score, LR s are higher in those bars where GO scores are higher, indicating that the probability is higher when GO score is relatively high as the positive likelihood ratio represents the probability. This result corresponds with the analysis above.
The Bayesian model integrating multiple evidences is proved to be highly efficient
Joint likelihood ratio
Since some parameters we introduced are paired, the respective use of their LR s might be unreasonable. Therefore we can combine the paired parameters to calculate the joint likelihood ratio of the joint parameter groups.
Besides, calculating joint likelihood ratio of ELD and PCC can save the trouble of proving the linear independence of PCC and ELD. And this method can be used for parameters groups e.g. modulusangel groups naturally.
Integration by Bayesian model
Time delay character for the prediction of the direction of regulatory relationships
Different numbers of peaks are tested respectively. Both mean and median values have been considered.
Comparison with other methods
Typical existing methods include clustering algorithms, Bayesian networks, mutual information theory, as well as ordinary differential equations. Bansal et al. have compared these representative algorithms based on the simulated datasets [20].
Result of our approach on simulated dataset
Conditions  # of genes  # of samples  Direction  Se priority  Balanced  PPV priority  

PPV  Se  PPV  Se  PPV  Se  
10  100  u  0.85  0.92  0.87  0.87  0.91  0.26  
Global  d  0.41  0.89  0.51  0.78  0.71  0.63  
100  1000  u  0.20  0.95  0.30  0.28  0.99  0.05  
d  0.11  0.93  0.25  0.16  0.93  0.09  
10  10  u      0.58  0.85      
Local  d  0.57  0.93  0.78  0.78  0.94  0.63  
100  100  u  0.25  0.92  0.57  0.50  0.95  0.10  
d  0.18  0.89  0.33  0.60  0.97  0.10  
10  100  u  0.47  0.97  0.67  0.79  0.93  0.63  
Dynamic  d  0.49  0.96  0.64  0.67  0.92  0.48  
100  1000  u  0.21  0.90  0.42  0.23  0.55  0.10  
d  0.11  0.91  0.39  0.21  0.94  0.11 
Compared with algorithms tested by [20], our approach is better for dynamic (timeseries) expression datasets. As shown in Table 1, in all these cases, while PPVs of our approach are equal to or slightly greater than those of the methods in [20], the corresponding Sensitivities are greater. For steadystate expression of global perturbation, our result is comparable with methods in [20]. Besides, our method performed on smaller network is somewhat better than that on the larger one. And the undirected predictions are slightly better than the directed ones.
Discussion
In our research, besides the commonly used PCC, we proposed ELD, ΔmeanΔδ, and xθ as new parameters based on dynamic variation, as well as vector analysis. The parameter ELD represents the variation range character of the expression levels, and may prevent nonideal cases from false classification by PCC alone. Then, ΔmeanΔδ has an acceptable definition and accuracy. In vector analysis, we found that even the subvectors of a true regulatory gene pair with perfectly synchronous expression profiles are still distributed in different quadrants, and the sum vector might counteract partly. Vectors of one regulatory gene pair in different quadrants might represent different patterns. Therefore, we calculated the sum vectors in different quadrants respectively, and then chose the vector which has the largest modulus as the main representative vector of one TFTG pair. Compared with nonquadrant vector analysis, the difference of distributions of modulus and argument is significant.
Also, we analyzed the functional coannotation of transcription factors and their target genes, and then selected GO score as another parameter. As expected, the frequency of the nonexisting control cases declined while GO score increased. Meanwhile, the existing gene regulatory pairs showed increasing frequencies, suggesting that gene pairs with true regulatory relationship have better chance of emerging in the same biological process.
Subsequently, we considered the Bayesian model for the likelihood ratio integration. Then the result was fairly acceptable. In our cases, some parameters we introduced are paired. We therefore combined the parameters to calculate the joint likelihood ratio of the joint parameter groups. The joint likelihood ratios of paired parameters make the LR s seems reasonable and there is no need to prove the linear independence for the parameters.
Our approach is mainly based on several novel parameters, which could be intuitive indicators. We introduced these parameters to describe characters of microarray expression data of regulating gene pairs. These features include the variation of expression level, the divergence of statistical characters, and the consistent degree of representative measurements. Additionally, our approach is much less costly than some mainstream methods. Therefore, our approach can serve as a fast preprocess strategy for microarray data analysis.
Some papers argue that inferring regulatory relationship based on microarray has inherent faults. First, the similarity of the expression profile suggests nothing more than a statistical dependency between two genes, not a direct causal relationship. The verification of the relationship requires other evidences, such as ChIPchip data, Y2H or other wet experiments. Second, essential genes [21] which are always expressed in the cell cannot be disturbed by knockdown or knockout. Therefore microarray experiments do not work well on these essential genes. Third, microarray is a kind of highthroughput analysis technology after all, so it cannot be very precise. Genes with a slim expression level can hardly be detected accurately.
Recently a series of reports indicates that the microarray might be replaced by fast highthroughput sequencing [22, 23], which, however, cannot be made as inexpensive and efficient as microarray now. In the future, microarray might be used to meet more specific research needs, such as fast elementary filter or test. Therefore complex models might not be suited to the fast measurement of the microarray. Though our approach is more or less rough and far from perfect, we still believe some simple indicators based on uncomplicated characters would reveal complex behaviour.
Conclusions
With the rapid deposition of the microarray data in recent years, microarray data have become an increasingly important data source for bioinformatics research. On the basis of microarray data, constructing gene regulatory networks has also become a hotspot. By constructing the gene regulatory network, we can identify the complicated regulatory relationships, uncover the regulatory patterns in the cell, and gain the global view of the biological process. In this paper, we present some novel parameters to uncover potential characters of regulatory relationships. In addition to routine description of the similarity of the expression levels, our proposed parameters measured range of the variation and the statistic feature of expression levels, consistency of subvectors of the expression level, as well as functional coannotation of regulating pairs. Unlike other global expression profiles computational methods, our approach is mainly based on several novel parameters, which could be intuitive indicators. And our parametric approach can serve as a fast approach for regulatory relationship mining.
Materials and methods
Datasets
As a simple but important organism, yeast Saccharomyces cerevisiae is a proper target of research. First we set up an experiment group of regulatory element pairs with existing (true) regulatory relationship. These existing pairs were obtained from published literature [24, 25]. In addition, we constructed a control group for training dataset. The pairs in control group were randomly selected and known existing regulatory pairs had been excluded. During the research, we observed that the ratio of existing and nonexisting pairs in the training set would affect the result. The increases of negative data in training set induce a decrease of positive prediction value with the fixed sensitivity. It indicates that suitable ratio of positive and negative must be noticed. The result is shown in Additional file 4. Time series datasets derived from cell cycle experiments were downloaded from GEO dataset in PubMed. The GO annotations are retrieved by GOfact [26].For a proper comparison with other methods, artificial datasets is an appropriate choice. In silico data could control the noise levels of the data. Here, we used the datasets in [20], which was generated an artificial dataset by linear ODEs, with the mean of white noise 0 and standard deviation 0.1.
Bayesian model
Bayesian model has been widely used for integrating proofs [27, 28]. Likelihood ratio is the probability of observing an existing gene pair in predictive datasets divided by the probability of observing the nonexisting gene pair in predictive datasets [29]. Here prior odds are the chance of choosing a pair of regulatory genes from all candidate gene pairs.
f_{i} means the number of gene pairs in the dataset i. And the Bayesian method is considered
O_{ post } = O_{ prior } × LR(f_{1}…f_{ n })
In this formula, LR means likelihood ratio, and positive means gold standard positive dataset of gene pairs where real regulatory relationship exists. And negative is the gold standard negative dataset in which no gene pair has any regulation.
This is also known as the naïve Bayesian network.
Notes
List of abbreviations
 AUC:

area under the curve
 ELD:

expression level differences
 GEO:

Gene Expression Omnibus
 GO:

Gene Ontology
 LR:

likelihood ratio
 ODE:

ordinary differential equation
 PCC:

Pearson correlation coefficient
 PPV:

positive predictive value
 ROC:

receiver operating characteristic
 Se:

sensitivity
 TF:

transcriptional factor
 TG:

target gene
Declarations
Acknowledgements
This work was supported by the Chinese National Key Program of Basic Research [grant numbers 2006CB910803, 2010CB912700 and 2011CB910202], the National High Technology Research and Development Program of China [grant number 2006AA02A312], National Science and Technology Major Project [grant numbers 2008ZX10002016, 2009ZX09301002], National Natural Science Foundation of China [grant numbers 30621063, 30800200].
This article has been published as part of BMC Bioinformatics Volume 11 Supplement 11, 2010: Proceedings of the 21st International Conference on Genome Informatics (GIW2010). The full contents of the supplement are available online at http://www.biomedcentral.com/14712105/11?issue=S11.
Authors’ Affiliations
References
 Yu H, Luscombe NM, Qian J, Gerstein M: Genomic analysis of gene expression relationships in transcriptional regulatory networks. Trends Genet 2003, 19(8):422–427. 10.1016/S01689525(03)001756View ArticlePubMed
 Eisen MB, Spellman PT, Brown PO, Botstein D: Cluster analysis and display of genomewide expression patterns. Proc Natl Acad Sci U S A 1998, 95(25):14863–14868. 10.1073/pnas.95.25.14863PubMed CentralView ArticlePubMed
 Amato R, Ciaramella A, Deniskina N, Del Mondo C, di Bernardo D, Donalek C, Longo G, Mangano G, Miele G, Raiconi G, et al.: A multistep approach to time series analysis and gene expression clustering. Bioinformatics 2006, 22(5):589–596. 10.1093/bioinformatics/btk026View ArticlePubMed
 Toh H, Horimoto K: Inference of a genetic network by a combined approach of cluster analysis and graphical Gaussian modeling. Bioinformatics 2002, 18(2):287–297. 10.1093/bioinformatics/18.2.287View ArticlePubMed
 Yan X, Mehan MR, Huang Y, Waterman MS, Yu PS, Zhou XJ: A graphbased approach to systematically reconstruct human transcriptional regulatory modules. Bioinformatics 2007, 23(13):i577–586. 10.1093/bioinformatics/btm227View ArticlePubMed
 Huang S: Gene expression profiling, genetic networks, and cellular states: an integrating concept for tumorigenesis and drug discovery. J Mol Med 1999, 77(6):469–480. 10.1007/s001099900023View ArticlePubMed
 Martin S, Zhang Z, Martino A, Faulon JL: Boolean dynamics of genetic regulatory networks inferred from microarray time series data. Bioinformatics 2007, 23(7):866–874. 10.1093/bioinformatics/btm021View ArticlePubMed
 Kim H, Lee JK, Park T: Boolean networks using the chisquare test for inferring largescale gene regulatory networks. BMC Bioinformatics 2007, 8: 37. 10.1186/14712105837PubMed CentralView ArticlePubMed
 Friedman N, Linial M, Nachman I, Pe'er D: Using Bayesian networks to analyze expression data. J Comput Biol 2000, 7(3–4):601–620. 10.1089/106652700750050961View ArticlePubMed
 Pe'er D, Regev A, Elidan G, Friedman N: Inferring subnetworks from perturbed expression profiles. Bioinformatics 2001, 17(Suppl 1):S215–224.View ArticlePubMed
 Husmeier D: Sensitivity and specificity of inferring genetic regulatory interactions from microarray experiments with dynamic Bayesian networks. Bioinformatics 2003, 19(17):2271–2282. 10.1093/bioinformatics/btg313View ArticlePubMed
 Zou M, Conzen SD: A new dynamic Bayesian network (DBN) approach for identifying gene regulatory networks from time course microarray data. Bioinformatics 2005, 21(1):71–79. 10.1093/bioinformatics/bth463View ArticlePubMed
 Bansal M, Gatta GD, di Bernardo D: Inference of gene regulatory networks and compound mode of action from time course gene expression profiles. Bioinformatics 2006, 22(7):815–822. 10.1093/bioinformatics/btl003View ArticlePubMed
 Vu TT, Vohradsky J: Nonlinear differential equation model for quantification of transcriptional regulation applied to microarray data of Saccharomyces cerevisiae. Nucleic Acids Res 2007, 35(1):279–287. 10.1093/nar/gkl1001PubMed CentralView ArticlePubMed
 Margolin AA, Nemenman I, Basso K, Wiggins C, Stolovitzky G, Dalla Favera R, Califano A: ARACNE: an algorithm for the reconstruction of gene regulatory networks in a mammalian cellular context. BMC Bioinformatics 2006, 7(Suppl 1):S7. 10.1186/147121057S1S7PubMed CentralView ArticlePubMed
 Butte AJ, Kohane IS: Mutual information relevance networks: functional genomic clustering using pairwise entropy measurements. Pac Symp Biocomput 2000, 418–429.
 Steuer R, Kurths J, Daub CO, Weise J, Selbig J: The mutual information: detecting and evaluating dependencies between variables. Bioinformatics 2002, 18(Suppl 2):S231–240.View ArticlePubMed
 Breitling R, Armengaud P, Amtmann A: Vector analysis as a fast and easy method to compare gene expression responses between different experimental backgrounds. BMC Bioinformatics 2005, 6: 181. 10.1186/147121056181PubMed CentralView ArticlePubMed
 Baldi P, Brunak S, Chauvin Y, Andersen CA, Nielsen H: Assessing the accuracy of prediction algorithms for classification: an overview. Bioinformatics 2000, 16(5):412–424. 10.1093/bioinformatics/16.5.412View ArticlePubMed
 Bansal M, Belcastro V, AmbesiImpiombato A, di Bernardo D: How to infer gene networks from expression profiles. Mol Syst Biol 2007, 3: 78.PubMed CentralView ArticlePubMed
 Mnaimneh S, Davierwala AP, Haynes J, Moffat J, Peng WT, Zhang W, Yang X, Pootoolal J, Chua G, Lopez A, et al.: Exploration of essential gene functions via titratable promoter alleles. Cell 2004, 118(1):31–44. 10.1016/j.cell.2004.06.013View ArticlePubMed
 Ledford H: The death of microarrays? Nature 2008, 455(7215):847. 10.1038/455847aView ArticlePubMed
 Shendure J: The beginning of the end for microarrays? Nat Methods 2008, 5(7):585–587. 10.1038/nmeth0708585View ArticlePubMed
 Lee TI, Rinaldi NJ, Robert F, Odom DT, BarJoseph Z, Gerber GK, Hannett NM, Harbison CT, Thompson CM, Simon I, et al.: Transcriptional regulatory networks in Saccharomyces cerevisiae. Science 2002, 298(5594):799–804. 10.1126/science.1075090View ArticlePubMed
 Luscombe NM, Babu MM, Yu H, Snyder M, Teichmann SA, Gerstein M: Genomic analysis of regulatory network dynamics reveals large topological changes. Nature 2004, 431(7006):308–312. 10.1038/nature02782View ArticlePubMed
 Li D, Li J, Ouyang S, Wang J, Xu X, Zhu Y, He F: An Integrated Strategy for Functional Analysis in Largescale Proteomic Research by Gene Ontology. Progress in Biochemistry and Biophysics 2005, 32(11):1026–1029.
 Jansen R, Yu H, Greenbaum D, Kluger Y, Krogan NJ, Chung S, Emili A, Snyder M, Greenblatt JF, Gerstein M: A Bayesian networks approach for predicting proteinprotein interactions from genomic data. Science 2003, 302(5644):449–453. 10.1126/science.1087361View ArticlePubMed
 Rhodes DR, Tomlins SA, Varambally S, Mahavisno V, Barrette T, KalyanaSundaram S, Ghosh D, Pandey A, Chinnaiyan AM: Probabilistic model of the human proteinprotein interaction network. Nat Biotechnol 2005, 23(8):951–959. 10.1038/nbt1103View ArticlePubMed
 Eddy SR: What is Bayesian statistics? Nat Biotechnol 2004, 22(9):1177–1178. 10.1038/nbt09041177View ArticlePubMed
Copyright
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.