The identification of informative genes from multiple datasets with increasing complexity
 S Yahya Anvar^{1, 2},
 Peter AC 't Hoen^{2} and
 Allan Tucker^{1}
https://doi.org/10.1186/1471-2105-11-32
© Anvar et al; licensee BioMed Central Ltd. 2010
Received: 11 August 2009
Accepted: 15 January 2010
Published: 15 January 2010
Abstract
Background
In microarray data analysis, factors such as data quality, biological variation, and the increasingly multilayered nature of more complex biological systems complicate the modelling of regulatory networks that can represent and capture the interactions among genes. We believe that the use of multiple datasets derived from related biological systems leads to more robust models. Therefore, we developed a novel framework for modelling regulatory networks that involves training and evaluation on independent datasets. Our approach includes the following steps: (1) ordering the datasets based on their level of noise and informativeness; (2) selecting a Bayesian classifier with an appropriate level of complexity by evaluating predictive performance on independent datasets; (3) comparing the different gene selections and the influence of increasing the model complexity; (4) functional analysis of the informative genes.
Results
In this paper, we identify the most appropriate model complexity using cross-validation and independent test set validation for predicting gene expression in three published datasets related to myogenesis and muscle differentiation. Furthermore, we demonstrate that models trained on simpler datasets can be used to identify interactions among genes and select the most informative ones. We also show that these models can explain the myogenesis-related genes (genes of interest) significantly better than others (P < 0.004) since the improvement in their rankings is much more pronounced. Finally, after further evaluating our results on synthetic datasets, we show that our approach outperforms a concordance method by Lai et al. in identifying informative genes from multiple datasets with increasing complexity whilst additionally modelling the interaction between genes.
Conclusions
We show that Bayesian networks derived from simpler controlled systems have better performance than those trained on datasets from more complex biological systems. Further, we show that highly predictive and consistent genes, from the pool of differentially expressed genes, across independent datasets are more likely to be fundamentally involved in the biological process under study. We conclude that networks trained on simpler controlled systems, such as in vitro experiments, can be used to model and capture interactions among genes in more complex datasets, such as in vivo experiments, where these interactions would otherwise be concealed by a multitude of other ongoing events.
Background
High-throughput gene expression profiling experiments have increased our understanding of the regulation of biological processes at the transcriptional level. In bacteria [1] and lower eukaryotes, such as yeast [2], modeling of regulatory interactions between large numbers of proteins in the form of regulatory networks has been successful. A regulatory network represents relationships between genes and describes how the expression level, or activity, of genes can affect the expression of other genes. The network includes causal relationships, where the protein product of a gene (e.g. a transcription factor) directly regulates the expression of another gene, but also more indirect relationships. Modeling has been less successful for more complex biological systems such as mammalian tissues, where models of regulatory networks usually contain many spurious correlations. This is partly attributable to the increasingly multilayered nature of transcriptional control in higher eukaryotes, e.g. involving epigenetic mechanisms and non-coding RNAs. However, a potentially major reason for the decreased performance is the biological complexity of the datasets, defined here as the increase in biological variation and the presence of different cell types, which is not compensated for by an increase in the number of replicate data points available for modeling. There is an urgent need to identify regulatory mechanisms with more confidence to avoid wasting laborious and expensive wet-lab follow-up experiments on false positive predictions.
The main premises of this paper are that regulatory interactions that are consistently found across multiple datasets are more likely to be fundamentally involved in the process under study, and that these regulatory interactions are easier to find in datasets with less biological variation. Regulatory networks trained on less complex biological systems could thus be used for the modeling of more complex biological systems. We do this using a novel computational technique that combines Bayesian network learning with independent test set validation (using error and variance measures) and a ranking statistic. Whilst Bayesian networks and Bayesian classifiers have been used with great success in bioinformatics [3, 4], an important weakness has been that, when trying to build models that reveal genuine underlying biological processes, a highly accurate predictive model is not always enough [5]. The ability to generalize to other datasets is of greater importance [6]. Simple cross-validation approaches on a single dataset will not necessarily result in a model that reflects the underlying biology and therefore will not generalize well. Our approach is to exploit multiple datasets of increasingly complex systems in order to identify more informative genes reflecting the underlying biology.
Bayesian networks have been an important concept for modeling uncertain systems [7–10]. In the last decade several researchers have examined methods for modeling gene expression datasets based on Bayesian network methodology [2–4]. These networks are directed acyclic graphs (DAGs) that represent the joint probability distribution of variables efficiently and effectively [11]. Each node in the graph represents a gene, and the edges represent conditional dependencies between genes (the absence of an edge encodes a conditional independence). Bayesian networks are popular tools for modeling gene expression data as their structure and parameters can easily be interpreted by biologists.
Bayesian classifiers are a family of Bayesian networks that are specifically designed to classify cases within a dataset through the use of a class node. The simplest is known as the naïve Bayes classifier (NBC), where the distribution for every variable is conditioned upon the class and the variables are assumed to be independent of each other given the class. Despite this oversimplification, NBCs have been shown to perform very competitively on gene expression data in classification and feature selection problems [5, 12, 13]. Other Bayesian classifiers, which often have higher model complexity as they contain more parameters, involve learning different networks such as trees between the variables and therefore relax the independence assumption [11]. The logical conclusion is the general Bayesian Network Classifier (BNC), which simply learns a structure over the variables including the class node. In this paper, we explore the use of the NBC and the BNC for predicting expression on independent datasets in order to identify informative genes using classifiers of differing complexity.
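For reference, the factorizations behind these classifiers can be written out explicitly. The notation below (class node C, gene variables X_1 ... X_n, parent set Pa(X_i)) is introduced here for illustration and is not taken from the original text; the formulas are the standard textbook forms.

```latex
% General Bayesian network over variables X_1, ..., X_n:
P(X_1, \dots, X_n) = \prod_{i=1}^{n} P\bigl(X_i \mid \mathrm{Pa}(X_i)\bigr)

% Naive Bayes classifier: the class node C is the only parent of every X_i:
P(C, X_1, \dots, X_n) = P(C) \prod_{i=1}^{n} P(X_i \mid C)
```

Classifiers of higher complexity add further parents to each X_i, which increases the number of parameters in the corresponding conditional distributions.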
Accordingly, in order to optimize the classifier and choose the best method, we need to consider the classifiers' bias and variance. Since bias and variance have an inverse relationship [12], meaning that decreasing one increases the other, cross-validation methods can be adopted to balance this trade-off. k-fold cross-validation [12, 14] randomly splits the data into k folds of the same size. The process is repeated k times, where k-1 folds are used for training and the remaining fold is used for testing the classifier. This process leads to better classification, with lower bias and variance [15], than other training and testing methods when using a single dataset. In this paper, we exploit bias and variance using both cross-validation on a single dataset and independent test data in order to learn models that better represent the true underlying biology. In the next section we describe the gene identification algorithm for identifying gene subsets that are specific to a single simple dataset as well as subsets that exist across datasets of all biological complexity. We used the model proposed by den Bulcke et al. [16] for generating synthetic datasets to validate our findings on real microarray data. Moreover, we evaluate the performance of our algorithm by comparing its ability to identify the informative genes and the underlying interactions among genes with that of the concordance model. Finally, we present the conclusions and a summary of our findings in the last section.
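The k-fold splitting described above can be sketched in a few lines of Python. This is a minimal illustration only: the function names and the fixed random seed are assumptions made here, and the original study was implemented in Matlab rather than with this code.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Randomly split sample indices 0..n_samples-1 into k folds of near-equal size."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index, so fold sizes differ by at most one.
    return [indices[i::k] for i in range(k)]


def train_test_splits(n_samples, k, seed=0):
    """Return k (train, test) pairs: k-1 folds for training, one fold for testing."""
    folds = k_fold_indices(n_samples, k, seed)
    splits = []
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((sorted(train), sorted(test)))
    return splits


# Example: 24 samples in 6 folds gives 20 training and 4 test samples per round.
splits = train_test_splits(n_samples=24, k=6)
```

Every sample appears in exactly one test fold, so each sample is predicted once by a model that never saw it during training.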
Methods
Multi-Data Gene Identification Algorithm
The algorithm takes multiple datasets of increasing biological complexity as input and applies a repeated training and testing regime. Firstly, this involves a k-fold cross-validation approach on the single simple dataset (from now on we refer to this as the cross-validation data), where Bayesian networks are learnt from the training set and tested on the test set for all k folds. The same fold arrangements are used again for assessing the final model. The Bayesian network learning algorithm is outlined in the next section.
The Sum Squared Error (SSE) and variance are calculated for all genes over these folds by predicting the measured expression level of a gene given the measurements taken from the others. Next, the same models from each of the k folds are tested on the other (more complex) datasets (the independent test data), and SSE and variance are again calculated. These SSE and variance values are used to rank the genes according to their informativeness (which identifies the most predictive and influential genes). Genes that are ranked highly in the single-dataset cross-validation experiments will be informative specifically for that single dataset, whereas those that are ranked highly on the independent datasets should be informative in a more general sense, in that they are predictive (low SSE) and consistent (low variance) across datasets of all complexity. We evaluate the statistical significance of these rankings using a method proposed by Zhang et al. [17]. The full details are outlined in Algorithm 1, where TrainD represents the training data (cross-validation data, here the relatively simple dataset), and TestD_{ 1 } ... TestD_{ M } represent the more complex test datasets (independent test data).
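The training and testing regime described above might be sketched as follows. This is a hedged illustration, not the authors' implementation: `fit_mean_model` is a deliberately trivial stand-in for the Bayesian network learner (it predicts each gene by its training mean), and all function names and the toy data are assumptions introduced here.

```python
from statistics import mean, pvariance


def fit_mean_model(train_rows):
    """Stand-in for the Bayesian network learner: per-gene training mean."""
    n_genes = len(train_rows[0])
    return [mean(row[g] for row in train_rows) for g in range(n_genes)]


def sse_per_gene(model, rows):
    """Sum of squared prediction errors for each gene over the given samples."""
    return [sum((row[g] - model[g]) ** 2 for row in rows) for g in range(len(model))]


def rank_genes(scores):
    """Gene indices ordered from lowest (best) to highest score."""
    return sorted(range(len(scores)), key=lambda g: scores[g])


def summarise(per_fold):
    """Rank genes by mean SSE and by variance of SSE over the folds."""
    n_genes = len(per_fold[0])
    avg = [mean(f[g] for f in per_fold) for g in range(n_genes)]
    var = [pvariance([f[g] for f in per_fold]) for g in range(n_genes)]
    return rank_genes(avg), rank_genes(var)


def multi_data_ranking(train_d, test_ds, folds):
    """folds: one list of row indices into train_d per cross-validation fold."""
    cv_sse, test_sse = [], [[] for _ in test_ds]
    for fold in folds:
        held_out = set(fold)
        train_rows = [r for i, r in enumerate(train_d) if i not in held_out]
        model = fit_mean_model(train_rows)
        cv_sse.append(sse_per_gene(model, [train_d[i] for i in fold]))
        # The same fold model is also scored on every independent dataset.
        for m, test_d in enumerate(test_ds):
            test_sse[m].append(sse_per_gene(model, test_d))
    train_r_sse, train_r_var = summarise(cv_sse)
    return train_r_sse, train_r_var, [summarise(t) for t in test_sse]


# Toy data: rows are samples, columns are genes (gene 0 is constant, gene 2 is noisy).
train_d = [[1.0, 2.0, 0.0], [1.0, 2.5, 4.0], [1.0, 2.0, -3.0], [1.0, 2.5, 5.0]]
test_ds = [[[1.0, 3.0, 9.0], [1.0, 1.0, -9.0]]]
train_r_sse, train_r_var, test_r = multi_data_ranking(train_d, test_ds, [[0, 1], [2, 3]])
```

In this toy example the constant gene is ranked first and the noisy gene last, both on the cross-validation data and on the independent dataset. Swapping `fit_mean_model` for a real structure learner leaves the rest of the regime unchanged.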
Bayesian Network Structure Learning
The goal of learning gene regulatory networks using Bayesian network approaches is to establish the structure of the network and then to parameterize the conditional probability tables [18]. As the number of possible network structures is huge, learning the structure of a network has a high computational cost. Since effective learning of network structure involves a trade-off between bias and variance, it is crucial to design an algorithm that can generate an ideal structure for a given dataset with a given degree of biological complexity [19]. In this study, instead of using well-studied but unrealistic and sometimes ineffective classifiers such as NBC and Tree Augmented Networks (TAN), we use an optimization approach based on a simulated annealing search with the Bayes Information Criterion (BIC) as a scoring metric [20]. The advantage of simulated annealing over other methods (such as greedy search or hill climbing) is that it aims to avoid local maxima [11]. We have chosen the BIC as a fitness function as it is less prone to overfitting through the use of a penalizing term for overly complex models.
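For orientation, the BIC score of a network structure is commonly written as below. The symbols (structure G, data D with N samples, maximum-likelihood parameters and the number d_G of free parameters) are notation assumed here; the text does not state which exact variant of the score was used.

```latex
\mathrm{BIC}(G \mid D) = \log P\bigl(D \mid G, \hat{\theta}_G\bigr) - \frac{d_G}{2} \log N
```

The first term rewards fit to the data, while the second term grows with the number of parameters, which is the penalty for overly complex models referred to above.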
Bayesian networks with more connections between their nodes require a higher number of parameters and as a result increase the complexity of the models exponentially [21]. Therefore, we explore three different classes of model learning. The first is the Selective Naïve Bayes (SNB), where only links between a class node representing differentiation status and a gene are explored. The second is a search that explores structures with links between genes but limits each gene to having only one parent (1PB). Limiting the number of parents in a Bayesian network is common practice but can be considered a crude approach to reducing parameters. As a result, the third class is a full unlimited structure learning (NPB). We learn these structures using simulated annealing with the BIC scoring metric (which naturally penalises overly complex networks). In this study, the initial state of the structure is an empty DAG with no links. In order to alter the network structures, three operators are used within the simulated annealing: adding, removing, or swapping links to generate a new network for validation. These alterations can be either accepted or rejected. The outline of this procedure can be found in Algorithm 2.
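A minimal Python sketch of such a simulated annealing search over DAG structures is given below. Several things are assumptions made for illustration: the "swap" operator is interpreted as reversing a link, the standard Metropolis acceptance rule is used, and `toy_score` is a hypothetical scoring function that stands in for the BIC.

```python
import math
import random


def is_acyclic(n_nodes, edges):
    """Kahn's algorithm: True if the directed graph contains no cycle."""
    indeg = [0] * n_nodes
    for _, child in edges:
        indeg[child] += 1
    queue = [v for v in range(n_nodes) if indeg[v] == 0]
    seen = 0
    while queue:
        v = queue.pop()
        seen += 1
        for parent, child in edges:
            if parent == v:
                indeg[child] -= 1
                if indeg[child] == 0:
                    queue.append(child)
    return seen == n_nodes


def propose(n_nodes, edges, rng):
    """Apply one random operator (add, remove, or reverse a link)."""
    new = set(edges)
    op = rng.choice(["add", "remove", "reverse"])
    if op == "add" or not new:
        a, b = rng.sample(range(n_nodes), 2)
        new.add((a, b))
    else:
        a, b = rng.choice(sorted(new))
        new.discard((a, b))
        if op == "reverse":
            new.add((b, a))
    # Proposals that would create a cycle are rejected outright.
    return new if is_acyclic(n_nodes, new) else set(edges)


def anneal(n_nodes, score, t0=10.0, tn=0.001, maxfc=1000, seed=0):
    rng = random.Random(seed)
    c = (tn / t0) ** (1.0 / maxfc)  # geometric cooling factor
    current, t = set(), t0          # start from an empty DAG
    cur_score = score(current)
    best, best_score = set(current), cur_score
    for _ in range(maxfc):
        cand = propose(n_nodes, current, rng)
        d = score(cand) - cur_score
        # Metropolis rule: always accept improvements, sometimes accept worse moves.
        if d > 0 or rng.random() < math.exp(d / t):
            current, cur_score = cand, cur_score + d
            if cur_score > best_score:
                best, best_score = set(current), cur_score
        t *= c
    return best, best_score


# Hypothetical score: negative distance to a known target structure.
target = {(0, 1), (1, 2)}
toy_score = lambda edges: -len(edges ^ target)
best, best_score = anneal(3, toy_score)
```

The sketch keeps track of the best structure seen so far, so an accepted downhill move late in the search cannot degrade the returned result.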
Prediction and Ranking
Zhang et al. [17] proposed a method to convert a set of gene rankings into position p-values to evaluate the significance of a given gene. However, this involved working with resampling techniques upon a single dataset. Here, we use the ranking lists according to the model's average SSE and variance for both the original simple dataset and the independent test sets in order to generate position p-values. This requires us to include a number of random genes, which can be counted as uninformative genes. By comparing the actual ranking of a gene with this null distribution we can calculate the position p-values. In this paper we are using three independent datasets, so we do not need to use resampling in order to generate more gene rankings as Zhang et al. [17] did in their experiments. In addition, the different rankings will have different interpretations, as some are based purely on the simple dataset whilst others are influenced by error and variance on the more biologically complex independent data.
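The idea of comparing a gene's rank with the ranks of random genes can be illustrated with a simple empirical p-value. This is a simplified sketch and not the exact procedure of Zhang et al. [17]; the function name, the add-one correction, and the gene labels are all assumptions made for the example.

```python
def position_p_values(ranking, null_genes):
    """Empirical p-value per gene: share of null genes ranked at least as well."""
    null_genes = set(null_genes)
    n_null = len(null_genes)
    p_values, nulls_seen = {}, 0
    for gene in ranking:  # ranking runs from best to worst
        if gene in null_genes:
            nulls_seen += 1
        else:
            # Adding one to numerator and denominator keeps p strictly above zero.
            p_values[gene] = (nulls_seen + 1) / (n_null + 1)
    return p_values


# Illustrative ranking with two random (null) genes mixed in.
ranking = ["geneA", "rand1", "geneB", "rand2", "geneC"]
p = position_p_values(ranking, {"rand1", "rand2"})
```

A gene ranked above all random genes receives the smallest attainable p-value, while a gene ranked below all of them receives a p-value of 1.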
Datasets
With the aim of investigating the influence of the complexity of a gene expression dataset on the performance of classifiers in identifying the gene regulatory network, three gene expression datasets (with increasing biological variation) were chosen for this study (GSE3858 [22], GSE1984 [23], and GSE989 [24]). These three datasets are all concerned with the differentiation of cells into the muscle (myogenic) lineage. During this process, mononucleated precursor cells stop proliferating, differentiate, and fuse with each other to become elongated multinucleated myotubes or myofibres. This in vitro system mimics the formation of new muscle fibres in vivo. The cell types differ between the datasets:

GSE3858: Embryonic fibroblasts (EF)

GSE989 and GSE1984: C2C12 tumor cell line that has the potential for differentiation into different mesodermic lineages (mainly muscle and bone)
The methods used to drive cells into myogenic differentiation also differ:

GSE3858: Exogenous expression of the myogenic transcription factors Myod and Myog

GSE989 and GSE1984: Serum Starvation
Specification of three muscle differentiation datasets
Dataset     Cell Type  Platform    Samples  Time Points
Tomczak     C2C12      Affy U74A   24       8
Cao         EF         Affy 430.2  36       4
Sartorelli  C2C12      Affy U74A   32       6
Data Processing and Analysis
The raw microarray data were normalized and summarized with the RMA method [25], using the affy package in R. Only the 8904 probesets common to the Affymetrix U74A and 430.2 arrays used in the studies mentioned above were considered in the analysis. All datasets were standardized to a mean of 0 and a standard deviation of 1 across the genes. For the scope of this paper, we first selected for each dataset a subset of the 100 genes most affected by the induction of differentiation. These genes were identified with Student's t-test, which compared samples from undifferentiated and differentiated cell cultures, disregarding the time of differentiation. An additional 50 genes were randomly selected in order to calculate the ranking p-scores described above, using the Kolmogorov-Smirnov test. For cross-validation we divided the Cao dataset into 9 folds, Sartorelli into 8 folds, and Tomczak into 6 folds, based upon the number of samples in each dataset. Simulated annealing has three attributes that should be set before starting the learning phase: an appropriate initial temperature, a sufficient number of iterations, and a suitable fitness function. In this study, the initial temperature was set to 10 and the search terminates at a temperature of 0.001. The number of iterations was set to 1000 for the first set of experiments, which used only the most informative genes (top 100), and to 1500 once the 50 uninformative genes had been added to the network. The code is implemented in Matlab 2007a using the Bayes Net toolbox [26] to generate gene regulatory networks.
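The standardization and t-test based gene selection described above can be sketched as follows. This is an illustration under stated assumptions: genes are ranked by the absolute pooled-variance t statistic (which gives the same order as ranking by p-value when group sizes are fixed), the function names and toy expression values are invented for the example, and the original analysis was carried out in R and Matlab.

```python
import math
from statistics import mean, stdev, variance


def standardize(values):
    """Rescale values to mean 0 and standard deviation 1 (requires non-zero spread)."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


def student_t(group_a, group_b):
    """Pooled-variance two-sample t statistic (requires non-zero pooled variance)."""
    na, nb = len(group_a), len(group_b)
    pooled = ((na - 1) * variance(group_a) + (nb - 1) * variance(group_b)) / (na + nb - 2)
    return (mean(group_a) - mean(group_b)) / math.sqrt(pooled * (1 / na + 1 / nb))


def top_genes(expr, labels, n_top):
    """expr: {gene: [value per sample]}; labels: 0 (undifferentiated) or 1 per sample."""
    scores = {}
    for gene, values in expr.items():
        a = [v for v, lab in zip(values, labels) if lab == 0]
        b = [v for v, lab in zip(values, labels) if lab == 1]
        scores[gene] = abs(student_t(a, b))
    return sorted(scores, key=scores.get, reverse=True)[:n_top]


# Toy example: g1 changes strongly between the two states, g2 barely changes.
expr = {"g1": [1.0, 1.2, 3.0, 3.1], "g2": [2.0, 2.2, 2.1, 1.9]}
labels = [0, 0, 1, 1]
selected = top_genes(expr, labels, n_top=1)
scaled = standardize([1.0, 2.0, 3.0])
```

In the study the same kind of selection would be applied with `n_top` set to 100, followed by the addition of 50 randomly chosen genes.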
Analysis of myogenesis-related genes
Myogenesis-related genes are defined as genes associated with the Gene Ontology term "Muscle Development", supplemented with all genes strongly associated with myogenesis in the biomedical literature, as determined with the literature analysis tool Anni v2.0 [27] using an association score greater than 0.02.
Analysis of Synthetic datasets
The use of datasets in which the underlying network is known enables us to validate new algorithms that have been developed to identify gene regulatory networks and capture the most informative genes. den Bulcke et al. [16] proposed a methodology, SynTReN, to generate synthetic datasets where the network structure is known and the biological, experimental, and model complexity can be manipulated. However, a disadvantage of this approach is that the generated networks can contain overlapping pieces of the known network, which may weaken the assumption that the models are probabilistically independent [28]. Whilst SynTReN uses resampling from potentially overlapping networks, the generated data undergo a robust statistical cross-validation regime, ensuring that any prediction is applied to unseen data. The focus of this paper is upon the prediction of increasingly complex datasets, sampled from some underlying biological process. Consequently, these synthetic datasets can be used for validating the performance of our methodology in identifying the informative genes and the interactions among them in real microarray data. SynTReN [16] generates networks with more realistic topological characteristics, and since we use this application to investigate the impact of biological, experimental, and model complexity on identifying informative genes, using the same subnetwork is an advantage. Three datasets were generated on the well-described network structure of E. coli [29], which contains 1330 nodes and 2724 interactions. These datasets were generated so that they match the key characteristics of the real microarray datasets used in this study (for instance, by limiting the number of genes selected for modelling to 150). This enables us to investigate the possibility of reproducing similar results on synthetic data, which can easily be corrected for differences such as the number of samples and time points per dataset (see Additional file 1), and to avoid weakening the assumption of probabilistic independence of the generated datasets.
Analysis of Concordance between datasets
The study of the concordance between microarray datasets has increased considerably in the past few years [30]. However, a robust statistical method for examining the concordance or discordance among microarray experiments carried out in different laboratories has yet to be developed. Methods such as the multiplication of gene p-values to generate a list of rankings for concordant genes showed bias towards datasets with a higher significance level [31]. Lai et al. [32] proposed a promising methodology (which we call the concordance model) to investigate the concordance or discordance between two large-scale datasets with two responses. This method uses a list of z-scores, generated using a statistical test of differential expression, as input to evaluate the concordance or discordance of two datasets by calculating mixture-model-based likelihoods and testing partial discordance against concordance or discordance. Additionally, the statistical significance of a test is evaluated by a parametric bootstrap procedure, and a list of gene rankings is generated which can be used for integrating two datasets efficiently. In this paper we use a set of gene rankings generated by this method to evaluate the performance of our model in identifying informative genes from multiple datasets with increasing complexity.
Results
Average correlation between replicates and number of differentially expressed genes (based on BH-corrected p-values) in each dataset
Dataset     Correlation  Genes with p-value (BH) < 0.05  Genes with p-value (BH) < 0.01
Tomczak     0.975        4602                            3604
Cao         0.971        3668                            2623
Sartorelli  0.964        1199                            458
Comparison of classifiers and network analysis
According to Mac Nally [33], simple models should be sought for various reasons. Firstly, simple models are more stable and less prone to overfitting to noise in the data, which would otherwise degrade the performance of the classifier on future data. Secondly, they tend to provide better insight into causality and interactions among genes. Finally, reducing the number of parameters decreases the cost of validating a model for current and future data. However, we need a model that matches the complexity of the datasets. Considering this argument along with our first set of results, we chose 1PB as a model that can capture the interactions among genes and does not overfit to noise. In order to understand the impact of using different datasets for gene selection and for training the 1PB classifier (which will be discussed in the next section), we need to analyse the performance of the 1PB classifier on the top 100 (most informative) genes in more detail.
Additional file 1, Figure S7 compares the error rate of the 1PB classifier on cross-validation with that on the independent test. The 1PB classifier trained on Tomczak performed significantly better on cross-validation, while Sartorelli shows the smallest difference between cross-validation and the independent test, with almost the same average error rate on the cross-validation set as Cao. Although the difference between the average error rates on the cross-validation set and the independent test set is high for Tomczak, this dataset produced the best models in terms of the lowest overall error rate. This figure suggests that Tomczak is the most informative dataset, since it can model any dataset, regardless of the gene selection method, significantly better than the alternatives. This will be discussed in more detail in the Extraction of informative genes section.
Comparison of gene selections with differing informativeness
It is important to adopt a methodology that can generate an accurate gene regulatory network. Moreover, it is crucial to generate a model that can capture the significant genes and distinguish informative genes from uninformative ones. For this purpose, we added 50 randomly selected genes with high p-values (which imply less relatedness to myogenesis) from the distribution. This also increases the complexity of the datasets.
It is crucial to investigate whether these findings are reproducible and are not sensitive to the number of samples and time points per dataset. Therefore, we applied our model to three synthetic datasets that were generated by manipulating the biological, experimental, and model complexity of their known network structure using the SynTReN application [16]. Additional file 1, Figure S9 shows a pattern very similar to that seen on the real data: the average error rate of models learnt on multiple synthetic datasets increases with increasing biological variability. In the next section, before examining whether these models can help us to capture the interactions in more complex datasets, we investigate how well these models separate the informative genes from uninformative ones.
Extraction of informative genes
In order to test the ability of classifiers to separate informative genes from uninformative ones, we looked at the result of the Kolmogorov-Smirnov test (KS test) on the ranking of genes according to their average error rate using a given model. Using this algorithm, we calculated the p-value, the KS test, and the result of investigating the differentiation hypothesis along with the models' bias or variance. The results of this investigation are displayed in Additional file 1, Table S1, where Cao and Tomczak performed very well on cross-validation both in terms of bias and variance. However, models learnt on Sartorelli fail to separate informative genes from uninformative genes, as the scores are generally very low.
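The separation between informative and random genes can be quantified with the two-sample KS statistic, sketched below in plain Python. The function name and the toy rank lists are assumptions for illustration; only the statistic is computed here, not its p-value.

```python
def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    d = 0.0
    for x in a + b:
        cdf_a = sum(v <= x for v in a) / len(a)
        cdf_b = sum(v <= x for v in b) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d


# Ranks of informative genes versus ranks of random genes in two toy rankings.
well_separated = ks_statistic([1, 2, 3, 4], [5, 6, 7, 8])
interleaved = ks_statistic([1, 3, 5, 7], [2, 4, 6, 8])
```

A statistic close to 1 indicates that the informative genes occupy the top of the ranking, whereas a value near 0 indicates that the two groups are mixed together.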
Generally, Tomczak outperformed Sartorelli and Cao and can be chosen as the most informative dataset in this study. Models learnt on Tomczak generated the lowest bias and variance and produced the best separation. In contrast, Sartorelli is the noisiest and least informative dataset: it failed to handle any increase in complexity (both biological and model-wise) and generates models with the highest bias and variance, which also leaves them unable to separate informative genes from the others. Now the question is whether we can use a simpler and cleaner dataset to model more complex ones. In the next section we show how we tackled this question.
Analysis of the use of a simpler dataset to model a more complex one
Myh7 and Tor3a are two examples of significant improvements in the Sartorelli dataset. Myh7, which was originally ranked 101, improved 96 places to rank 5 (rank 55 in the concordance model). During the learning phase it was linked to four other genes, three of which are myogenesis-related. These genes, in both datasets, are directly correlated and can represent each other in terms of prediction and validation. Tor3a, however, has a very low rank in both datasets and yet improved 107 places, from 128 to 21 (rank 31 in the concordance model). It was linked to Prune, which also improved 106 places (from 131 to 25; rank 100 in the concordance model). All three genes mentioned above were selected as informative genes from Tomczak and yet were placed in the bottom 50 due to the quality of the Sartorelli dataset. These are some examples of the ability of the model to pull out informative genes from a distribution (Figures S10a and S10b, provided in Additional file 1).
Although the overall improvement in myogenesis-related genes is significantly high, we were concerned about why this model failed to improve the rank of some genes, such as Id3, which dropped from rank 1 in Sartorelli to 133 (rank 51 in the concordance model). In the learning process, Id3 was linked to four genes: Fabp3, Rbm38, X99384, and Slco3a1. To answer this question, we first validate the relatedness of these genes to Id3 in the Tomczak dataset to investigate whether they are significant and can represent Id3. Secondly, we study the expression level of these genes in Sartorelli to identify why this model failed so dramatically in predicting the Id3 value.
Additional file 1, Figure S11 shows the expression level of Id3 along with its parents/children in both the Tomczak and Sartorelli datasets. In Tomczak we can clearly see a highly significant inverse relationship between Id3 and the other four genes. As the differentiation state changes, Id3 drops from an expression level of approximately 11 to 8.5, while its relatives show an increase of about 2 points in their expression values. This supports the assumed relatedness of these genes to Id3 in the learning process on the Tomczak dataset. However, although Id3 is still very significant in Sartorelli, its parents/children show no variation there and are simply not significant. In conclusion, this model failed to predict the Id3 expression value, and as a result the rank of Id3 dropped 132 places, most probably due to the quality and biological variation of the Sartorelli dataset.
Since we aim to overcome the lack of overlap between gene regulatory network studies across species and platforms, the natural extension of the work in this paper would be to explore how this model can be used on datasets from multiple biological systems with increasing complexity. Moreover, it would be valuable to consider methods such as model averaging [34], which has been shown to improve the generalization of classifier accuracy. This could in turn improve the performance of classifiers in identifying the most informative genes and avoid the deterioration seen in cases like Id3. Furthermore, dynamic Bayesian networks can be adopted when learning from time-series data in order to handle auto-regulation and feedback loops, two key components of regulatory networks in biological data [35, 36].
Conclusions
In this study, we have investigated a number of different Bayesian classifiers and datasets for identifying, firstly, subsets of genes that are related to myogenesis and muscle differentiation and, secondly, the use of cleaner and more informative datasets in modelling more biologically complex datasets. We have shown that an appropriate combination of simpler and more informative datasets produces very good results, whereas models learnt on genes selected from more complex datasets performed poorly. We conclude that simpler datasets can be used to model more complex ones and capture the interactions among genes. Moreover, we have shown that highly predictive and consistent genes, from a pool of differentially expressed genes, across independent datasets are more likely to be fundamentally involved in the biological process under study. In three published datasets, we have demonstrated that these models can explain the myogenesis-related genes (genes of interest) significantly better than others (P < 0.004), since the improvement in their rankings is much more pronounced. These results imply that gene regulatory networks identified in simpler systems can be used to model more complex biological systems. In the example of muscle differentiation, a myogenesis-related gene network may be difficult to derive from in vivo experiments directly due to the presence of multiple cell types and inherently higher biological variation, but may become evident after initial training of the network on the cleaner in vitro experiments. In order to validate our approach, we firstly evaluated our model on synthetic datasets and secondly performed comparisons between our approach and the method of Lai et al. [32], which we call the concordance model.
It is shown that our model performs comparably in improving the ranks of informative genes and lowering the ranks of uninformative ones, but that the improvement of ranks for myogenesis-related genes is much more pronounced, whilst additionally modelling the interactions among genes. However, it is necessary to develop other statistical measures so that the model can be quantified to distinguish different degrees of complexity and different platforms whilst handling the auto-regulation and feedback loops within the network.
Algorithm 1: Multi Data Gene Identification Algorithm
Input: TrainD, {TestD_1, ..., TestD_M}, folds
for k = 1:folds
    Learn BN using Algorithm 2 on the training folds of TrainD
    Score SSE on test fold k of TrainD
    Score SSE on all independent test datasets {TestD_1, ..., TestD_M}
end for
Calculate variance of SSE over all k folds on TrainD and {TestD_1, ..., TestD_M}
Create gene rankings trainR_SSE, trainR_var, {testR_SSE_1, ..., testR_SSE_M} and {testR_var_1, ..., testR_var_M} by ordering the genes on the respective SSE and variance scores
Output: trainR_SSE, trainR_var, {testR_SSE_1, ..., testR_SSE_M}, {testR_var_1, ..., testR_var_M}
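The cross-validation and ranking loop of Algorithm 1 can be illustrated with a minimal Python sketch. It is not the authors' implementation: the Bayesian network learner is replaced by a hypothetical stand-in (`fit_class_means`, which predicts each gene from its per-class mean), folds are formed by simple interleaving, genes are ranked on the total SSE across folds, and the toy data and all function names are assumptions made only for illustration.

```python
from statistics import pvariance


def fit_class_means(X, y):
    """Stand-in learner (assumption): mean expression of each gene per class."""
    model = {}
    for label in set(y):
        rows = [row for row, lab in zip(X, y) if lab == label]
        model[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return model


def sse_per_gene(model, X, y):
    """Sum of squared errors of the class-mean prediction, one value per gene."""
    sse = [0.0] * len(X[0])
    for row, label in zip(X, y):
        for g, (value, pred) in enumerate(zip(row, model[label])):
            sse[g] += (value - pred) ** 2
    return sse


def rank_genes(scores):
    """Gene indices ordered from the lowest (best) score to the highest."""
    return sorted(range(len(scores)), key=lambda g: scores[g])


def summarise(per_fold):
    """Rank genes on total SSE and on the variance of SSE across folds."""
    by_gene = list(zip(*per_fold))
    return (rank_genes([sum(v) for v in by_gene]),
            rank_genes([pvariance(v) for v in by_gene]))


def multi_data_gene_ranking(train, tests, folds):
    X, y = train
    train_sse, test_sse = [], [[] for _ in tests]
    for k in range(folds):
        held = [i for i in range(len(X)) if i % folds == k]   # test fold k
        kept = [i for i in range(len(X)) if i % folds != k]   # training folds
        model = fit_class_means([X[i] for i in kept], [y[i] for i in kept])
        train_sse.append(
            sse_per_gene(model, [X[i] for i in held], [y[i] for i in held]))
        for m, (Xt, yt) in enumerate(tests):                  # independent datasets
            test_sse[m].append(sse_per_gene(model, Xt, yt))
    train_r_sse, train_r_var = summarise(train_sse)
    test_ranks = [summarise(s) for s in test_sse]
    return {
        "trainR_SSE": train_r_sse,
        "trainR_var": train_r_var,
        "testR_SSE": [r[0] for r in test_ranks],
        "testR_var": [r[1] for r in test_ranks],
    }


# Toy data (made up): gene 0 follows the class label, gene 1 is noise.
train = ([[0.0, 0.5], [1.0, -1.0], [0.1, 2.0],
          [0.9, 1.5], [0.0, -0.5], [1.0, 0.0]], [0, 1, 0, 1, 0, 1])
tests = [([[0.2, 1.0], [0.8, 1.0]], [0, 1])]
rankings = multi_data_gene_ranking(train, tests, folds=3)
print(rankings)
```

In this toy setting the gene that tracks the class label is ranked first on both the training data and the independent test set, which is the kind of consistency across datasets that the rankings are meant to expose.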
Algorithm 2: Simulated Annealing Structure Learning
Input: t_0, maxfc, D
fc = 0, t = t_0, t_n = 0.001
c = (t_n/t_0)^(1/maxfc)
Initialise bn to a Bayesian classifier with no intergene links
result = bn
oldscore = score(bn)
while fc < maxfc do
    for each operator do
        apply operator to bn
        newscore = score(bn)
        fc = fc + 1
        dscore = newscore - oldscore
        if newscore > oldscore then
            result = bn
            oldscore = newscore
        else if r(0,1) < e^(dscore/t) then
            oldscore = newscore
        else
            undo the operator
        end if
    end for
    t = t × c
end while
Output: result
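The annealing loop of Algorithm 2 can be sketched in Python as follows. This is an illustration under stated assumptions rather than the authors' code: the network score is replaced by a hypothetical toy function (`toy_score`) that rewards a made-up set of target links, the operators are limited to adding and removing a link, links only run from a lower to a higher gene index so that no cycle can arise, and the sketch uses the standard Metropolis rule (a worse structure is kept with probability e^(dscore/t)) while returning the best structure visited.

```python
import math
import random
from itertools import combinations

GENES = range(4)
ALL_LINKS = list(combinations(GENES, 2))   # links i -> j with i < j only (acyclic)
TARGET = frozenset({(0, 1), (1, 2)})       # hypothetical "true" intergene links


def toy_score(bn):
    """Stand-in score (assumption): reward target links, penalise others."""
    return len(bn & TARGET) - len(bn - TARGET)


def add_link(bn, rng):
    """Operator: add one randomly chosen absent link, if any."""
    absent = [link for link in ALL_LINKS if link not in bn]
    return bn | {rng.choice(absent)} if absent else bn


def remove_link(bn, rng):
    """Operator: remove one randomly chosen present link, if any."""
    return bn - {rng.choice(sorted(bn))} if bn else bn


def anneal(score, operators, start, t0=1.0, tn=0.001, maxfc=2000, seed=0):
    rng = random.Random(seed)
    c = (tn / t0) ** (1.0 / maxfc)         # geometric cooling factor
    bn, oldscore = start, score(start)
    result, best = bn, oldscore            # best structure seen so far
    t, fc = t0, 0
    while fc < maxfc:
        for op in operators:
            candidate = op(bn, rng)
            newscore = score(candidate)
            fc += 1
            dscore = newscore - oldscore
            # Accept improvements always, worse moves with prob. e^(dscore/t).
            if dscore > 0 or rng.random() < math.exp(dscore / t):
                bn, oldscore = candidate, newscore
                if newscore > best:
                    result, best = candidate, newscore
            # A rejected candidate is simply discarded ("undo the operator").
        t *= c
    return result, best


# Start from a structure with no intergene links, as in Algorithm 2.
result, best = anneal(toy_score, [add_link, remove_link], frozenset())
print(sorted(result), best)
```

Because structures are immutable `frozenset` objects, undoing an operator amounts to keeping the previous structure, which avoids any bookkeeping for reversing individual moves.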
Declarations
Acknowledgements
This work was funded in part by a Royal Society International Joint Projects Grant (2006/R3).
References
1. Bockhorst J, Craven M, Page D, Shavlik J, Glasner J: A Bayesian approach to operon prediction. Bioinformatics 2003, 19: 1227–1235. 10.1093/bioinformatics/btg147
2. Segal E, Shapira M, Regev A, Pe'er D, Botstein D, Koller D, Friedman N: Module networks: identifying regulatory modules and their condition specific regulators from gene expression data. Nature Genetics 2003, 34: 166–176. 10.1038/ng1165
3. Friedman N, Linial M, Nachman I, Pe'er D: Using Bayesian networks to analyze expression data. Proceedings of the 4th International Conference on Computational Molecular Biology 2000, 127–135.
4. Xu X, Wang L, Ding D: Learning module networks from genome-wide location and expression data. FEBS Letters 2004, 587: 297–304. 10.1016/j.febslet.2004.11.019
5. Grossman D, Domingos P: Learning Bayesian network classifiers by maximizing conditional likelihood. Proceedings of the 21st International Conference on Machine Learning 2004, 69: 46–54.
6. Peña JM, Björkegren J, Tegnér J: Learning dynamic Bayesian network models via cross-validation. Pattern Recognition Letters 2005, 26: 2295–2308. 10.1016/j.patrec.2005.04.005
7. Pearl J: Fusion, propagation, and structuring in belief networks. Artificial Intelligence 1986, 29: 241–288. 10.1016/0004-3702(86)90072-X
8. Buntine WL: A guide to the literature on learning probabilistic networks from data. IEEE Transactions on Knowledge and Data Engineering 1996, 8: 195–210. 10.1109/69.494161
9. Heckerman D: A tutorial on learning with Bayesian networks. In Learning in graphical models. Edited by: Jordan MI. Dordrecht: Kluwer Academic Publishers; 1998:301.
10. Friedman N, Koller D: Being Bayesian about network structure. A Bayesian approach to structure discovery in Bayesian networks. Machine Learning 2003, 50: 95–125. 10.1023/A:1020249912095
11. Friedman N, Geiger D, Goldszmidt M: Bayesian network classifiers. Machine Learning 1997, 29: 131–163. 10.1023/A:1007465528199
12. Fielding AH: Introduction to classification. In Cluster and classification techniques for the Biosciences. 1st edition. Cambridge: Cambridge University Press; 2007:86.
13. Tobler JB, Molla MN, Nuwaysir EF, Green RD, Shavlik JW: Evaluating machine learning approaches for aiding probe selection for gene-expression arrays. Bioinformatics 2002, 18: S164–S171.
14. Stone M: Cross-validatory choice and assessment of statistical predictions (with discussion). Journal of the Royal Statistical Society B 1974, 36: 111–147.
15. Kohavi R: Wrapper for performance enhancement and oblivious decision graphs. PhD thesis. Stanford University, Computer Science Department; 1995.
16. Bulcke T, Van Leemput K, Naudts B, Van Remortel P, Ma H, Verschoren A, De Moor B, Marchal K: SynTReN: a generator of synthetic gene expression data for design and analysis of structure learning algorithms. BMC Bioinformatics 2006, 7: 43. 10.1186/1471-2105-7-43
17. Zhang C, Lu X, Zhang X: Significance of gene ranking for classification of microarray samples. IEEE Transactions on Computational Biology and Bioinformatics 2006, 3: 312–320. 10.1109/TCBB.2006.42
18. Su J, Zhang H: Full Bayesian network classifiers. Proceedings of the 23rd International Conference on Machine Learning 2006, 148: 897–904.
19. Chickering DM, Heckerman D, Meek C: Large-sample learning of Bayesian networks is NP-hard. Machine Learning Research 2004, 5: 1287–1330.
20. Schwarz G: Estimating the dimension of a model. The Annals of Statistics 1978, 6: 461–464. 10.1214/aos/1176344136
21. Lam W, Bacchus F: Learning Bayesian belief networks (an approach based on the MDL principle). Computational Intelligence 1994, 10: 1–31. 10.1111/j.1467-8640.1994.tb00166.x
22. Cao Y, Kumar RM, Penn BH, Berkes CA, Kooperberg C, Boyer LA, Young RA, Tapscott SJ: Global and gene-specific analyses show distinct roles of Myod and Myog at a common set of promoters. The EMBO Journal 2006, 25: 502–511. 10.1038/sj.emboj.7600958
23. Iezzi S, Di Padova M, Serra C, Caretti G, Simone C, Maklan E, Minetti G, Zhao P, Hoffman EP, Puri PL, Sartorelli V: Deacetylase inhibitors increase muscle cell size by promoting Myoblast recruitment and fusion through induction of Follistatin. Developmental Cell 2004, 6: 673–684. 10.1016/S1534-5807(04)00107-8
24. Tomczak KK, Marinescu VD, Ramoni MF, Sanoudou D, Montanaro F, Han M, Kunkel LM, Kohane IS, Beggs AH: Expression profiling and identification of novel genes involved in myogenic differentiation. The FASEB Journal 2004, 18: 403–405.
25. Irizarry RA, Hobbs B, Collin F, Beazer-Barclay YD, Antonellis KJ, Scherf U, Speed TP: Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics 2003, 4: 249–264. 10.1093/biostatistics/4.2.249
26. Murphy KP: The Bayes net toolbox for Matlab. Computing Science and Statistics: Proceedings of the Interface 2001, 33.
27. Jelier R, Schuemie MJ, Veldhoven A, Dorssers LC, Jenster G, Kors JA: Anni 2.0: a multipurpose text-mining tool for the life sciences. Genome Biology 2008, 9: R96. 10.1186/gb-2008-9-6-r96
28. Haynes BC, Brent MR: Benchmarking regulatory network reconstruction with GRENDEL. Bioinformatics 2009, 25: 801–807. 10.1093/bioinformatics/btp068
29. Ma H, Kumar B, Ditges U, Gunzer F, Buer J, Zeng A: An extended transcriptional regulatory network of Escherichia coli and analysis of its hierarchical structure and network motifs. Nucleic Acids Res 2004, 32: 6643–6649. 10.1093/nar/gkh1009
30. Miron M, Woody OZ, Marcil A, Murie C, Sladek R, Nadon R: A methodology for global validation of microarray experiments. BMC Bioinformatics 2006, 7: 333. 10.1186/1471-2105-7-333
31. Rhodes DR, Barrette TR, Rubin MA, Ghosh D, Chinnaiyan AM: Meta-Analysis of Microarrays: Interstudy validation of gene expression profiles reveals pathway dysregulation in prostate cancer. Cancer Research 2002, 62: 4427–4433.
32. Lai Y, Eckenrode SE, She J: A statistical framework for integrating two microarray data sets in differential expression analysis. BMC Bioinformatics 2009, 10: S23. 10.1186/1471-2105-10-S1-S23
33. Mac Nally R: Regression and model-building in conservation biology, biogeography and ecology: the distinction between - and reconciliation of - 'predictive' and 'explanatory' models. Biodiversity and Conservation 2000, 9: 655–671. 10.1023/A:1008985925162
34. Madigan D, Raftery AE: Model selection and accounting for model uncertainty in graphical models using Occam's window. Journal of the American Statistical Association 1994, 89: 1535–1546. 10.2307/2291017
35. Shen-Orr SS, Milo R, Mangan S, Alon U: Network motifs in the transcriptional regulation network of Escherichia coli. Nature Genetics 2002, 31: 64–68. 10.1038/ng881
36. Lee TI, Rinaldi NJ, Robert F, Odom DT, Bar-Joseph Z, Gerber GK, Hannett NM, Harbison CT, Thompson CM, Simon I, Zeitlinger J, Jennings EG, Murray HL, Gordon DB, Ren B, Wyrick JJ, Tagne JB, Volkert TL, Fraenkel E, Gifford DK, Young RA: Transcriptional regulatory networks in Saccharomyces cerevisiae. Science 2002, 298: 799–804. 10.1126/science.1075090
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.