Selecting highdimensional mixed graphical models using minimal AIC or BIC forests
 David Edwards^{1}Email author,
 Gabriel CG de Abreu^{1} and
 Rodrigo Labouriau^{1}
DOI: 10.1186/147121051118
© Edwards et al; licensee BioMed Central Ltd. 2010
Received: 4 June 2009
Accepted: 11 January 2010
Published: 11 January 2010
Abstract
Background
Chow and Liu showed that the maximum likelihood tree for multivariate discrete distributions may be found using a maximum weight spanning tree algorithm, for example Kruskal's algorithm. The efficiency of the algorithm makes it tractable for highdimensional problems.
Results
We extend Chow and Liu's approach in two ways: first, to find the forest optimizing a penalized likelihood criterion, for example AIC or BIC, and second, to handle data with both discrete and Gaussian variables. We apply the approach to three datasets: two from gene expression studies and the third from a genetics of gene expression study. The minimal BIC forest supplements a conventional analysis of differential expression by providing a tentative network for the differentially expressed genes. In the genetics of gene expression context the method identifies a network approximating the joint distribution of the DNA markers and the gene expression levels.
Conclusions
The approach is generally useful as a preliminary step towards understanding the overall dependence structure of highdimensional discrete and/or continuous data. Trees and forests are unrealistically simple models for biological systems, but can provide useful insights. Uses include the following: identification of distinct connected components, which can be analysed separately (dimension reduction); identification of neighbourhoods for more detailed analyses; as initial models for search algorithms with a larger search space, for example decomposable models or Bayesian networks; and identification of interesting features, such as hub nodes.
Background
Recent years have seen intense interest in representing complex biological systems as networks, and a new research discipline, network biology, has arisen. In particular, Markov networks and Bayesian networks have been applied in many domains [1–3]. The former are based on undirected graphs, and the latter on DAGs (directed acyclic graphs). A key challenge in deriving such networks from the highdimensional data typical of the genomics era is computational efficiency: model selection algorithms that perform well for small or moderate dimensions may be intractable for high dimensions. The approach of Chow and Liu [4], which predates much of the development of probabilistic graphical models, is particularly efficient, being quadratic in the number of variables.
The ChowLiu algorithm
Suppose that we have a dataset with N observations of p discrete random variables X = (X_{ v })_{v∈Δ}. We call the possible values a discrete variable may take its levels, and label these 1,...X_{ v }, so that X_{ v } is the number of levels of X_{ v }. We write a generic observation (or cell) as x = (x_{1},..., x_{ p }), and the set of possible cells as χ. We assume that the observations are independent and are interested in modelling the probabilities p(x) = Pr(X = x) for x ∈ χ.
n(x_{ u }, x_{ v }) being the number of observations with X_{ u }= x_{ u }and X_{ v }= x_{ v }. The quantity I_{u, v}is called the mutual information. It follows that if we use the I_{u, v}as edge weights on the complete graph with vertex set X, and apply a maximum spanning tree algorithm, we obtain the maximum likelihood tree.
In statistical terms, I_{u, v}is one half of the usual likelihood ratio test statistic for marginal independence of X_{ u }and X_{ v }, that is G^{2} = 2 ln Q = 2I_{u, v}, calculated using the table of counts {n(x_{ u }, x_{ v })} formed by crosstabulating X_{ u }and X_{ v }. Under marginal independence G^{2} has an asymptotic distribution, where k = (X_{ u }  1)(X_{ v }  1). The degrees of freedom k is the number of additional free parameters required under the alternative hypothesis, compared with the null hypothesis.
where is the sample correlation between X_{ u }and X_{ v }. As before the likelihood ratio test statistic G^{2} = 2 ln Q = 2I_{u, v}. Under marginal independence G^{2} has a distribution.
Algorithms to find the maximum weight spanning tree of a arbitrary undirected connected graph with positive edge weights have been studied thoroughly. The following simple and efficient algorithm is due to Kruskal [5]. Starting with the null graph, repeat this step: among the edges not yet chosen, add the edge with the largest weight that does not form a cycle with the ones already chosen. When p  1 edges have been added, the maximum weight spanning tree of has been found. The algorithm can be implemented to run in O(p^{2} ln p) time.
As mentioned above, is here taken to be the complete graph on X with edge weights given by {I_{u, v}}_{u, v∈X}. In practice the task of calculating these p(p  1)/2 edge weights dominates the time usage, so the complexity of the ChowLiu algorithm may be taken to be O(p^{2}). Methods to improve computational efficiency have been described [6, 7].
Chow and Liu's approach has been extended to more general classes of graphs than trees: to thin junction trees [8]; to polytrees [9]; to bounded treewidth networks [10], and to mixtures of trees [11]. The approach has also been extended to treebased models for Gaussian processes [12] and discretevalued time series [13]. The consistency of the algorithm has been shown [14].
Results and Discussion
Extension to minimal AIC/BIC forests
A disadvantage with selecting a tree based on maximum likelihood is that it will always include the maximum number of edges, irrespective of whether the data support this or not. It is desirable to take account of the number of model parameters in some fashion. In the machine learning literature it is customary to penalize the likelihood using the minimum description length principle [15], whereas in the statistical literature the use of information criteria is wellestablished, particularly AIC (the Akaike information criterion [16]) and BIC (the Bayesian information criterion [17]). The former is defined as 2 ln L + 2r, where L is the maximized likelihood under the model and r is the number of parameters in the model, and the latter as 2 ln L + ln(N)r. Discussions of the relative merits of these criteria are available [18] and need not be repeated here.
First, suppose that Kruskal's algorithm is applied using penalized mutual information quantities = I_{u, v} k_{u, v}or = I_{u, v} ln(N)k_{u, v}/2, where k_{u, v}is the degrees of freedom associated with I_{u, v}, as described above. Then it is easily seen that the tree with the minimum AIC or BIC is obtained. Note that for Gaussian data this will be identical to the maximum likelihood tree, since all edges have the same degrees of freedom. For discrete data with varying numbers of levels, the maximum likelihood tree and the minimal AIC/BIC tree will generally differ.
Second, given a graph = (V, E) with both positive and negative edge weights, consider the problem of finding the maximum weight forest, that is, the acyclic subgraph on vertex set V with maximum weight. Let be the graph derived from by omitting all edges with negative weights. For any forest with vertex set V, removing all edges with negative weights would increase the total weight and not introduce any cycles. It follows that we can construct the maximum weight forest by finding the maximum weight spanning tree for each connected component of . We can do this simply by applying Kruskal's algorithm to : it is not necessary to find the connected components explicitly.
So it is easy to find the minimal AIC or BIC forest by using penalized mutual information quantities as weights. This approach is attractive with highdimensional data, since if the selected forest does consist of multiple connected components these may then be analyzed separately  allowing a dimension reduction. We show below that the connected components of the minimal AIC/BIC forest are also connected components of the minimal AIC/BIC decomposable model, providing further justification for this procedure.
That using penalized likelihood with the ChowLiu algorithm leads to forests rather than trees appears to be known in the machine learning literature [19]; also, [20] finds the Bayesian MAP tree/forest in a similar way, but we have found no published references in the computational biology or statistics research literatures. We believe that it is a useful method that deserves to be far more widely known.
A numerical illustration
where Pr(x_{ a }) = (0.5, 0.5)', and either (i) or (ii) .
Note that X_{ a }and X_{ b }are strongly associated but there is weak or no association between X_{ a }and X_{ c }.
Simulation Results
Case (i)  Case (ii)  

Algorithm 






ML tree  826  0  174  5  0  995 
min AIC forest  1000  0  0  94  897  9 
min BIC forest  995  5  0  0  1000  0 
In case (i), the ML tree algorithm incorrectly identifies about 17% of time; otherwise it correctly identifies . Penalizing with AIC or BIC increases the success frequencies to almost 100%. In case (ii) the true model is a forest rather than a tree, so the ML tree algorithm cannot select it. Note that it almost always selects : since 2I_{b, c}~ and 2I_{a, c}~ , the former is almost always greater than the latter. Penalizing using AIC and BIC increases the success frequencies to 90% and 100%, respectively. For insight into the relative performance of AIC and BIC in this example, see [18].
Extension to mixed discrete and Gaussian data
The second extension we consider is to data with both discrete and Gaussian variables. Our approach uses the class of undirected mixed graphical models [21–23]. Consider a data set with N observations of p discrete random variables X = (X_{1},... X_{ p }), and q continuous random variables Y = (Y_{1},... Y_{ q }). The models are based on the conditional Gaussian distribution, that is to say, the conditional distribution of Y given X = x is multivariate Gaussian with mean, and possibly also variance, depending on x. Models in which the variance depends on x are termed heterogenous, otherwise, they are called homogeneous.
A forbidden path is a path between two nonadjacent discrete vertices passing through continuous vertices. Since trees and forests are acyclic, they are triangulated, and since they contain at most one path between any two vertices, we can simplify the criterion as follows: A tree or forest dependence model is strongly decomposable if and only if it contains no path between discrete vertices passing through continuous vertices. We call such a tree (or forest) an SDtree (or SDforest). In an SDtree the discrete vertices induce a connected subgraph.
To apply the algorithm we need to derive the mutual information between a discrete variable X_{ u }and a continuous variable Y_{ v }. The marginal model is a simple ANOVA model (section 4.1.7 of [21]). Let s_{0} = ∑_{ k }(y^{(k)} )^{2}/N, and write the sample cell counts, means and variances as . In the homogeneous case, the mutual information is I_{u, v}= N ln(s_{0}/s)/2, where . There are k = X_{ u }  1 degrees of freedom. In the heterogeneous case, the mutual information is I_{u, v}= N ln(s_{0})/2  , with k = 2(X_{ u }  1) degrees of freedom. The expressions given here assume that all parameters are estimable: when this is not so, they need to be modified slightly, but we omit the details.
We also need to modify Kruskal's algorithm. As before an undirected graph with positive weights is given. Starting with the null graph, we repeatedly add the edge with the largest weight that does not form a cycle or a forbidden path. It is shown below that this returns the maximum weight SDforest.
About the forbidden path restriction
where pa(x_{ u }) denotes the parents (here, parent) of x_{ u }in the DAG. Models corresponding to the DAG are constructed by specifying a marginal distribution Pr(x_{ r }) and a series of conditional models for Pr(x_{ u }pa(x_{ u })).
First consider the pure case, that is, when all variables are either discrete or continuous. In the discrete case, we can construct a model for the DAG by specifying a multinomial distribution for X_{ r }and arrays of transition probabilities for the conditional models. In the continuous case, X_{ r }is Gaussian and the conditional models are simple linear regressions. When X_{ u }and X_{ v }are both discrete or both continuous, the mutual information I_{u, v}is symmetric, and is consistent with the conditional models for both Pr(x_{ v }x_{ u }) and Pr(x_{ u }x_{ v }). It follows that a DAG model in the Markov equivalence class is essentially a reparametrization of the tree model, and so has the same maximized likelihood and penalized likelihood scores. So in the pure case the algorithm identifies a Markov equivalence class of DAGs, just like other Bayesian network selection algorithms. Note that the search space is restricted to singleparent DAGs.
In the mixed case, however, the mutual information between a discrete X_{ u }and a continuous X_{ v }is asymmetric, and corresponds to an ANOVAtype conditional model for Pr(x_{ v }x_{ u }) but not for Pr(x_{ u }x_{ v }). So a DAG model in the Markov equivalence class is a reparametrization of the tree model only if the DAG contains no edges pointing from continuous to discrete nodes. If the tree has a forbidden path, no such DAG will exist: see for example Figure 2(iii). If the tree has no forbidden paths, then a DAG generated in the above way will have this property if and only if its root is discrete. So in the mixed case the algorithm identifies a subset of a Markov equivalence class of DAGs, those generated using discrete roots. That only a subset is identified is due to a limitation of the model apparatus, not to any evidence in the data. The limitation is unproblematic provided that the discrete variables are prior to the continuous variables.
All this has two broad implications. The first is that, when interpreted causally, the tree and forest models allow at most one determinant of each variable. The second is that the approach implicitly assumes that discrete variables are prior to continuous ones.
A marginality property
In some cases the global optimality of the selected model holds under marginalization. The following result is shown below in the methods section. Suppose that is the maximum likelihood tree (or minimal AIC or BIC forest) for a variable set V and let the connected components of be C_{1},... C_{ k }, say. Then (the marginal subgraph induced by A ⊆ V) is the maximum likelihood tree (respectively, minimal AIC or BIC forest) for the variable set A provided that is connected, for each component C_{ i }.
For example, consider a genetics of gene expression study involving a set of discrete DNA markers Δ and a set of continuous gene expression variables Γ. A central tenet is that DNA can affect gene expression but not vice versa. Suppose that the minimal AIC/BIC forest for V = (Δ, Γ) is . The forbidden path restriction implies that for each connected component C_{ i }of , is connected. Hence is the minimal AIC/BIC forest for the discrete data alone. It follows that can be regarded as a chain graph model [22] with two blocks, Δ and Γ, with Δ prior to Γ, consistent with the tenet.
Some applications of the algorithm
We show the results of applying the algorithm to three datasets.
Study of leucineresponsive protein (Lrp) in E. coli
The first dataset stems from a previously reported gene expression study [29]. The stated purpose of this was to identify the network of genes that are differentially regulated by the global E. coli transcription factor, leucineresponsive regulatory protein (Lrp), during steady state growth in a glucose supplemented minimal salts medium. Lrp has been reported to affect the expression of approximately 236 genes [30]. Gene expression in two E. coli bacteria strains, labelled lrp+ and lrp, were compared using eight Affymetrix ecoli chips. The lrp+ strain is the control or wild type, and the lrp strain is the experimental type, with the Lrp gene knockedout. Four chips were hybridized with RNA from the lrp+ strain, and four chips with RNA from the lrp strain. The raw data were preprocessed using standard methods and the algorithm applied to the derived data. The dataset had N = 8 observations and 7313 variables, comprising 7312 continuous variables (the logtransformed gene expression values) and one discrete variable, strain.
Our implementation of the algorithm (see below) took about 2 minutes on a laptop running Windows XP to find the minimal BIC forest. This is too large to display here, so instead we examine an interesting subgraph.
The regulatory system of E. coli has been wellstudied, and it is interesting to note that other studies confirm that serA and gltD are targets of Lrp [30, 32]. Indeed, Lrp has many targets: 138 Lrpbinding sites have been identified [30], so it is certainly not true that Lrp only targets serA and gltD. We have not been able to find other reports that the five distant genes  ndk, pnt, ptsG, nupG and atpG  should be directly or indirectly regulated by Lrp.
The minimal BIC forest provides a provisional causal model for the effect of Lrp, and in this sense more directly addresses the stated goal of the study than a conventional analysis of differential expression. However, given the small number of observations in the study, it is clear that the network identification and any interpretations based on this are highly uncertain.
Gene expression profiling in breast cancer patients
The second dataset comes from another gene expression study [33], whose purpose was to compare the gene expression profiles in tumours taken from two groups of breast cancer patient, those with and those without a mutation in the p53 tumour suppression gene. A dataset containing a subset of the study data is supplied along with the R library gRbase. The dataset has N = 250 observations and 1001 variables, comprising 1000 continuous variables (the logtransformed gene expression values) and the class variable. There are 58 cases (with a p53 mutation) and 192 controls (without the mutation). The gene expression variables were filtered from a larger set, and all exhibit differential expression between the two groups. They have been standardized to zero mean and unit variance, but since the mixed graphical models used here are location and scale invariant, this does not affect the analysis.
Genetics of gene expression using HapMap data
The third dataset comes from a large multinational project to study human genetic variation, the HapMap project http://www.hapmap.org/. The dataset concerns a sample of 90 Utah residents with northern and western European ancestry, the socalled CEU population, and contains information on genetic variants and gene expression values for this sample. The subjects are not unrelated (they comprise parentsibling trios), but the analysis ignores this. The genetic variants are SNPs (single nucleotide polymorphisms). Datasets containing both genomic and gene expression data enable study of the the genetic basis for differences in gene expression. This dataset is supplied along with the R library GGtools.
The main component of the SDforest consists of a large connected block of SNPs, attached to most of the gene expression nodes via SNP number 87 at the bottom of the figure. There are also 30 or so gene expression nodes adjacent to the SNPs as singletons, and a component of nine gene expression variables connected to SNP number 54 in the centre of the graph. SNP number 130 is possibly a gene expression hotspot and there are several potential hub nodes among the gene expression values.
The SDforest does not allow study of the joint effect of SNPs on gene expression values since, as we have seen, in trees and forests variables may have most one determinant. The minimal BIC forest obtained can be regarded as a special case of a chain graph model with two blocks, with the SNP data in the first block and transcript abundance data in the second block, as mentioned above. This framework would be wellsuited for further analysis of the data, allowing study of the joint action of SNPs on gene expression values.
Discussion
Deriving networks from highdimensional data is a key challenge in many disciplines, and many different approaches have been proposed: for example, using approximation techniques [34] or loworder conditional independence tests [35, 36]. One broad approach is to consider restricted classes of graphs, for example triangulated graphs [37], interval graphs [38] and others mentioned above, for which faster algorithms can be applied. The ChowLiu algorithm falls into this class. Its utility is due to its remarkable computational efficiency, which reflects the simplicity of the graphs used. At the other end of the spectrum, it has been shown that selecting general Bayesian networks by maximizing a score function is NPhard [39].
In this paper we have described some simple extensions to Chow and Liu's method that enable forests with minimal AIC or BIC to be found, and allow datasets with both discrete and Gaussian variables to be handled. In the previous section we demonstrated that useful insights into various highdimensional datasets may be obtained by this method.
Trees and forests are too simple to be realistic models of complex biological systems. Nevertheless we believe that they can give a preliminary understanding of the overall dependence structure, and can be put to a number of practical uses.
Firstly, we can use the selected model as a start model in a search algorithm based on richer, but more computationally demanding, model classes. Since trees are triangulated, the class of (strongly) decomposable models is a natural choice for highdimensional data. As described above, trees and forests represent Markov equivalence classes of DAGs, so the minimal AIC/BIC forest can also be used as start model in Bayesian network search procedures.
Secondly, we can regard properties of the selected model as proxies for corresponding properties of the true, underlying network. Properties that can be used in this way include connectivity, path length and degree. Provided we can assume that the data are generated by a joint undirected model, we can model the connected components of the selected forest separately. This may allow substantial dimension reduction. It is natural to use the selected forest to identify neighborhoods of interesting variables for more detailed analysis: in effect, this uses path length in the forest as a proxy for minimum path length in the unknown true network. Similarly, we can identify interesting features such as hub nodes  nodes of high degree  that may play a special role in the true network.
Recently there has been interest in network motifs  patterns of interconnections between small numbers of nodes that occur significantly more often than could be expected by chance [40]. For a review of motif discovery algorithms, see [41]. Many of these motifs, such as the feedforward or biparallel motifs, will not appear in trees due to the singleparent restriction discussed above. For this reason trees and forests appear to be too restrictive for motif discovery.
As pointed out by a referee, there are some similarities between the ChowLiu algorithm and the ARACNE algorithm [42]. Like the ChowLiu algorithm, this algorithm initially computes the mutual information quantities I_{u, v}for all node pairs (although ARACNE uses the Gaussian kernel method of [43]). It forms an initial graph by including all edges for which the I_{u, v}exceeds a given threshold. The dataprocessing inequality states that if X_{ u }and X_{ w }are conditionally independent given X_{ v }, then I_{u, w}< min(I_{u, v}, I_{v, w}). This is used to prune all complete triplets in , that is, all triplets X_{ u }, X_{ v }, X_{ w }with all three edges present in , by removing the edge with the least mutual information. Since the condition given in the dataprocessing inequality is sufficient but not necessary, that the inequality holds does not imply that the condition is true, and the authors acknowledge that the process may incorrectly remove edges.
Nevertheless the heuristic is reported to perform well when the true graph is a tree or is treelike [42].
Although mixed graphical models have been studied for some time [21–23], their adoption by the machine learning community seems to have been limited. As illustrated above, some natural application areas include comparative microarray studies, to model the effect of an intervention or class variable on gene expression, and genetics of gene expression studies, involving both discrete DNA markers (SNPs) and continuous responses (gene expression values). In both cases the discrete variables are clearly prior to the continuous variables. The conditional Gaussian assumption is a distributional assumption that is not necessarily fulfilled for all continuous variables; but logtransformed gene expression values have been found to be approximately Gaussian, and this assumption provides the basis for conventional analyses of differential expression.
An attractive aspect of the algorithm is that it allows different measures of mutual information to be used  for example, measures based on specific genetic models. However, we consider it a key advantage of the models described here that they are embedded in a broader class of models for more general dependence structures, which provides an inferential framework for systematic model diagnostics and development.
Conclusion
The approach is generally useful as a preliminary step towards understanding the overall dependence structure of highdimensional discrete and/or continuous data. Trees and forests are unrealistically simple models for biological systems, but can nevertheless provide useful insights. In microarray studies the method supplements lists of differentially regulated genes, by suggesting a possible network of interrelationsships between these. Other uses include the following: identification of distinct connected components, which can be analysed separately (dimension reduction); identification of neighbourhoods for more detailed analyses; as initial models for search algorithms with a larger search space, for example decomposable models or Bayesian networks; and identification of interesting features, such as hub nodes.
Methods
Modifying Kruskal's algorithm to find the maximum weight spanning SDforest
We take as given an undirected graph = (V, ) with positive edge weights, whose vertices are marked as either discrete and or continuous. We assume that the weights are distinct so that there is a unique spanning SDforest with maximum weight. We consider the following modification of Kruskal's algorithm.
Starting with the null graph, repeatedly add the edge with the largest weight that does not form a cycle or a forbidden path. We claim that this finds the maximum weight SDforest.
To prove this, let T = (V, E_{ T }) be the maximum weight spanning SDforest, and let the edges chosen by the algorithm be a_{1} ... a_{ k }. Let A_{ i }= (V, E_{ i }) be the SDforest consisting of edges a_{1} ... a_{ i }, so that E_{ i }= ∪_{1≤j≤i}{a_{ j }}. Suppose that T ≠ A_{ k }. Then either or both of (i) E_{ k }⊈ E_{ T }and (ii) E_{ T }⊈ E_{ k }must hold.
Suppose that (i) holds, and let a_{ i }be the first edge of A_{ k }which is not in E_{ T }. The addition of a_{ i }to T must result in a cycle or a forbidden path. Let a_{ i }= (u, v) and let the connected components (trees) of T containing u and v be S_{ u }and S_{ v }.
Suppose first that S_{ u }≠ S_{ v }. Addition of an edge between distinct components cannot create a cycle, but may create a forbidden path. Addition of an edge between discrete vertices cannot create a forbidden path, so one or both of u and v must be continuous. Suppose that u is discrete and v is continuous. Then (V, E_{ T }∪ a_{ i }) contains a unique forbidden path of the form u, v, v_{1} ... v_{ m }, w for some m ≥ 0 where v_{1} ... v_{ m }are continuous and w is discrete. It is unique because the existence of two such paths would imply the existence in S_{ v }of a cycle (if the paths have the same w) or a forbidden path (if they have different w's). Since A_{ i }is an SDforest at least one edge in this path, say e, must be absent from A_{ i }. Then (V, E_{i1}∪ e) is a SDforest since it is contained in T. So the weight of e must be less than that of a_{ i }. Consider (V, E_{ T }\e). The removal of e from S_{ v }results in two subtrees, the one with v containing continuous vertices only. Hence (V, E_{ T }∪ a_{ i }\e) is an SDforest. But the weight of (V, E_{ T }∪ a_{ i }\e) is greater than that of T, contradicting the definition of T. The proof when both u and v are continuous is similar.
Suppose now that S_{ u }= S_{ v }. Then (V, E_{ T }∪ a_{ i }) contains exactly one cycle, and may also contain a forbidden path. The cycle must contain a_{ i }and also some edge e which is not in A_{ k }. Then (V, E_{ T }∪ a_{ i }\e) is a forest. Suppose that (V, E_{ T }∪ a_{ i }) contains no forbidden path. Then (V, E_{ T }∪ a_{ i }\e) is an SDforest. Since (V, E_{i1}∪ e) is contained in T, it is an SDforest, so the weight of e is less than that of a_{ i }. But then the weight of (V, E_{ T }∪ a_{ i }\e) is greater than that of T, contradicting the definition of T.
Suppose now that (V, E_{ T }∪ a_{ i }) contains a forbidden path, and let a_{ i }= (u, v). Suppose that u is discrete and v continuous. Then (V, E_{ T }∪ a_{ i }) contains a unique forbidden path of the form u, v, v_{1} ... v_{ m }, w for some m ≥ 0 where v_{1} ... v_{ m }are continuous and w is discrete. Let w, w_{1} ... w_{ n }, u for some n ≥ 0 be the unique path in S_{ u }between w and u. Since S_{ u }is an SDtree w_{1} ... w_{ m }are discrete. Then the unique cycle in (V, E_{ T }∪ a_{ i }) takes the form u, v, v_{1} ... v_{ m }, w, w_{1} ... w_{ n }, u. Since A_{ i }is an SDforest at least one edge in the path u, v, v_{1} ... v_{ m }, w, say e, must be absent from A_{ i }. Removal of e from (V, E_{ T }∪ a_{ i }) breaks the cycle and the forbidden path, so (V, E_{ T }∪ a_{ i }\e) is an SDforest. As before the weight of e is less than that of a_{ i }, so the weight of (V, E_{ T }∪ a_{ i }\e) is greater than that of T, contradicting the definition of T. The proof when both u and v are continuous is similar.
Hence E_{ k }⊆ E_{ T }.
Suppose now that (ii) holds. But any edge e ∈ E_{ T }\E_{ k }would give rise to a cycle or a forbidden path if added to E_{ k }. Since E_{ k }⊆ E_{ T }this implies that T contains a cycle or forbidden path, contradicting its definition. It follows that E_{ T }⊆ E_{ k }and hence T = A_{ k }as required.
Two theoretical properties of minimal AIC or BIC forests
In this section we prove the two theoretical properties of the selected models discussed above.
Firstly, suppose that we apply the algorithm to find the minimal AIC or BIC forest, say . Then the connected components of are identical to the connected components of the minimal AIC/BIC strongly decomposable model. To see this, consider the connected components (that is, trees) of . Then any intercomponent edge either corresponds to a negative penalized mutual information or would generate a forbidden path (since adding such an edge cannot form a cycle).
Suppose that we construct a global model * by using the strongly decomposable model with minimal AIC/BIC for each connected component of . It follows from decomposition properties of undirected graphical models [22] that adding an intercomponent edge to * would result in the same change in AIC/BIC as when added to . Furthermore, if adding such an edge to would generate a forbidden path it would do the same when added to *. So * is, at least locally, a minimal AIC/BIC strongly decomposable model.
Secondly, in some cases the global optimality of the selected model holds under marginalization. That is to say, if is the maximum likelihood tree (or minimal AIC or BIC forest) for a variable set V, then for some variable subsets A ⊆ V, the induced marginal subgraph of on A, written , is the maximum likelihood tree (respectively, minimal AIC or BIC forest) for the variable set A. It is useful to characterize precisely the sets A for which this property holds in general.
Suppose initially that is connected, that is, a tree. We claim that the property holds precisely for those sets A for which is connected. Write = (A, E_{ A }) and consider application of the algorithm to A, that is, to the subset of the (possibly penalized) mutual information quantities that pertain to A. Suppose that this generates the graph ℋ = (A, E*). We need to show that when the algorithm is applied to V, the inclusion of an edge between vertices in A cannot create a cycle or forbidden path involving edges not in A. If this occurs during the course of the algorithm, it will also occur when added to , so it is sufficient to consider . If is connected then precisely one vertex in each connected component of is adjacent to precisely one vertex of . So clearly the addition of an edge in A cannot create a cycle with edges not in A. Suppose it creates a forbidden path involving vertices not in A. This must link two discrete variables, say u and v, in distinct connected components of . Since is an SDtree, all vertices in the unique path between the two vertices in must be discrete. This path must include the two vertices, say w and x, that are adjacent to a vertex in the connected components. If inclusion of an edge in A creates a forbidden path between u and v, then this must pass through w and x. But then the forbidden path lies in A, contrary to assumption. It follows that . Conversely, if is not connected but is, the inclusion of intercomponent edges may give rise to cycles when the algorithm is applied to V but not when it is applied to A. Hence in general ℋ and will differ.
When the minimal AIC or BIC variants of the algorithm are used, may be a forest. Let the connected components of be C_{1},... C_{ k }, say. Using a similar logic we obtain that is the minimal AIC (or BIC) forest for the variable set A provided that is connected, for each i.
Availability
The analyses were performed using the R library gRapHD which we have made available to the R community via the CRAN repository (de Abreu GCG, Labouriau R, Edwards D: Highdimensional Graphical Model Search with gRapHD R package, submitted to J. Stat. Software).
Declarations
Acknowledgements
DE was supported by the Danish National Advanced Technology Foundation through the ILSORM project. GCGA was financed by SABRETRAIN Project, funded by the Marie Curie Host Fellowships for Early Stage Research Training, as part of the 6th Framework Programme of the European Commission. RL was partially supported by the project "Metabolic programming in Foetal Life", Danish Research Agency, Ministry of Science Technology and Innovation.
Authors’ Affiliations
References
 Friedman N: Inferring cellular networks using probabilistic graphical models. Science 2004, 303(5659):799–805. 10.1126/science.1094068View ArticlePubMedGoogle Scholar
 Larrañaga P, Inza I, Flores J: A Guide to the Literature on Inferring Genetic Networks by Probabilistic Graphical Models. In Data Analysis and Visualization in Genomics and Proteomics. Edited by: Azuaje F, Dopazo J. Wiley, New York; 2005:215–238. full_textView ArticleGoogle Scholar
 AndradeCetto L, Manolakos E: A Graphical Model Formulation of the DNA BaseCalling Problem. Proc. IEEE Workshop on Machine Learning for Signal Processing 2005, 369–374. full_textGoogle Scholar
 Chow C, Liu C: Approximating discrete probability distributions with dependence trees. Information Theory, IEEE Transactions 1968, 14(3):462–467. 10.1109/TIT.1968.1054142View ArticleGoogle Scholar
 Kruskal J: On the Shortest Spanning Subtree of a Graph and the Traveling Salesman Problem. Proc Am Math Soc 1956, 7: 48–50. 10.2307/2033241View ArticleGoogle Scholar
 Meila M: An accelerated Chow and Liu algorithm: fitting tree distributions to high dimensional sparse data. Proceedings of the 16th International Conference on Machine Learning 1999.Google Scholar
 Pelleg D, Moore A: Dependency Trees in sublinear time and bounded memory. The International Journal on Very Large Databases 2006, 15: 250–262. 10.1007/s0077800501708View ArticleGoogle Scholar
 Bach F, Jordan M: Thin Junction Trees. In Advances in Neural Information Processing Systems. Volume 14. Edited by: Dietterich TG, GZ Becker S. Cambridge, MA: MIT Press; 2002:569–576.Google Scholar
 Ouerd M, Oommen B, Matwin S: A formal approach to using data distributions for building causal polytree structures. Information Sciences 2004, 111–132.Google Scholar
 Srebro N: Maximum likelihood bounded treewidth Markov networks. Artificial Intelligence 2003, 143: 123–138. 10.1016/S00043702(02)003600View ArticleGoogle Scholar
 Meila M, Jordan MI: Learning with mixtures of trees. J Mach Learn Res 2001, 1: 1–48. 10.1162/153244301753344605Google Scholar
 Sudderth E, Sudderth E, Wainwright M, Willsky A: Embedded trees: estimation of Gaussian Processes on graphs with cycles. IEEE Transactions on Signal Processing 2004, 52(11):3136–3150. 10.1109/TSP.2004.836539View ArticleGoogle Scholar
 Kirshner S, Smyth P, Robertson AW: Conditional ChowLiu tree structures for modeling discretevalued vector time series. In AUAI '04: Proceedings of the 20th conference on Uncertainty in artificial intelligence. Arlington, Virginia, United States: AUAI Press; 2004:317–324.Google Scholar
 Chou C, Wagner T: Consistency of an estimate of treedependent probability distribution. IEEE Transactions on Information Theory 1973, IT19: 369–371. 10.1109/TIT.1973.1055013View ArticleGoogle Scholar
 Rissanen J: Stochastic Complexity. J Royal Stat Soc B 1987, 49: 223–239.Google Scholar
 Akaike H: A new look at the statistical identification problem. IEEE Trans Auto Control 1974, 19: 716–723. 10.1109/TAC.1974.1100705View ArticleGoogle Scholar
 Schwarz G: Estimating the Dimension of a Model. Annals of Statistics 1978, 6: 461–464. 10.1214/aos/1176344136View ArticleGoogle Scholar
 Burnham KP, Anderson DR: Multimodel Inference: Understanding AIC and BIC in Model Selection. Sociological Methods Research 2004, 33: 261–304. 10.1177/0049124104268644View ArticleGoogle Scholar
 Liang P, Srebro N: Methods and experiments with bounded treewidth Markov networks. In Tech rep. MIT; 2004.Google Scholar
 Panayidou K: Estimation of Tree Structure for Variable Selection. PhD thesis. University of Oxford; to appear. to appear.
 Edwards D: Introduction to Graphical Modelling. second edition. New York: SpringerVerlag; 2000.View ArticleGoogle Scholar
 Lauritzen SL: Graphical Models. Oxford: Clarendon Press; 1996.Google Scholar
 Lauritzen S, Wermuth N: Graphical models for associations between variables, some of which are qualitative and some quantitative. Ann Statist 1989, 17: 31–57. 10.1214/aos/1176347003View ArticleGoogle Scholar
 Frydenberg M, Lauritzen S: Decomposition of maximum likelihood in mixed graphical interaction models. Biometrika 1989, 76: 539–555. 10.1093/biomet/76.3.539View ArticleGoogle Scholar
 Verma T, Pearl J: Equivalence and synthesis of causal models. In UAI '90: Proceedings of the Sixth Annual Conference on Uncertainty in Artificial Intelligence. New York, NY, USA: Elsevier Science Inc; 1991:255–270.Google Scholar
 Andersson SA, Madigan D, Perlman MD: On the Markov Equivalence of Chain Graphs, Undirected Graphs, and Acyclic Digraphs. Scandinavian Journal of Statistics 1997, 24: 81–102. 10.1111/14679469.t01100050View ArticleGoogle Scholar
 Spirtes P, Glymour C, Scheines R: Causation, Prediction and Search. New York. 1993. [Reprinted by MIT Press]. [Reprinted by MIT Press].View ArticleGoogle Scholar
 Heckerman D, Geiger D, Chickering DM: Learning Bayesian Networks: The combination of knowledge and statistical data. Machine Learning 1995, 20: 197–243.Google Scholar
 Hung S, Baldi P, Hatfield G: Global Gene Expression Profiling in Escherichia coli K12. Journal of Biological Chemistry 2002, 477: 40309–40323. 10.1074/jbc.M204044200View ArticleGoogle Scholar
 Cho BK, Barrett CL, Knight EM, Park YS, Palsson B: Genomescale reconstruction of the Lrp regulatory network in Escherichia coli. Proc Natl Acad Sci USA 2008, 105(49):19462–19467. 10.1073/pnas.0807227105View ArticlePubMedPubMed CentralGoogle Scholar
 Smyth GK: Linear models and empirical bayes methods for assessing differential expression in microarray experiments. Stat Appl Genet Mol Biol 2004, 3: Article3.PubMedGoogle Scholar
 Ligi P, Blumenthal R, Matthews R: Activation from a Distance: Roles of Lrp and Integration Host Factor in Transcriptional Activation of gltBDF. Journal of Bacteriology 2001, 183: 3910–3918. 10.1128/JB.183.13.39103918.2001View ArticleGoogle Scholar
 Miller LD, Smeds J, George J, Vega VB, Vergara L, Ploner A, Pawitan Y, Hall P, Klaar S, Liu ET, Bergh J: An expression signature for p53 status in human breast cancer predicts mutation status, transcriptional effects, and patient survival. Proc Natl Acad Sci USA 2005, 102(38):13550–13555. 10.1073/pnas.0506230102View ArticlePubMedPubMed CentralGoogle Scholar
 Friedman J, Hastie T, Tibshirani R: Sparse inverse covariance estimation with the graphical lasso. Biostatistics 2008, 9(3):432–441. 10.1093/biostatistics/kxm045View ArticlePubMedPubMed CentralGoogle Scholar
 Kalisch M, Buhlmann P: Estimating Highdimensional Directed Acyclic Graphs with the PCAlgorithm. Journal of Machine Learning Research 2007, 8: 613–636.Google Scholar
 Castelo R, Roverato A: Reverse engineering molecular regulatory networks from microarray data with qpgraphs. J Comput Biol 2009, 16(2):213–227. 10.1089/cmb.2008.08TTView ArticlePubMedGoogle Scholar
 Wermuth N: Model Search among Multiplicative Models. Biometrics 1976, 32(2):253–263. 10.2307/2529496View ArticleGoogle Scholar
 Thomas A, Camp NJ: Graphical modeling of the joint distribution of alleles at associated loci. Am J Hum Genet 2004, 74(6):1088–1101. 10.1086/421249View ArticlePubMedPubMed CentralGoogle Scholar
 Chickering DM: Learning Bayesian networks is NPcomplete. In Learning from Data: Artificial Intelligence and Statistics V. New York Edited by: Fisher D, Lenz HJ. 1996, 121–130.View ArticleGoogle Scholar
 Milo R, ShenOrr S, Itzkovitz S, Kashtan N, Chkovskii D, Alon U: Network Motifs: Simple Building Blocks of Complex Networks. Science 2002, 298: 824–827. 10.1126/science.298.5594.824View ArticlePubMedGoogle Scholar
 Ciriello G, Guerra C: A review on models and algorithms for motif discovery in proteinprotein interaction networks. Briefings in Functional Genomics and Proteomics 2008, 7(2):147–156. 10.1093/bfgp/eln015View ArticlePubMedGoogle Scholar
 Margolin AA, Nemenman I, Basso K, Wiggins C, Stolovitzky G, Favera RD, Califano A: ARACNE: an algorithm for the reconstruction of gene regulatory networks in a mammalian cellular context. BMC Bioinformatics 2006, 7(Suppl 1):S7. 10.1186/147121057S1S7View ArticlePubMedPubMed CentralGoogle Scholar
 Steuer R, Kurths J, Daub CO, Weise J, Selbig J: The mutual information: detecting and evaluating dependencies between variables. Bioinformatics 2002, 18(Suppl 2):S231S240.View ArticlePubMedGoogle Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.