### Graphs and graphical models

The following notation and terminology are mainly based on [9]. A graph is defined as a pair \mathcal{G}=(V,E), where *V* is a set of *vertices* or *nodes* and *E* is a set of *edges*. Each edge is associated with a pair of nodes, its *endpoints*. Here only undirected graphs are considered, that is, graphs with undirected edges only. Two vertices *α* and *β* are said to be *adjacent*, written *α*∼*β*, if there is an edge between them. The *neighbours* of a vertex are the nodes that are adjacent to it. A subset *A*⊆*V* is *complete* if all vertex pairs in *A* are connected by an edge. A *clique* is a maximal complete subset, that is to say, a complete subset that is not contained in a larger complete subset.
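
These basic notions can be illustrated with a short Python sketch; the toy graph and the helper `is_complete` are invented for illustration and are not part of the paper's implementation:

```python
from itertools import combinations

# Undirected graph stored as vertex -> set of neighbours (toy example).
edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]
nbr = {}
for v, w in edges:
    nbr.setdefault(v, set()).add(w)
    nbr.setdefault(w, set()).add(v)

def is_complete(A):
    """A subset A is complete if every pair of its vertices is adjacent."""
    return all(w in nbr[v] for v, w in combinations(sorted(A), 2))

print(sorted(nbr["c"]))              # ['a', 'b', 'd']: the neighbours of c
print(is_complete({"a", "b", "c"}))  # True: a maximal complete subset, i.e. a clique
print(is_complete({"a", "b", "d"}))  # False: a and d are not adjacent
```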

A *path* (of length *n*) between vertices *α* and *β* in an undirected graph is a sequence of vertices \alpha ={\alpha}_{0},{\alpha}_{1},\dots ,{\alpha}_{n}=\beta where {\alpha}_{i-1}\sim {\alpha}_{i} for i=1,\dots ,n. If a path \alpha ={\alpha}_{0},{\alpha}_{1},\dots ,{\alpha}_{n}=\beta has *α*=*β* then the path is said to be a *cycle* of length *n*. If a cycle \alpha ={\alpha}_{0},{\alpha}_{1},\dots ,{\alpha}_{n}=\alpha has adjacent elements {\alpha}_{i}\sim {\alpha}_{j} with *j*∉{*i*−1,*i*+1} then it is said to have a *chord*. If it has no chords it is said to be *chordless*. A graph with no chordless cycles of length ≥4 is called *triangulated* or *chordal*.
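
The chord condition can be checked directly from the definition; `has_chord` below is a hypothetical helper, applied here to a toy 4-cycle:

```python
def has_chord(cycle, edges):
    """cycle: distinct vertices in cycle order; edges: set of frozenset pairs."""
    n = len(cycle)
    for i in range(n):
        for j in range(i + 2, n):
            if i == 0 and j == n - 1:
                continue  # consecutive around the wrap, not a chord
            if frozenset((cycle[i], cycle[j])) in edges:
                return True
    return False

square = ["a", "b", "c", "d"]  # the 4-cycle a-b-c-d-a
E = {frozenset(p) for p in [("a", "b"), ("b", "c"), ("c", "d"), ("d", "a")]}
print(has_chord(square, E))    # False: a chordless 4-cycle, so this graph is not chordal
E.add(frozenset(("a", "c")))   # adding the chord a-c triangulates the cycle
print(has_chord(square, E))    # True
```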

A subset *D*⊂*V* in an undirected graph is said to *separate* *A*⊂*V* from *B*⊂*V* if every path between a vertex in *A* and a vertex in *B* contains a vertex from *D*. The graph {\mathcal{G}}_{0}=({V}_{0},{E}_{0}) is said to be a *subgraph* of \mathcal{G}=(V,E) if *V*_{0}⊆*V* and *E*_{0}⊆*E*. For *A*⊆*V*, let {E}_{A} denote the set of edges in *E* between vertices in *A*. Then {\mathcal{G}}_{A}=(A,{E}_{A}) is the *subgraph induced by* *A*.

The *boundary* bd(*A*) of a vertex set *A*⊆*V* is the set of vertices adjacent to a vertex in *A* but not in *A*, that is,

\text{bd}(A)=\{v\in V:v\sim w\phantom{\rule{.5em}{0ex}}\text{for some}\phantom{\rule{.5em}{0ex}}w\in A\}\setminus A.
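
This definition translates directly into code; the following Python sketch uses an invented edge list:

```python
# bd(A): vertices outside A that are adjacent to some vertex in A.
edges = [(1, 2), (2, 3), (3, 4), (4, 5), (2, 4)]
nbr = {}
for v, w in edges:
    nbr.setdefault(v, set()).add(w)
    nbr.setdefault(w, set()).add(v)

def bd(A):
    return {v for w in A for v in nbr.get(w, set())} - set(A)

print(sorted(bd({2, 3})))  # [1, 4]
```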

Let \mathcal{G}=(V,E) be an undirected graph with cliques {C}_{1},\dots {C}_{k}. Consider a joint density *f*() of the variables in *V*. If this admits a factorization of the form

f({x}_{V})=\prod _{i=1}^{k}{g}_{i}({x}_{{C}_{i}})

for some functions {g}_{1}(),\dots ,{g}_{k}() where {g}_{j}() depends on *x* only through {x}_{{C}_{j}}, then *f*() is said to factorize according to \mathcal{G}. If all the densities in a model factorize according to \mathcal{G}, then the model is said to be \mathcal{G}-Markov. When this is true, \mathcal{G} encodes the conditional independence structure of the model through the following result (the *global Markov property*): whenever sets *A* and *B* are separated by a set *C* in \mathcal{G}, *A* and *B* are conditionally independent given *C* under the model. This is written as *A*⊥⊥*B* | *C*. A *decomposable* graphical model is one whose dependence graph is triangulated.

### Selecting graphical models for chromosome-wide LD

Suppose that *N* observations of *p* SNPs from the same chromosome are sampled from some population. The variable set is written V=({v}_{1},\dots ,{v}_{p}), and it is assumed that these are ordered by physical position on the chromosome. The variables may be either observed genotypes or, if these have been imputed, inferred haplotypes: the former are trichotomous and the latter binary. Here an algorithm is described that uses these data to select a graphical model for the distribution of *V*. It is based on a penalized likelihood criterion

\mathit{\text{IC}}(\mathcal{G})=-2{\widehat{\ell}}_{\mathcal{G}}+\alpha \phantom{\rule{2pt}{0ex}}dim(\mathcal{G})

(1)

where {\widehat{\ell}}_{\mathcal{G}} is the maximized log-likelihood under \mathcal{G}, dim(\mathcal{G}) is the number of free parameters under \mathcal{G}, and *α* is a penalization constant. For the AIC, *α*=2 and for the BIC, *α*= log(*N*). The latter penalizes complex models more heavily and so selects simpler models. Under suitable regularity conditions, the BIC is consistent in the sense that for large *N* it will select the simplest model consistent with the data ([25], Sect. 2.6).
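
As a sketch, with invented values for the maximized log-likelihood, model dimension and sample size:

```python
from math import log

def ic(loglik, dim, alpha):
    """Penalized likelihood criterion IC(G) = -2*loglik + alpha*dim(G)."""
    return -2.0 * loglik + alpha * dim

loglik, dim, N = -1234.5, 40, 500        # hypothetical fitted model
aic = ic(loglik, dim, alpha=2.0)         # AIC: alpha = 2
bic = ic(loglik, dim, alpha=log(N))      # BIC: alpha = log(N)
print(aic)        # 2549.0
print(bic > aic)  # True: log(500) > 2, so BIC penalizes the 40 parameters more
```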

A technical but nonetheless important point is that calculation of the correct model dimension is not straightforward for high-dimensional models, since not all parameters may be estimable. A simple expression exists however for the difference in model dimension between two decomposable models that differ by a single edge ([10], pp. 37-40). This is useful when the search space is restricted to decomposable models and the search algorithm only involves comparison of models differing by single edges (as here).

A forward-backward approach to estimate the graphical model for *V* is used. The first (forward) step is based on a greedy forward search algorithm. To take account of the physical ordering of the SNPs, the algorithm starts from a model {\mathcal{G}}_{0}=(V,{E}_{0}) where *E*_{0} is the set of edges between physically adjacent SNPs: this model is called the *skeleton*. To seek the minimum BIC (or AIC) model, the algorithm repeatedly adds the edge associated with the greatest reduction in BIC (AIC); only edges whose inclusion results in a decomposable model are considered. The process continues until no more edges improve the criterion. The search space in this step is the class of decomposable models that include the skeleton. Note that, like almost all greedy algorithms, this algorithm is not guaranteed to find the global optimum.
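
The greedy loop of the forward step can be sketched as follows. Here `delta_bic` and `keeps_decomposable` are stand-ins for the real model scoring and decomposability check, and the toy score table is invented; this is not the gRapHD implementation:

```python
skeleton = {(1, 2), (2, 3), (3, 4), (4, 5)}   # edges between physically adjacent SNPs
candidates = {(1, 3), (2, 4), (3, 5), (1, 4)}
toy_delta = {(1, 3): -8.0, (2, 4): -3.5, (3, 5): 1.2, (1, 4): -0.5}

def delta_bic(edge, graph):
    return toy_delta[edge]      # stand-in: change in BIC from adding `edge`

def keeps_decomposable(edge, graph):
    return True                 # stand-in: real code must check chordality

graph = set(skeleton)
while True:
    legal = [e for e in candidates - graph if keeps_decomposable(e, graph)]
    if not legal:
        break
    best = min(legal, key=lambda e: delta_bic(e, graph))
    if delta_bic(best, graph) >= 0:
        break                   # no remaining edge improves the criterion
    graph.add(best)

print(sorted(graph - skeleton))  # [(1, 3), (1, 4), (2, 4)]: (3, 5) worsens the BIC
```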

There are several advantages to initially constraining the search space to include the skeleton. An unconstrained search starting from the null model (with no edges) would not take the physical ordering into account. Since the graphs considered are consistent with this ordering, they are conveniently displayed as LD maps, as illustrated below. Because decomposable models contain no chordless cycles of length ≥4, two distal loci cannot be linked by an edge unless there are sufficiently many intermediate (chordal) edges to prevent the occurrence of such cycles. Thus the algorithm proceeds by filling out around the skeleton, incrementally adding edges. The effect of this is that only local LD is modelled. Since both association and (implicitly) proximity are required, the included edges are more likely to reflect real association (that is, less likely to be due to chance), so in this sense the model selection process is more reliable. In addition, restricting the search space in this way improves computational efficiency. The linear-time algorithm described in the following section also builds on the physical vertex ordering.

In the second (backward) step the graph found in the first step is pruned, again using a greedy algorithm that seeks to minimize the BIC (AIC) criterion by removing edges, without requiring that the skeletal edges are retained. Keeping these in the model would be appropriate if adjacent SNPs are in high LD, but this is not always the case. For example, there may be recombination hotspots, or genome assembly errors may have led to errors in positioning the SNPs. The graphs may be used to examine whether this has occurred.

To display the resulting graph, a suitable graph layout is needed. After some experimentation, one such was obtained by modifying the horizontal dot layout in Rgraphviz [26], by exponentially smoothing the *y*-coordinates and replacing the *x*-coordinates with the SNP-number. In this way the *y*-coordinates are chosen so as to clarify the graphical structure: a consequence of this is that nodes with large absolute *y*-coordinates tend to signal genomic regions of high structural complexity.

For some purposes it is helpful to use more objective measures of complexity, and two such measures are used here. Consider a chromosome with *p* SNPs. For each i=1,\dots ,p-1, the *height* {h}_{i} of the interval between SNPs *i* and *i*+1 is defined to be the number of edges of the form (*j*,*k*), where *j*≤*i* and *i*+1≤*k*, and the *width* {w}_{i} is defined to be the maximum value of |*k*−*j*| over such edges. Note that by the global Markov property, when {h}_{i}=0 or equivalently {w}_{i}=0, the SNPs {V}_{\le i}=\{{v}_{j}:j\le i\} are independent of {V}_{>i}=\{{v}_{j}:j>i\}. Similarly, when {h}_{i}=1 or equivalently {w}_{i}=1, {V}_{<i}\perp \perp {V}_{>i}\phantom{\rule{0.3em}{0ex}}|\phantom{\rule{0.3em}{0ex}}{v}_{i}.
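
Both measures follow directly from the edge list, as in this Python sketch on an invented graph with *p*=6 SNPs:

```python
p = 6
edges = [(1, 2), (2, 3), (2, 4), (3, 4), (4, 5), (5, 6)]

h, w = {}, {}
for i in range(1, p):
    # edges (j, k) that span the interval between SNPs i and i+1
    spanning = [(j, k) for (j, k) in edges if j <= i and k >= i + 1]
    h[i] = len(spanning)
    w[i] = max((k - j for j, k in spanning), default=0)

print([h[i] for i in range(1, p)])  # [1, 2, 2, 1, 1]
print([w[i] for i in range(1, p)])  # [1, 2, 2, 1, 1]
# h_4 = w_4 = 1, so here v_1..v_3 are independent of v_5, v_6 given v_4
```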

As a measure of haplotype diversity, the *entropy* [27] is used here. Suppose that in some genomic interval there are *k* distinct haplotypes with relative frequencies {f}_{1},\dots ,{f}_{k}. Then the entropy is defined as H=-{\sum}_{i=1}^{k}{f}_{i}\phantom{\rule{0.2em}{0ex}}\log {f}_{i}. It is large when there are many distinct haplotypes that are equally frequent.
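
For example, using natural logarithms and an invented sample of eight haplotypes over a two-SNP interval:

```python
from collections import Counter
from math import log

haplotypes = ["01", "01", "10", "11", "01", "10", "11", "11"]
N = len(haplotypes)
freqs = [c / N for c in Counter(haplotypes).values()]  # 3/8, 2/8, 3/8
H = -sum(f * log(f) for f in freqs)
print(round(H, 4))  # 1.0822, just below the maximum log(3) for k = 3 haplotypes
```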

A useful way to display marker data over short genomic intervals is the *sample tree*[23]. This summarizes a set of discrete longitudinal data of the form {x}^{(v)}=({x}_{1}^{(v)},\dots {x}_{q}^{(v)}) for v=1\dots N. A rooted tree is constructed in which the non-root nodes represent partial outcomes of the form ({x}_{1},\dots {x}_{k}) for *k*≤*q* present in the sample data. Edges may be coloured with the marker value and labelled with the corresponding sample counts, or drawn with width proportional to the sample counts.

### A fast selection algorithm

For high-dimensional models, the first step of the algorithm described above can be computationally demanding. To address this, a much faster algorithm for the same task is now described. It involves combining a series of overlapping marginal models of lower dimension. Call the block length (the number of variables in each marginal model) *L* and the overlap *K*. Thus the first block is {V}_{1}=\{{v}_{1},\dots ,{v}_{L}\}, the second is {V}_{2}=\{{v}_{L-K+1},\dots ,{v}_{2L-K}\}, and so on. In the applications described here *L*=100 and *K*=20 are used.
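
The implied block boundaries can be sketched as follows; the helper `blocks` is hypothetical and assumes the final block is simply truncated at *p*:

```python
def blocks(p, L, K):
    """Start/end indices (1-based) of overlapping blocks of length L, overlap K."""
    out, start = [], 1
    while True:
        end = min(start + L - 1, p)
        out.append((start, end))
        if end == p:
            return out
        start += L - K

print(blocks(p=250, L=100, K=20))  # [(1, 100), (81, 180), (161, 250)]
```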

Suppose that the true model for V=({v}_{1},\dots ,{v}_{p}) is \mathcal{G}, and that \widehat{\mathcal{G}} is the estimate of \mathcal{G} obtained by performing the first step of the algorithm described in the previous section. The goal is to construct an approximation \stackrel{~}{\mathcal{G}} to \widehat{\mathcal{G}} by combining models {\widehat{\mathcal{G}}}_{i}=({V}_{i},{E}_{i}) obtained by applying that same algorithm to the blocks {V}_{i} for i=1,2,\dots \phantom{\rule{0.3em}{0ex}}.

A way to construct a model {\stackrel{~}{\mathcal{G}}}_{12} for *V*_{1}∪*V*_{2} by combining {\widehat{\mathcal{G}}}_{1} and {\widehat{\mathcal{G}}}_{2} is now described. Let {m}^{\ast}=max\{w:\exists \phantom{\rule{.5em}{0ex}}(v,w)\in {E}_{1}\phantom{\rule{.5em}{0ex}}\text{with}\phantom{\rule{.5em}{0ex}}v\le L-K\}. Then {\stackrel{~}{\mathcal{G}}}_{12} is defined as {\stackrel{~}{\mathcal{G}}}_{12}=({V}_{1}\cup {V}_{2},{E}_{1}^{0}\cup {E}_{2}^{0}), where {E}_{1}^{0}=\{(v,w)\in {E}_{1}:w\le {m}^{\ast}\} and {E}_{2}^{0}=\{(v,w)\in {E}_{2}:w>{m}^{\ast}\}.
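
On invented edge sets (with toy values *L*=10 and *K*=4 rather than the paper's 100 and 20), the rule works out as:

```python
L, K = 10, 4
E1 = {(1, 2), (2, 4), (5, 8), (6, 7), (8, 9)}        # edges of the model on v_1..v_10
E2 = {(7, 8), (8, 10), (9, 11), (11, 12), (12, 13)}  # edges of the model on v_7..v_13

# m* = largest right endpoint among E1 edges starting at or before L - K = 6
m_star = max(w for v, w in E1 if v <= L - K)
E12 = ({(v, w) for v, w in E1 if w <= m_star} |
       {(v, w) for v, w in E2 if w > m_star})
print(m_star)       # 8, from the edge (5, 8)
print(sorted(E12))  # E1 contributes edges up to v_8, E2 everything beyond
```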

The rationale for this is that marginal models may include spurious associations on the boundaries. For example, let {\mathcal{G}}_{1} and {\mathcal{G}}_{2} be the subgraphs of \mathcal{G} induced by *V*_{1} and *V*_{2}, respectively. Then the marginal distribution of *V*_{2} will not in general be {\mathcal{G}}_{2}-Markov, but it will be {\mathcal{G}}_{2}^{\ast}-Markov for a graph {\mathcal{G}}_{2}^{\ast} derived from {\mathcal{G}}_{2} by completing the boundary of each connected component of {\mathcal{G}}_{V\setminus {V}_{2}} in \mathcal{G} [28]. So {\widehat{\mathcal{G}}}_{2} will tend to contain edges not in {\mathcal{G}}_{2}. To estimate the boundary of *V*_{1}∖*V*_{2} in \mathcal{G}, {\widehat{\mathcal{G}}}_{1} is used: the boundary is estimated to be contained in the set \{L-K+1,\dots ,{m}^{\ast}\}. Hence by only including edges (v,w)\in {E}_{2} with *w*>*m*^{∗}, edges in this boundary are omitted. Similarly, the boundary of *V*_{2}∖*V*_{1} in {\mathcal{G}}_{1} may contain spurious associations, so {E}_{1}^{0} may contain unnecessary edges. To avoid this the overlap length *K* is chosen to be sufficiently large so that *m*^{∗}< min{*v*:∃ (*v*,*w*)∈*E*_{2} with *w*≥*L*+1}. If the maximum width were known in advance this inequality could be ensured by setting *K* to twice the maximum width, but so large a value appears to be unnecessary in practice.

The algorithm proceeds in the obvious fashion combining {\stackrel{~}{\mathcal{G}}}_{12} with {\widehat{\mathcal{G}}}_{3} and so on. Since the construction may result in chordless cycles it is necessary to triangulate the resulting graph \stackrel{~}{\mathcal{G}}. This can be done using the maximum cardinality search algorithm [29] which can be implemented in linear time.
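
For illustration, a compact (quadratic-time, for clarity) version of maximum cardinality search together with the associated chordality test might look as follows; this is a sketch, not the linear-time implementation of [29]:

```python
from itertools import combinations

def mcs_is_chordal(nbr):
    """nbr: vertex -> set of neighbours. MCS numbers vertices by repeatedly
    taking an unnumbered vertex with most numbered neighbours; the graph is
    chordal iff each vertex's earlier-numbered neighbours form a clique."""
    order, weight = [], {v: 0 for v in nbr}
    while weight:
        v = max(weight, key=weight.get)
        order.append(v)
        del weight[v]
        for u in nbr[v]:
            if u in weight:
                weight[u] += 1
    pos = {v: i for i, v in enumerate(order)}
    for v in order:
        earlier = [u for u in nbr[v] if pos[u] < pos[v]]
        if any(b not in nbr[a] for a, b in combinations(earlier, 2)):
            return False
    return True

square = {1: {2, 4}, 2: {1, 3}, 3: {2, 4}, 4: {1, 3}}  # chordless 4-cycle
print(mcs_is_chordal(square))                          # False
square[1].add(3); square[3].add(1)                     # add the chord 1-3
print(mcs_is_chordal(square))                          # True
```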

Assuming that the time for deriving the estimates {\widehat{\mathcal{G}}}_{i} for fixed *L* is bounded, the algorithm described here is by construction linear in *p*. But the proportionality constant can be expected to depend on various factors, such as the density of the minimum AIC/BIC model.

### Implementation

The methods are implemented in a set of R functions which are provided as Additional files 1 and 2 to the paper. The selection algorithms build on the forward search algorithm implemented in the stepw function in the gRapHD package [30]. Currently this function requires that there are no missing data. The backward search algorithm was developed by the author in the course of preparing this paper, and has subsequently been implemented in the gRapHD package. The functions to perform the selection algorithms, and others to work with and display the graphs, build on the following packages: igraph [31], graph [32], Rgraphviz [26] and gRapHD. The triangular heatmaps were produced using the LDheatmap package [33]. Simulation from a fitted decomposable model was carried out using the gRain package [34]. The rJPSGCS package provides an interface between R and the FitGMLD program, which was used to run the algorithm of [15].