### Graphs and graphical models

The following notation and terminology are mainly based on [9]. A graph is defined as a pair \mathcal{G}=(V,E), where *V* is a set of *vertices* or *nodes* and *E* is a set of *edges*. Each edge is associated with a pair of nodes, its *endpoints*. Here only undirected graphs are considered, that is, graphs with undirected edges only. Two vertices *α* and *β* are said to be *adjacent*, written *α*∼*β*, if there is an edge between them. The *neighbours* of a vertex are the nodes that are adjacent to it. A subset *A*⊆*V* is *complete* if all vertex pairs in *A* are connected by an edge. A *clique* is a maximal complete subset, that is to say, a complete subset that is not contained in a larger complete subset.
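
These basic notions can be illustrated with a short Python sketch; the toy graph and the helper `is_complete` are invented for illustration and are not part of the paper's implementation:

```python
from itertools import combinations

# Undirected graph stored as vertex -> set of neighbours (toy example).
edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]
nbr = {}
for v, w in edges:
    nbr.setdefault(v, set()).add(w)
    nbr.setdefault(w, set()).add(v)

def is_complete(A):
    """A subset A is complete if every pair of its vertices is adjacent."""
    return all(w in nbr[v] for v, w in combinations(sorted(A), 2))

print(sorted(nbr["c"]))              # ['a', 'b', 'd']: the neighbours of c
print(is_complete({"a", "b", "c"}))  # True: a maximal complete subset, i.e. a clique
print(is_complete({"a", "b", "d"}))  # False: a and d are not adjacent
```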

A *path* (of length *n*) between vertices *α* and *β* in an undirected graph is a sequence of vertices \alpha ={\alpha}_{0},{\alpha}_{1},\dots ,{\alpha}_{n}=\beta where {\alpha}_{i-1}\sim {\alpha}_{i} for i=1,\dots ,n. If a path \alpha ={\alpha}_{0},{\alpha}_{1},\dots ,{\alpha}_{n}=\beta has *α*=*β* then the path is said to be a *cycle* of length *n*. If a cycle \alpha ={\alpha}_{0},{\alpha}_{1},\dots ,{\alpha}_{n}=\alpha has adjacent elements {\alpha}_{i}\sim {\alpha}_{j} with *j*∉{*i*−1,*i*+1} then it is said to have a *chord*. If it has no chords it is said to be *chordless*. A graph with no chordless cycles of length ≥4 is called *triangulated* or *chordal*.
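
The chord condition can be checked directly from the definition; `has_chord` below is a hypothetical helper, applied here to a toy 4-cycle:

```python
def has_chord(cycle, edges):
    """cycle: distinct vertices in cycle order; edges: set of frozenset pairs."""
    n = len(cycle)
    for i in range(n):
        for j in range(i + 2, n):
            if i == 0 and j == n - 1:
                continue  # consecutive around the wrap, not a chord
            if frozenset((cycle[i], cycle[j])) in edges:
                return True
    return False

square = ["a", "b", "c", "d"]  # the 4-cycle a-b-c-d-a
E = {frozenset(p) for p in [("a", "b"), ("b", "c"), ("c", "d"), ("d", "a")]}
print(has_chord(square, E))    # False: a chordless 4-cycle, so this graph is not chordal
E.add(frozenset(("a", "c")))   # adding the chord a-c triangulates the cycle
print(has_chord(square, E))    # True
```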

A subset *D*⊂*V* in an undirected graph is said to *separate* *A*⊂*V* from *B*⊂*V* if every path between a vertex in *A* and a vertex in *B* contains a vertex from *D*. The graph {\mathcal{G}}_{0}=({V}_{0},{E}_{0}) is said to be a *subgraph* of \mathcal{G}=(V,E) if *V*_{0}⊆*V* and *E*_{0}⊆*E*. For *A*⊆*V*, let {E}_{A} denote the set of edges in *E* between vertices in *A*. Then {\mathcal{G}}_{A}=(A,{E}_{A}) is the *subgraph induced by* *A*.

The *boundary* bd(*A*) of a vertex set *A*⊆*V* is the set of vertices adjacent to a vertex in *A* but not in *A*, that is,

\text{bd}(A)=\{v\in V:v\sim w\phantom{\rule{.5em}{0ex}}\text{for some}\phantom{\rule{.5em}{0ex}}w\in A\}\setminus A.
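
This definition translates directly into code; the following Python sketch uses an invented edge list:

```python
# bd(A): vertices outside A that are adjacent to some vertex in A.
edges = [(1, 2), (2, 3), (3, 4), (4, 5), (2, 4)]
nbr = {}
for v, w in edges:
    nbr.setdefault(v, set()).add(w)
    nbr.setdefault(w, set()).add(v)

def bd(A):
    return {v for w in A for v in nbr.get(w, set())} - set(A)

print(sorted(bd({2, 3})))  # [1, 4]
```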

Let \mathcal{G}=(V,E) be an undirected graph with cliques {C}_{1},\dots {C}_{k}. Consider a joint density *f*() of the variables in *V*. If this admits a factorization of the form

f({x}_{V})=\prod _{i=1}^{k}{g}_{i}({x}_{{C}_{i}})

for some functions {g}_{1}(),\dots ,{g}_{k}() where {g}_{j}() depends on *x* only through {x}_{{C}_{j}}, then *f*() is said to factorize according to \mathcal{G}. If all the densities in a model factorize according to \mathcal{G}, then the model is said to be \mathcal{G}-Markov. When this is true, \mathcal{G} encodes the conditional independence structure of the model through the following result (the *global Markov property*): whenever sets *A* and *B* are separated by a set *C* in \mathcal{G}, *A* and *B* are conditionally independent given *C* under the model. This is written as *A*⊥⊥*B* | *C*. A *decomposable* graphical model is one whose dependence graph is triangulated.

### Selecting graphical models for chromosome-wide LD

Suppose that *N* observations of *p* SNPs from the same chromosome are sampled from some population. The variable set is written V=({v}_{1},\dots ,{v}_{p}), and it is assumed that these are ordered by physical position on the chromosome. The variables may be either observed genotypes or, if these have been imputed, inferred haplotypes: the former are trichotomous and the latter binary. Here an algorithm is described that uses these data to select a graphical model for the distribution of *V*. It is based on a penalized likelihood criterion

\mathit{\text{IC}}(\mathcal{G})=-2{\widehat{\ell}}_{\mathcal{G}}+\alpha \phantom{\rule{2pt}{0ex}}dim(\mathcal{G})

(1)

where {\widehat{\ell}}_{\mathcal{G}} is the maximized log-likelihood under \mathcal{G}, dim(\mathcal{G}) is the number of free parameters under \mathcal{G}, and *α* is a penalization constant. For the AIC, *α*=2 and for the BIC, *α*= log(*N*). The latter penalizes complex models more heavily and so selects simpler models. Under suitable regularity conditions, the BIC is consistent in the sense that for large *N* it will select the simplest model consistent with the data ([25], Sect. 2.6).
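
As a sketch, with invented values for the maximized log-likelihood, model dimension and sample size:

```python
from math import log

def ic(loglik, dim, alpha):
    """Penalized likelihood criterion IC(G) = -2*loglik + alpha*dim(G)."""
    return -2.0 * loglik + alpha * dim

loglik, dim, N = -1234.5, 40, 500        # hypothetical fitted model
aic = ic(loglik, dim, alpha=2.0)         # AIC: alpha = 2
bic = ic(loglik, dim, alpha=log(N))      # BIC: alpha = log(N)
print(aic)        # 2549.0
print(bic > aic)  # True: log(500) > 2, so BIC penalizes the 40 parameters more
```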

A technical but nonetheless important point is that calculation of the correct model dimension is not straightforward for high-dimensional models, since not all parameters may be estimable. A simple expression exists however for the difference in model dimension between two decomposable models that differ by a single edge ([10], pp. 37-40). This is useful when the search space is restricted to decomposable models and the search algorithm only involves comparison of models differing by single edges (as here).

A forward-backward approach to estimate the graphical model for *V* is used. The first (forward) step is based on a greedy forward search algorithm. To take account of the physical ordering of the SNPs, the algorithm starts from a model {\mathcal{G}}_{0}=(V,{E}_{0}) where *E*_{0} is the set of edges between physically adjacent SNPs: this model is called the *skeleton*. To seek the minimum BIC (or AIC) model, the algorithm repeatedly adds the edge associated with the greatest reduction in BIC (AIC); only edges whose inclusion results in a decomposable model are considered. The process continues until no more edges improve the criterion. The search space in this step is the class of decomposable models that include the skeleton. Note that, like almost all greedy algorithms, this algorithm is not guaranteed to find the global optimum.
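
The greedy loop of the forward step can be sketched as follows. Here `delta_bic` and `keeps_decomposable` are stand-ins for the real model scoring and decomposability check, and the toy score table is invented; this is not the gRapHD implementation:

```python
skeleton = {(1, 2), (2, 3), (3, 4), (4, 5)}   # edges between physically adjacent SNPs
candidates = {(1, 3), (2, 4), (3, 5), (1, 4)}
toy_delta = {(1, 3): -8.0, (2, 4): -3.5, (3, 5): 1.2, (1, 4): -0.5}

def delta_bic(edge, graph):
    return toy_delta[edge]      # stand-in: change in BIC from adding `edge`

def keeps_decomposable(edge, graph):
    return True                 # stand-in: real code must check chordality

graph = set(skeleton)
while True:
    legal = [e for e in candidates - graph if keeps_decomposable(e, graph)]
    if not legal:
        break
    best = min(legal, key=lambda e: delta_bic(e, graph))
    if delta_bic(best, graph) >= 0:
        break                   # no remaining edge improves the criterion
    graph.add(best)

print(sorted(graph - skeleton))  # [(1, 3), (1, 4), (2, 4)]: (3, 5) worsens the BIC
```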

There are several advantages to initially constraining the search space to include the skeleton. An unconstrained search starting from the null model (with no edges) would not take the physical ordering into account. Since the graphs considered are consistent with this ordering, they are conveniently displayed as LD maps, as illustrated below. Because decomposable models contain no chordless cycles of length ≥4, two distal loci cannot be linked by an edge unless there are sufficiently many intermediate (chordal) edges to prevent the occurrence of such cycles. Thus the algorithm proceeds by filling out around the skeleton, incrementally adding edges. The effect of this is that only local LD is modelled. Since both association and (implicitly) proximity are required, the included edges are more likely to reflect real association (that is, less likely to be due to chance), so in this sense the model selection process is more reliable. In addition, restricting the search space in this way improves computational efficiency. The linear-time algorithm described in the following section also builds on the physical vertex ordering.

In the second (backward) step the graph found in the first step is pruned, again using a greedy algorithm that seeks to minimize the BIC (AIC) criterion by removing edges, without requiring that the skeletal edges are retained. Keeping these in the model would be appropriate if adjacent SNPs are in high LD, but this is not always the case. For example, there may be recombination hotspots, or genome assembly errors may have led to errors in positioning the SNPs. The graphs may be used to examine whether this has occurred.

To display the resulting graph, a suitable graph layout is needed. After some experimentation, one such was obtained by modifying the horizontal dot layout in Rgraphviz [26], by exponentially smoothing the *y*-coordinates and replacing the *x*-coordinates with the SNP-number. In this way the *y*-coordinates are chosen so as to clarify the graphical structure: a consequence of this is that nodes with large absolute *y*-coordinates tend to signal genomic regions of high structural complexity.

For some purposes it is helpful to use more objective measures of complexity, and two such measures are used here. Consider a chromosome with *p* SNPs. For each i=1,\dots ,p-1, the *height* {h}_{i} of the interval between SNPs *i* and *i*+1 is defined to be the number of edges of the form (*j*,*k*), where *j*≤*i* and *i*+1≤*k*, and the *width* {w}_{i} is defined to be the maximum value of |*k*−*j*| over such edges. Note that by the global Markov property, when {h}_{i}=0 or equivalently {w}_{i}=0, the SNPs {V}_{\le i}=\{{v}_{j}:j\le i\} are independent of {V}_{>i}=\{{v}_{j}:j>i\}. Similarly, when {h}_{i}=1 or equivalently {w}_{i}=1, {V}_{<i}\perp \perp {V}_{>i}\phantom{\rule{0.3em}{0ex}}|\phantom{\rule{0.3em}{0ex}}{v}_{i}.
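
Both measures follow directly from the edge list, as in this Python sketch on an invented graph with *p*=6 SNPs:

```python
p = 6
edges = [(1, 2), (2, 3), (2, 4), (3, 4), (4, 5), (5, 6)]

h, w = {}, {}
for i in range(1, p):
    # edges (j, k) that span the interval between SNPs i and i+1
    spanning = [(j, k) for (j, k) in edges if j <= i and k >= i + 1]
    h[i] = len(spanning)
    w[i] = max((k - j for j, k in spanning), default=0)

print([h[i] for i in range(1, p)])  # [1, 2, 2, 1, 1]
print([w[i] for i in range(1, p)])  # [1, 2, 2, 1, 1]
# h_4 = w_4 = 1, so here v_1..v_3 are independent of v_5, v_6 given v_4
```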

As a measure of haplotype diversity, the *entropy* [27] is used here. Suppose that in some genomic interval there are *k* distinct haplotypes with relative frequencies {f}_{1},\dots ,{f}_{k}. Then the entropy is defined as H=-{\sum}_{i=1}^{k}{f}_{i}\phantom{\rule{0.2em}{0ex}}\log {f}_{i}. It is large when there are many distinct haplotypes that are equally frequent.
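
For example, using natural logarithms and an invented sample of eight haplotypes over a two-SNP interval:

```python
from collections import Counter
from math import log

haplotypes = ["01", "01", "10", "11", "01", "10", "11", "11"]
N = len(haplotypes)
freqs = [c / N for c in Counter(haplotypes).values()]  # 3/8, 2/8, 3/8
H = -sum(f * log(f) for f in freqs)
print(round(H, 4))  # 1.0822, just below the maximum log(3) for k = 3 haplotypes
```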

A useful way to display marker data over short genomic intervals is the *sample tree*[23]. This summarizes a set of discrete longitudinal data of the form {x}^{(v)}=({x}_{1}^{(v)},\dots {x}_{q}^{(v)}) for v=1\dots N. A rooted tree is constructed in which the non-root nodes represent partial outcomes of the form ({x}_{1},\dots {x}_{k}) for *k*≤*q* present in the sample data. Edges may be coloured with the marker value and labelled with the corresponding sample counts, or drawn with width proportional to the sample counts.

### A fast selection algorithm

For high-dimensional models, the first step of the algorithm described above can be computationally demanding. To address this, a much faster algorithm for the same task is now described. It involves combining a series of overlapping marginal models of lower dimension. Call the block length (the number of variables in each marginal model) *L* and the overlap *K*. Thus the first block is {V}_{1}=\{{v}_{1},\dots ,{v}_{L}\}, the second is {V}_{2}=\{{v}_{L-K+1},\dots ,{v}_{2L-K}\}, and so on. In the applications described here *L*=100 and *K*=20 are used.
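
The implied block boundaries can be sketched as follows; the helper `blocks` is hypothetical and assumes the final block is simply truncated at *p*:

```python
def blocks(p, L, K):
    """Start/end indices (1-based) of overlapping blocks of length L, overlap K."""
    out, start = [], 1
    while True:
        end = min(start + L - 1, p)
        out.append((start, end))
        if end == p:
            return out
        start += L - K

print(blocks(p=250, L=100, K=20))  # [(1, 100), (81, 180), (161, 250)]
```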

Suppose that the true model for V=({v}_{1},\dots ,{v}_{p}) is \mathcal{G}, and that \widehat{\mathcal{G}} is the estimate of \mathcal{G} obtained by performing the first step of the algorithm described in the previous section. The goal is to construct an approximation \stackrel{~}{\mathcal{G}} to \widehat{\mathcal{G}} by combining models {\widehat{\mathcal{G}}}_{i}=({V}_{i},{E}_{i}) obtained by applying that same algorithm to the blocks {V}_{i} for i=1,2,\dots \phantom{\rule{0.3em}{0ex}}.

A way to construct a model {\stackrel{~}{\mathcal{G}}}_{12} for *V*_{1}∪*V*_{2} by combining {\widehat{\mathcal{G}}}_{1} and {\widehat{\mathcal{G}}}_{2} is now described. Let {m}^{\ast}=max\{w:\exists \phantom{\rule{.5em}{0ex}}(v,w)\in {E}_{1}\phantom{\rule{.5em}{0ex}}\text{with}\phantom{\rule{.5em}{0ex}}v\le L-K\}. Then {\stackrel{~}{\mathcal{G}}}_{12} is defined as {\stackrel{~}{\mathcal{G}}}_{12}=({V}_{1}\cup {V}_{2},{E}_{1}^{0}\cup {E}_{2}^{0}), where {E}_{1}^{0}=\{(v,w)\in {E}_{1}:w\le {m}^{\ast}\} and {E}_{2}^{0}=\{(v,w)\in {E}_{2}:w>{m}^{\ast}\}.
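
On invented edge sets (with toy values *L*=10 and *K*=4 rather than the paper's 100 and 20), the rule works out as:

```python
L, K = 10, 4
E1 = {(1, 2), (2, 4), (5, 8), (6, 7), (8, 9)}        # edges of the model on v_1..v_10
E2 = {(7, 8), (8, 10), (9, 11), (11, 12), (12, 13)}  # edges of the model on v_7..v_13

# m* = largest right endpoint among E1 edges starting at or before L - K = 6
m_star = max(w for v, w in E1 if v <= L - K)
E12 = ({(v, w) for v, w in E1 if w <= m_star} |
       {(v, w) for v, w in E2 if w > m_star})
print(m_star)       # 8, from the edge (5, 8)
print(sorted(E12))  # E1 contributes edges up to v_8, E2 everything beyond
```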

The rationale for this is that marginal models may include spurious associations on the boundaries. For example, let {\mathcal{G}}_{1} and {\mathcal{G}}_{2} be the subgraphs of \mathcal{G} induced by *V*_{1} and *V*_{2}, respectively. Then the marginal distribution of *V*_{2} will not in general be {\mathcal{G}}_{2}-Markov, but it will be {\mathcal{G}}_{2}^{\ast}-Markov for a graph {\mathcal{G}}_{2}^{\ast} derived from {\mathcal{G}}_{2} by completing the boundary of each connected component of {\mathcal{G}}_{V\setminus {V}_{2}} in \mathcal{G} [28]. So {\widehat{\mathcal{G}}}_{2} will tend to contain edges not in {\mathcal{G}}_{2}. To estimate the boundary of *V*_{1}∖*V*_{2} in \mathcal{G}, {\widehat{\mathcal{G}}}_{1} is used: the boundary is estimated to be contained in the set \{L-K+1,\dots ,{m}^{\ast}\}. Hence by only including edges (v,w)\in {E}_{2} with *w*>*m*^{∗}, edges in this boundary are omitted. Similarly, the boundary of *V*_{2}∖*V*_{1} in {\mathcal{G}}_{1} may contain spurious associations, so {E}_{1}^{0} may contain unnecessary edges. To avoid this the overlap length *K* is chosen to be sufficiently large so that *m*^{∗}< min{*v*:∃ (*v*,*w*)∈*E*_{2} with *w*≥*L*+1}. If the maximum width were known in advance this inequality could be ensured by setting *K* to twice the maximum width, but so large a value appears to be unnecessary in practice.

The algorithm proceeds in the obvious fashion combining {\stackrel{~}{\mathcal{G}}}_{12} with {\widehat{\mathcal{G}}}_{3} and so on. Since the construction may result in chordless cycles it is necessary to triangulate the resulting graph \stackrel{~}{\mathcal{G}}. This can be done using the maximum cardinality search algorithm [29] which can be implemented in linear time.
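
For illustration, a compact (quadratic-time, for clarity) version of maximum cardinality search together with the associated chordality test might look as follows; this is a sketch, not the linear-time implementation of [29]:

```python
from itertools import combinations

def mcs_is_chordal(nbr):
    """nbr: vertex -> set of neighbours. MCS numbers vertices by repeatedly
    taking an unnumbered vertex with most numbered neighbours; the graph is
    chordal iff each vertex's earlier-numbered neighbours form a clique."""
    order, weight = [], {v: 0 for v in nbr}
    while weight:
        v = max(weight, key=weight.get)
        order.append(v)
        del weight[v]
        for u in nbr[v]:
            if u in weight:
                weight[u] += 1
    pos = {v: i for i, v in enumerate(order)}
    for v in order:
        earlier = [u for u in nbr[v] if pos[u] < pos[v]]
        if any(b not in nbr[a] for a, b in combinations(earlier, 2)):
            return False
    return True

square = {1: {2, 4}, 2: {1, 3}, 3: {2, 4}, 4: {1, 3}}  # chordless 4-cycle
print(mcs_is_chordal(square))                          # False
square[1].add(3); square[3].add(1)                     # add the chord 1-3
print(mcs_is_chordal(square))                          # True
```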

Assuming that the time for deriving the estimates {\widehat{\mathcal{G}}}_{i} for fixed *L* is bounded, the algorithm described here is by construction linear in *p*. But the proportionality constant can be expected to depend on various factors, such as the density of the minimum AIC/BIC model.

### Implementation

The methods are implemented in a set of R functions which are provided as Additional files 1 and 2 to the paper. The selection algorithms build on the forward search algorithm implemented in the stepw function in the gRapHD package [30]. Currently this function requires that there are no missing data. The backward search algorithm was developed by the author in the course of preparing this paper, and has subsequently been implemented in the gRapHD package. The functions to perform the selection algorithms, and others to work with and display the graphs, build on the following packages: igraph [31], graph [32], Rgraphviz [26] and gRapHD. The triangular heatmaps were produced using the LDheatmap package [33]. Simulation from a fitted decomposable model was carried out using the gRain package [34]. The rJPSGCS package provides an interface between R and the FitGMLD program, which was used to run the algorithm of [15].