The Cluster Variation Method for Efficient Linkage Analysis on Extended Pedigrees

Background: Computing exact multipoint LOD scores for extended pedigrees rapidly becomes infeasible as the number of markers and untyped individuals increases. When markers are excluded from the computation, significant power may be lost. Therefore accurate approximate methods which take all markers into account are desirable.

Methods: We present a novel method for efficient estimation of LOD scores on extended pedigrees. Our approach is based on the Cluster Variation Method, which deterministically estimates likelihoods by performing exact computations on tractable subsets of variables (clusters) of a Bayesian network. First, a distribution over inheritances on the marker loci is approximated with the Cluster Variation Method. This distribution is then used to estimate the LOD score for each location of the trait locus.

Results: First, we demonstrate that significant power may be lost if markers are ignored in the multi-point analysis. On a set of pedigrees where exact computation is possible we compare the estimates of the LOD scores obtained with our method to the exact LOD scores. Second, we compare our method to a state-of-the-art MCMC sampler: when both methods are given equal computation time, our method is more efficient. Finally, we show that CVM scales to large problem instances.

Conclusion: We conclude that the Cluster Variation Method is as accurate as MCMC and generally more efficient. Our method is a promising alternative to approaches based on MCMC sampling.


Background
The goal of genetic linkage analysis is to link phenotype to genotype. Pedigrees are collected where a trait or disease is believed to have a genetic component. The individuals in the pedigree are genotyped for a number of markers on the chromosome. The markers are at known relative recombination frequencies, so that from the genotypes a distribution over inheritances can be inferred. Linkage of the trait to a specific location in the marker map then is quantified by the extent to which the distribution over inheritances as inferred from the markers can explain the observed phenotypes in the pedigree.

Parametric linkage analysis
In this article we compute linkage likelihoods with the parametric LOD score (log odds ratio) proposed by Morton [1]. The LOD score is the log ratio of the likelihoods of the hypothesis that the disease locus is linked to the marker loci at a specific location and the hypothesis that it is unlinked to the marker loci. The LOD score requires specification of the disease frequency and penetrance values and therefore falls into the category of parametric scoring functions.

Exact computations
Several methods for exact computations are in use.
Lander et al. [2] introduced a Hidden Markov Model (HMM) in which the meiosis indicators are the unobserved variables. This method is linear in the number of loci, but exponential in 2n − f, where n is the number of nonfounders and f the number of founders. Kruglyak et al. [3] optimized the method in the program Genehunter.
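To make the exponential scaling concrete, the Lander-Green HMM tracks one inheritance vector per locus, with one bit per meiosis: 2n bits for the nonfounders, minus f bits removed by founder-phase symmetry. A small illustrative calculation (our own sketch, not code from any of the cited programs):

```python
def num_inheritance_states(n_nonfounders: int, n_founders: int) -> int:
    """Number of distinct inheritance vectors per locus in the
    Lander-Green HMM: 2 meioses per nonfounder give 2n bits,
    and founder-phase symmetry removes f of them."""
    return 2 ** (2 * n_nonfounders - n_founders)

# A nuclear family with 2 founder parents and 2 children:
print(num_inheritance_states(2, 2))   # 4 states per locus
# A 48-member pedigree with 10 founders (as used later in this article):
print(num_inheritance_states(38, 10)) # 2**66 states per locus
```

Even a modest extended pedigree thus has a per-locus state space far beyond what exact HMM methods can enumerate.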
Elston et al. [4] developed an algorithm that is efficient on pedigrees that have little inbreeding. This method is linear in the number of individuals (in case there is no inbreeding) but scales with the number of possible multi-locus genotypes. The method was made computationally efficient in the package Vitesse [5].
Both of these methods exploit particular independence properties of the statistical model. Within the framework of Bayesian networks, this approach has been generalized in the junction tree algorithm [6,7]. The computer program Superlink [8] implements this approach for linkage analysis and was the first program to make use of Bayesian networks for computing exact linkage likelihoods.
Although exact algorithms have been substantially improved over the years, the fact remains that they require an exponential number of operations and have limited applicability.

The Cluster Variation Method
The Cluster Variation Method originated with the work of Bethe [9] and was extended to non-pairwise marginals by Kikuchi [10] to compute properties of magnetic materials, such as Ising models. In later years, the method has been extended and reformulated [11,12]. Recently the method has been introduced into the machine learning community [13,14] as a method for approximate inference in Bayesian networks and undirected graphical models.
The Cluster Variation Method approximates an intractable probability distribution in terms of marginal probability distributions on clusters of variables. These clusters of variables are chosen such that exact computations are feasible on each cluster. We make explicit use of the formulation of linkage analysis in terms of a Bayesian network to choose which variables will be contained in the clusters. In contrast with MCMC the approximation is deterministic and yields estimates of the pedigree likelihood.

CVM and linkage analysis
As large complex pedigrees with individuals genotyped at a large number of locations become increasingly available, so does the need for methods that can estimate likelihoods on pedigrees where exact computations are not possible.
In this article we describe in detail how the Cluster Variation Method can be applied to the problem of genetic linkage analysis on pedigrees without inbreeding. We discuss extension of our approach to inbred pedigrees.

Results
We compare the estimates of the LOD score obtained with our method to exact scores as computed with Vitesse [5]. We also compare our method to Markov Chain Monte Carlo (MCMC) simulations. For this we have used version 2.5 of the Morgan sampler [15]. This MCMC sampler is optimized for pedigrees for which exact single locus computations are possible. To our knowledge this is the most advanced sampler for the pedigrees we consider.
We consider CVM converged if the marker marginals change by no more than 10^-3. We use the following settings for the Morgan sampler: the number of prior samples and burn-in samples are set to 50% and 10%, respectively, of the number of samples used for the actual estimates.
We performed all experiments on a Pentium-IV 2.8 GHz with 1 GB of physical memory running Linux.

Simulations
We start by motivating the use of approximate methods with an example. On the pedigree shown in figure 1, we have simulated a dominant disease with penetrance values f = (0.02, 0.98, 0.98) and trait allele frequencies t = (0.98, 0.02), at 0 cM. This pedigree can be handled by Genehunter [3]. We have simulated 25 pedigrees, where half of the individuals have genotypic and phenotypic data. Fourteen bi-allelic markers were simulated with marker allele frequencies m = (0.4, 0.6) for all 14 markers. The marker spacing is 1 cM.
In figure 2 we now compare exact LOD scores computed with all of the 14 available markers to exact LOD scores computed with only a subset of the markers. The figure shows that significant power is lost when markers are excluded from the multi-point analysis. The solid line represents exact LOD scores computed with all available markers. The LOD score peaks at 0 cM, where the disease was indeed simulated. The dotted line represents the LOD score based on 5 markers: for each location of the trait locus, a LOD score is computed by doing a multi-point calculation with the 2 markers to the left of the trait locus, 2 markers to the right of the trait locus, and the marker at which the trait locus is located. If the trait locus is located on the first marker in the marker map, no markers to the left of this marker are available so that the first 5 markers are used to calculate the LOD score for this location of the trait locus. This approach can be characterized as a sliding window approach.
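The sliding-window marker selection described above can be sketched as follows (a minimal illustration with our own index conventions, not code from the original analysis):

```python
def marker_window(n_markers: int, trait_pos: int, half_width: int = 2) -> list:
    """Indices of the markers used in a windowed multi-point LOD
    computation: half_width markers on each side of the trait position,
    shifted inward at the ends of the map so the window keeps its size."""
    size = 2 * half_width + 1
    start = max(0, min(trait_pos - half_width, n_markers - size))
    return list(range(start, start + size))

print(marker_window(14, 7))   # [5, 6, 7, 8, 9]
print(marker_window(14, 0))   # [0, 1, 2, 3, 4]: first 5 markers at the map edge
```

At interior positions the window is centered on the trait locus; at the map boundaries it degenerates to the first (or last) five markers, exactly as described for the first marker above.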
In this example the pedigree was small, so that exact scores can be computed with Genehunter for a virtually unlimited number of markers. However, as the size of the pedigree increases, the number of markers that can be analyzed simultaneously drops rapidly. In that case significant power may be lost. Thus, an accurate approximate method that can take all markers into account is desirable.
We now compare the estimates of the CVM and MCMC to the exact scores. The results are obtained on the pedigree shown in figure 3. There are 48 individuals, of which 10 are founders. The number of children per nuclear family increases from two in the second generation to five in the third generation. We simulate phenotypes and genotypes according to this pedigree. We consider a dominant disease with penetrance values f = (0.02, 0.90, 0.90). The disease allele frequency has been set to 2%, so that t = (0.98, 0.02). We assume that for each individual in the pedigree the affection status is known. For a marker spacing of 5 cM, we simulated 25 pedigrees with 3 markers and at least 15 affected individuals per pedigree. Each marker has 5 alleles with equal frequencies. 70% of the individuals in the last two generations are genotyped for all markers. The individuals in the first two generations are not genotyped.
In figure 4 we compare the quality of the approximation resulting from two cluster choices, C1 and C2. These cluster choices are specified in figure 5. The error is defined as the absolute difference between the exact LOD score and the CVM estimate of the LOD score, averaged over all positions of the trait locus. We see that the error of the larger clusters of C1 is small, an order of magnitude smaller than the error of the smaller clusters of C2.

This result demonstrates two points. First, it is important to include the interactions between meiosis variables on adjacent loci into at least one cluster. Second, the accuracy of the approximation can be adapted by increasing the number of variables per cluster. Although it is not the case that an increased number of variables per cluster guarantees a higher accuracy of the approximation, in our experience it is generally possible to obtain more accurate estimates by increasing the number of loci and/or the number of generations covered by a cluster.
For a difficult problem in the dataset, i.e. one where the MCMC and CVM errors are relatively large, we compare our result to MCMC estimates obtained with Morgan. In figure 6 the decrease of the CVM and MCMC error as a function of computation time (i.e. number of samples) is shown. We see that a significant increase in computation time does not significantly decrease the error and variance of the MCMC estimate. The error of the CVM estimates obtained with cluster choice C1 is indicated by the dashed line. The CVM computation time is varied by adjusting the value of the convergence criterion. We conclude that our method achieves higher accuracy for a given amount of computation time.
For the other pedigrees in the data set we compare CVM to MCMC, where for each problem MCMC is allotted the computation time required by CVM with cluster choice C1. The results are shown in figure 7. We see that the MCMC estimates are less accurate. The average CVM computation time is 700 seconds, although there is considerable variance, in the order of 100 seconds. Memory requirements vary between 100 and 250 MB, depending on the informativeness of the markers. We did not find a correlation between CVM computation time and the absolute error with respect to the exact distribution. Also, the outliers in the figure are not explained by large CVM computation times that consequently lead to an improvement of the MCMC estimate, as the MCMC computation time is fixed to the CVM computation time.
Since both methods will theoretically converge to the exact solution in the limit of infinite time resources, this is the only fair comparison. Additional simulations (not shown) indicate, in agreement with the results reported in figure 6, that the MCMC estimates are sufficiently converged. The CVM estimates are reproducible; variance is in the order of the convergence criterion.
We now demonstrate that the method scales to larger problem instances. We vary the number of markers, since Vitesse can handle very large pedigrees without loops, but only a small number of markers. We have simulated a dominant disease on the pedigree of figure 3, with 32 bi-allelic markers with equal allele frequencies. We have simulated one pedigree where all individuals are genotyped for all 32 markers. From this instance we create 16 problems by selecting a subset of the markers of the original problem.

In figure 8 we show that CVM computation time scales approximately linearly with the number of markers, as do memory requirements (not shown). The varying informativeness of the markers explains the fluctuations. Vitesse cannot handle more than 10 markers, because memory requirements exceed the available 1 GB. Memory requirement of CVM for 32 markers is 150 MB. In this case, MCMC estimates take several hours to reach convergence. We conclude that our method scales to large problem instances.

Discussion
We compared our method to the MCMC implementation of the Morgan sampler, which is to our knowledge the most advanced program for this problem. There are packages that can handle more general pedigrees than Morgan, such as SIMWALK2 [16,17], but here we have investigated only pedigrees without inbreeding. Preliminary results indicate that the CVM approximations based on the cluster choices presented in this article can give good results on inbred pedigrees. Extension to inbred pedigrees is possible and a direction for further research.
Theoretically, if the sampler is irreducible, MCMC estimates should ultimately converge to the exact score. However, in practice this may require extremely long computation times. The Cluster Variation Method does not guarantee that for a fixed choice of clusters the approximate marginals will converge to marginals of the exact distribution, but the same argument that holds for MCMC also holds for CVM: we can increase the cluster size and consequently computation time to improve the quality of the approximation.
In this article we have proposed cluster choices that generally give good results. Sometimes when many individuals are untyped the estimates can be inaccurate. Interestingly, on these problems the Morgan sampler also experienced severe difficulties.
In the approach we have taken, we can define a heuristic to detect errors in the approximation. Suppose we have three markers; then the LOD score for marker 2 can be computed either from the marginals defined on the nuclear families on the first and second marker, or from the marginals defined on the nuclear families on the second and third marker. If these LOD scores differ significantly, one should be very careful in interpreting the estimate and the number of variables per cluster must be increased. We cannot guarantee that if the LOD scores are consistent, the approximation is accurate. An obvious and useful extension would be an automatic procedure that gives the optimal set of clusters. However such a procedure is far from trivial, and the guideline to choose the clusters as large as available memory permits seems to work well in practice.
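The consistency heuristic above can be sketched as a simple check between the two overlapping estimates (the tolerance is an illustrative choice of ours, not a value from the article):

```python
def lod_estimates_consistent(lod_left: float, lod_right: float,
                             tol: float = 0.1) -> bool:
    """Compare the LOD score for a marker computed from the cluster
    marginals on (marker-1, marker) with the one computed from the
    marginals on (marker, marker+1). A large discrepancy signals that
    the clusters are too small and should be enlarged."""
    return abs(lod_left - lod_right) <= tol

print(lod_estimates_consistent(2.31, 2.33))  # True: the estimates agree
print(lod_estimates_consistent(2.31, 3.10))  # False: enlarge the clusters
```

As the text notes, agreement of the two estimates is necessary but not sufficient for an accurate approximation.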
In the current implementation we have applied a number of preprocessing techniques to improve the efficiency. We expect that even better efficiency can be obtained by applying more preprocessing techniques such as genotype elimination [18,19], and techniques specific to Bayesian networks such as value abstraction [20] and evidence-based compiling [21]. Also, preliminary simulations indicate that smaller clusters can give equally accurate estimates with reduced memory requirements.

[Figure 5: Cluster choices. Shaded nodes on neighboring markers l and l + 1 form a cluster. A large node represents the genotype of an individual, a small node the meiosis indicator of an individual. In the CVM approximation, only marginals of clusters on the marker loci are computed. A shows cluster choice C1: genotype nodes of parents and children and meiosis nodes of children in the nuclear family on adjacent marker loci form a single cluster. B shows cluster choice C2: genotype nodes of parents and children and meiosis nodes of children in a nuclear family on one marker locus form a cluster; the meiosis nodes of the children form a separate cluster.]
Other applications fit naturally into the framework presented here. Since the Cluster Variation Method is able to estimate pedigree likelihoods directly, the method presented here can be used directly to estimate recombination frequencies and to detect marker ordering errors with a maximum likelihood approach. Maximum likelihood haplotyping on general pedigrees is also very promising in this framework.

Conclusion
In this article we have demonstrated the feasibility of a new approach to compute linkage likelihoods for linkage problems that are beyond the reach of exact computations. Previous methods that are suited to deal with these intractable problems relied on sampled estimates. We have shown that a deterministic approach based on the Cluster Variation Method is able to obtain accurate estimates of LOD scores and generally is more efficient than MCMC methods.

A Bayesian network formulation
We briefly describe the Bayesian network that will enable us to compute likelihoods. Any probability distribution can be represented with a Bayesian network [22]. Therefore the use of Bayesian networks is merely a matter of convenient representation of a probability distribution, and is irrelevant to the issue of Bayesian versus frequentist statistics. Bayesian networks were first applied in the context of genetics by Jensen et al. [23]. Their approach was extended by Thomas et al. [24]. The use of Bayesian networks for exact computations was proposed by Fishelson et al. [8]. An extensive discussion of Bayesian networks in the context of genetics is given by Sheehan et al. [25]. These articles have demonstrated the power of Bayesian networks for linkage analysis.
The transmission of alleles from parents to children is clearly a directed process. A Bayesian network represents a probability distribution in terms of a directed graph, i.e. a graph where the links between the variables are directed. A Bayesian network is therefore particularly suited to model the probability distribution associated with the problem of multi-point linkage analysis. By specifying conditional probability tables for each variable, the formalism of Bayesian networks guarantees that the corresponding probability distribution is consistent and normalized [22]. While the Bayesian network is the most convenient representation for model specification, it is not possible to apply the Cluster Variation Method directly to a Bayesian network. In order to perform inference, the Bayesian network is converted to an undirected graphical model by the procedure of moralization [7]. Moralization removes the directions of the links and adds links between the parent variables of a variable, i.e. all variables with a link directed towards a given variable. The undirected graphical model represents exactly the same probability distribution as the directed graphical model; the addition of extra links ensures that the correlations encoded in the conditional probability tables are correctly taken into account by the inference method. This conversion of the Bayesian network is standard practice and is used by all of the above mentioned methods that make use of Bayesian networks.
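As a small illustration of moralization (a toy example of ours, not from the article): for each node, keep its links to its parents, connect the parents to each other, and drop all directions.

```python
from itertools import combinations

def moralize(parents: dict) -> set:
    """Return the undirected edges of the moral graph.
    `parents` maps each node to the list of its parent nodes."""
    edges = set()
    for child, pa in parents.items():
        for p in pa:                          # keep parent-child links
            edges.add(frozenset((p, child)))
        for p, q in combinations(pa, 2):      # "marry" the parents
            edges.add(frozenset((p, q)))
    return edges

# A child genotype node whose parents are two genotype nodes and a meiosis node:
g = moralize({"G_child": ["G_father", "G_mother", "v_child"]})
print(len(g))  # 6 edges: 3 parent-child links plus 3 links among the parents
```

The three added links among the parents are exactly what ensures that the correlations in the child's conditional probability table survive the removal of directions.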

The Cluster Variation Method requires specification of which variables are contained in each cluster, and this specification becomes very transparent when the dependencies between these variables are modeled with a Bayesian network. The Bayesian network consists of a number of marker loci, with known relative recombination frequencies θ, and a single trait locus linked to the markers at a given position λ_T. The purpose of linkage analysis is to determine the most likely position of the trait locus relative to the markers. To that end a Bayesian network is constructed for each possible location λ_T of the trait locus, so that the likelihood of the trait phenotypes and marker genotypes can be computed for that location of the trait locus. The ratio of this likelihood and the likelihood with the trait locus unlinked to the markers then gives the LOD score for location λ_T.

Single locus model
First we define the Bayesian network for a single locus. The inheritance model is shown in figure 9A. Each variable is represented graphically by a node. A conditional probability table for a variable is defined by the variable itself and all variables (or, equivalently, nodes) that have a link pointing to that variable. In the figure, the variables are the genotypes and the meiosis indicators. We note that the genotypes of all non-founders are completely determined by the genotypes of the founders and the meiosis indicators v; the meiosis indicators on adjacent loci are coupled through the recombination frequencies. Collecting all conditional probability tables, the full probability distribution is given by

P(G, v | f, t, m, θ, λ_T) = ∏_l [ ∏_{i∈F} P(G_{i,l}) ∏_{i∈NF} P(G_{i,l} | G_{father(i),l}, G_{mother(i),l}, v_{i,l}) ] ∏_{l,l'} P(v_{l'} | v_l, θ_{l,l'}),

where the product over (l, l') runs over adjacent loci, the founder priors P(G_{i,l}) are given by the allele frequencies m (and t at the trait locus), and the observed marker genotypes M and trait phenotypes T enter as evidence on the corresponding nodes. Here the founder and non-founder individuals are denoted by F and NF, respectively. The index of the trait locus depends on the position of the trait locus, so that l_T = l_T(λ_T). The recombination frequency θ_{l,l'} depends on λ_T if either l = l_T or l' = l_T, so that θ_{l,l'} = θ_{l,l'}(λ_T); otherwise θ_{l,l'} is independent of λ_T.
This distribution is normalized to one, by construction. Computing marginal distributions is generally intractable because the structure of the corresponding Bayesian network can be too complex due to loops. Loops are caused by inbreeding in the pedigree and through the coupling of the meiosis indicators between different loci.

Calculating LOD scores
The LOD score of parametric linkage analysis is defined as the log ratio of the likelihood that the trait locus is linked to the marker loci at location λ_T and the likelihood that the trait locus is unlinked, denoted by λ_T = ∞:

LOD(λ_T) = log10 [ P(T, M | f, t, m, θ, λ_T) / P(T, M | f, t, m, θ, λ_T = ∞) ].

The denominator can be rewritten as P(T, M | f, t, m, θ, λ_T = ∞) = P(T | f, t) P(M | m, θ), giving

LOD(λ_T) = log10 [ P(T, M | f, t, m, θ, λ_T) / ( P(T | f, t) P(M | m, θ) ) ].

The denominator has to be computed only once and acts as a normalization constant. We will use the Cluster Variation Method to approximate both likelihoods independently. The likelihood in the numerator has to be estimated for each position of the trait locus.
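Given the two (approximated) likelihoods, the LOD score itself is a one-line computation; a sketch with made-up numbers, only to fix the arithmetic:

```python
import math

def lod_score(lik_linked: float, lik_trait: float, lik_markers: float) -> float:
    """LOD = log10 of the linked likelihood over the unlinked one,
    where the unlinked likelihood factorizes as P(T|f,t) * P(M|m,theta)."""
    return math.log10(lik_linked / (lik_trait * lik_markers))

# Illustrative values: a linked likelihood ten times the unlinked product
print(lod_score(1e-4, 1e-3, 1e-2))  # 1.0
```

A LOD score of 1 thus means the data are ten times more likely under linkage at λ_T than under no linkage.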

The Cluster Variation Method
In this section we describe how the Cluster Variation Method can be used to obtain approximations of marginal distributions of the exact distribution. In order to apply the Cluster Variation Method more conveniently, we will make a slight change in notation.

The probability distribution of a Bayesian network is of the general form

P(x) = ∏_i P(x_i | x_π(i)),

where π(i) are the nodes with a link directed towards node i, x_i is the value assumed by node i, and x is the vector of values assumed by all nodes in the Bayesian network. If there are no nodes with a link pointing to node i, we have P(x_i | x_π(i)) = P(x_i). We consider evidence to be the observation that a node is clamped to a state, e.g. an individual is affected or has marker genotype (1, 2). Suppose we have evidence that node n is clamped to state x̄_n, denoted by e = {x_n = x̄_n}; then the likelihood of this evidence is

P(e) = Σ_x P(x) δ(x_n − x̄_n).

For genetic linkage analysis the evidence is on marker genotypes M and trait phenotypes T, and we wish to compute the likelihood of these observations given the model parameters.

We now define the potential functions

ψ_i(x_i, x_π(i)) = P(x_i | x_π(i)) δ(x_i − x̄_i) for nodes i with evidence, and
ψ_i(x_i, x_π(i)) = P(x_i | x_π(i)) for nodes without evidence,

and e = {e_1, ..., e_n}. Here δ(·) is the delta function, which serves to clamp a node to its observed value. Using these definitions, we can rewrite the likelihood of the evidence e as

P(e) = Σ_x ∏_i ψ_i(x_i, x_π(i)),

so that the probability distribution over nodes without evidence x_{i\e} conditional on nodes with evidence x_e is given by

P(x_{i\e} | x_e) = (1 / P(e)) ∏_i ψ_i(x_i, x_π(i)).

We have now reformulated the probability distribution of the Bayesian network in terms of so-called potential functions ψ which no longer reflect how the links between the nodes were originally directed. Also, the potential function ψ_i(x_i, x_π(i)) contains both node i and the parents of node i. As a result, in the undirected graph associated with these potential functions all parents of node i are connected to each other. This formal procedure is called moralization [7] and is essential to the application of all inference methods to Bayesian networks. Equation 5 specifies the same distribution as equation 3, but will be more convenient for applying the Cluster Variation Method.
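For a network small enough to enumerate, the likelihood of the evidence is literally the sum over all joint states with the clamped nodes held fixed. A two-node toy example with our own illustrative numbers:

```python
from itertools import product

# P(A) and P(B | A) for two binary nodes A -> B
P_A = {0: 0.6, 1: 0.4}
P_B_given_A = {(0, 0): 0.9, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.8}

def likelihood(evidence: dict) -> float:
    """P(e): sum over all joint states consistent with the clamped nodes.
    The delta functions are implemented by skipping inconsistent states."""
    total = 0.0
    for a, b in product((0, 1), repeat=2):
        if evidence.get("A", a) != a or evidence.get("B", b) != b:
            continue
        total += P_A[a] * P_B_given_A[(a, b)]
    return total

print(likelihood({"B": 1}))  # 0.6*0.1 + 0.4*0.8 = 0.38
```

With no evidence the sum runs over all states and returns 1, confirming normalization; real linkage networks are far too large for this brute-force sum, which is exactly what motivates the CVM.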

Obtaining the exact distribution from a variational principle
The exact distribution P can be derived from a variational principle:

P = argmin_Q F(Q),   (6)

subject to the constraint that Q is normalized to one, where

F(Q) = KL(Q, Ψ) = Σ_x Q(x) log [ Q(x) / Ψ(x) ],   with Ψ(x) = ∏_i ψ_i(x_i, x_π(i)),

and KL is the Kullback-Leibler divergence. The solution is readily given by P = Ψ / Σ_x Ψ(x). However, the sum in equation 6 runs over all joint states x and is therefore intractable for the problems we consider.

Essential to the CVM approximation is that the clusters α are defined on overlapping subsets of nodes: a single node, or even a subset of nodes, can occur in several of the basic clusters α ∈ B. The full distribution is approximated by a product of cluster marginals,

Q(x) = ∏_{γ∈U} Q_γ(x_γ)^{a_γ},   (7)

where U consists of the basic clusters B together with all their intersections, and the a_γ are the Moebius numbers. Although the product (7) of the cluster marginals may not be a good approximation of the full distribution, the approximation is designed such that the cluster marginals Q_α(x_α) are accurate approximations of the exact marginals P(x_α). In this example, the clusters α ∈ B that determine the approximation are those shown in figure 11.

Approximate free energies
We now discuss the optimization problem of equation 6, which is to be redefined in terms of the cluster marginals. Inserting the CVM approximation of equation 7 into expression 6, we obtain

min_Q F_CVM(Q) = min_Q Σ_{γ∈U} a_γ F_γ(Q_γ),   (9)

where the minimization is subject to normalization constraints,

Σ_{x_γ} Q_γ(x_γ) = 1   for all γ ∈ U,

and to consistency constraints,

Σ_{x_{γ\γ'}} Q_γ(x_γ) = Q_γ'(x_γ')   for all γ, γ' ∈ U with γ' ⊂ γ.

Again, the a_γ are the Moebius numbers. The consistency constraints ensure that if two clusters have a non-empty intersection, the marginal distributions on the nodes in the overlap are consistent.
In equation 9 we have introduced the free energy F_γ(Q_γ) of cluster γ ∈ U:

F_γ(Q_γ) = Σ_{x_γ} Q_γ(x_γ) log [ Q_γ(x_γ) / Ψ_γ(x_γ) ],

where Ψ_γ contains all conditional probability tables that are defined on subsets of nodes in cluster γ. The optimization problem in equation 9 is now in terms of the distributions Q_γ(x_γ), which are tractable by choice. The intractable optimization problem of equation 6 has been turned into a tractable optimization problem, by substituting for the exact distribution P the product Q defined in equation 7. In the next section we will discuss how the optimization problem of equation 9 can be solved efficiently.
Returning to the example of figure 11, we can now write down the corresponding free energy, which is to be minimized with respect to the cluster marginals Q_α(x_α) and Q_β(x_β):

F_CVM(Q) = Σ_{α∈B} F_α(Q_α) + Σ_{β∈M} a_β F_β(Q_β),

with M the set of intersection clusters and a_β their Moebius numbers. The minimization is subject to normalization and consistency constraints. For example, the consistency constraint between clusters α_1 and β_1 reads

Σ_{x_{α_1\β_1}} Q_{α_1}(x_{α_1}) = Q_{β_1}(x_{β_1}).

In the example, we have put all meiosis nodes into different clusters. In practice this gives inaccurate approximations; it turns out to be necessary to join all paternal and maternal meiosis indicators in one cluster, because of the strong correlations between these variables. The reason is that if the phase in one of the genotypes of the parents is reversed, for a given state of the meiosis indicators different alleles are transmitted to the children.
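The cluster set U and the Moebius numbers can be computed mechanically from the basic clusters: close B under non-empty intersection, then assign each cluster the number 1 minus the sum over its strict supersets. A sketch of ours (not code from the article):

```python
def moebius_numbers(basic):
    """Close the basic clusters under non-empty intersection and assign
    Moebius numbers so that every node is counted exactly once overall."""
    U = set(basic)
    grew = True
    while grew:                      # repeat until closed under intersection
        grew = False
        for a in list(U):
            for b in list(U):
                both = a & b
                if both and both not in U:
                    U.add(both)
                    grew = True
    a_num = {}
    for g in sorted(U, key=len, reverse=True):   # supersets first
        a_num[g] = 1 - sum(a_num[h] for h in U if g < h)
    return a_num

# Two overlapping basic clusters sharing one node:
a = moebius_numbers([frozenset({1, 2}), frozenset({2, 3})])
print(a[frozenset({2})])  # -1: the doubly counted overlap is subtracted once
```

The negative number on the overlap is precisely the source of the concave terms in the free energy discussed in the next section.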

Minimizing the CVM free energy
Minimizing the CVM free energy is difficult, since the functional F_CVM(Q) is high-dimensional and generally non-convex. Yedidia et al. [13] derived an inference algorithm based on the Cluster Variation Method, called Generalized Belief Propagation (GBP). This fixed point iteration algorithm is not guaranteed to converge, because of the non-convexity of the CVM free energy. Convergent algorithms were proposed by Rangarajan et al. [26] and Teh et al. [27] and more recently by Heskes et al. [28]. These so-called double loop algorithms minimize F_CVM(Q) by iteratively improving a convex upper bound on the non-convex functional that can be minimized by fixed point iteration. The double loop algorithm always converges to a (local) minimum of the free energy.
We use the double loop approach described in [28].
Although single loop algorithms [13] may converge in some cases, they often require damping of the fixed point equations, and it can be difficult to find a good trade-off between efficiency and robustness of the algorithm. Double loop algorithms can be slower in cases where single loop algorithms do converge, but the setting of the parameters of the double loop algorithm is less critical and convergence is guaranteed in theory.
We will give an outline of the algorithm; for full details we refer to [28]. The starting point is the issue of the non-convexity of F_CVM(Q). The free energy of each single cluster, F_γ(Q_γ), is convex in terms of the approximate marginal Q_γ(x_γ). This can be seen by writing it out:

F_γ(Q_γ) = −Σ_{x_γ} Q_γ(x_γ) log Ψ_γ(x_γ) + Σ_{x_γ} Q_γ(x_γ) log Q_γ(x_γ) = E(Q_γ) − S(Q_γ).

Here we have introduced the energy E(Q_γ) and the entropy S(Q_γ). These names stem from statistical physics, where the Cluster Variation Method is used to compute properties of certain metals that can be described as systems of interacting magnetic spins. The energy term is linear in the marginal distribution Q_γ. By differentiation it can be seen that the minus entropy has a positive second derivative, and therefore −S(Q_γ) is convex.
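The cluster free energy F_γ(Q_γ) = E(Q_γ) − S(Q_γ) is straightforward to evaluate for a tabulated marginal; a minimal sketch with illustrative potentials of our own choosing:

```python
import math

def cluster_free_energy(Q: dict, Psi: dict) -> float:
    """F_gamma = sum_x Q(x) * log(Q(x) / Psi(x))
               = energy E(Q) minus entropy S(Q)."""
    return sum(q * math.log(q / Psi[x]) for x, q in Q.items() if q > 0)

# When Q is Psi normalized, the free energy equals -log(Z), Z = sum of Psi:
Psi = {0: 2.0, 1: 2.0}
Q = {0: 0.5, 1: 0.5}
print(cluster_free_energy(Q, Psi))  # log(0.25) = -log(4) ≈ -1.386
```

This also illustrates why the minimum of the exact variational problem recovers the (negative log) normalization constant.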
We now take a look at the CVM free energy again and identify the convex and concave terms:

F_CVM(Q) = Σ_{α∈B} F_α(Q_α) + Σ_{β∈M+} a_β F_β(Q_β) + Σ_{β∈M−} a_β F_β(Q_β).   (10)

Here M+ is the set of clusters β ∈ M with a_β > 0 and M− the set of clusters β ∈ M with a_β < 0. Since all cluster free energies are convex, the clusters β ∈ M− with negative Moebius numbers have concave contributions to the total free energy F_CVM(Q), which therefore becomes non-convex.
The double loop algorithm is based on the following idea.
Since the fixed point iterations as employed in GBP converge if the free energy is convex, a convergent algorithm can be constructed by iteratively minimizing and improving convex upper bounds to the CVM free energy. Let us denote the convex upper bound by F_conv(Q, Q'). Define 𝒬 as the collection of marginal distributions (i.e. cluster marginals) that are normalized and satisfy all consistency constraints between overlapping marginal distributions. Following [28], the upper bound must be at least twice differentiable and satisfy the following properties:

1. F_conv(Q, Q') ≥ F_CVM(Q) for all Q, Q' ∈ 𝒬 (upper bound);
2. F_conv(Q, Q) = F_CVM(Q) for all Q ∈ 𝒬 (touching).

Then the algorithm

Q^{n+1} = argmin_{Q ∈ 𝒬} F_conv(Q, Q^n),

with Q^n the approximate marginals at iteration n, is guaranteed to converge to a local minimum of the CVM free energy F_CVM(Q) under the appropriate constraints. The free energy decreases with each iteration, since

F_CVM(Q^{n+1}) ≤ F_conv(Q^{n+1}, Q^n) ≤ F_conv(Q^n, Q^n) = F_CVM(Q^n),

where the first inequality follows from condition 1 (upper bound) and the second from the definition of the algorithm. Condition 2 (touching) in combination with differentiability ensures that the algorithm is only stationary in points where the gradient of F_CVM is zero. By construction, Q^n ∈ 𝒬 for all n.
A convex upper bound can be obtained easily by bounding the concave contributions to the free energy. Since the energy term in the free energy of each cluster is already convex, only the concave entropy terms of clusters with negative Moebius numbers need to be bounded. A convex upper bound on a concave entropy term can be achieved by linearizing it:

S_β(Q_β) = −Σ_{x_β} Q_β(x_β) log Q_β(x_β) ≤ −Σ_{x_β} Q_β(x_β) log Q'_β(x_β) ≡ S_β(Q_β, Q'_β),

which directly follows from KL(Q_β, Q'_β) ≥ 0. Putting this into expression 10, we obtain for the convex upper bound:

F_conv(Q, Q') = Σ_{α∈B} F_α(Q_α) + Σ_{β∈M+} a_β F_β(Q_β) + Σ_{β∈M−} a_β [ E_β(Q_β) − S_β(Q_β, Q'_β) ].

We now see that both the energy E_β(Q_β) and the bounded entropy S_β(Q_β, Q'_β) are linear in the cluster marginal Q_β(x_β). We can therefore simplify this expression by redefining the energies of the basic clusters α ∈ B: the linear terms of the clusters β ∈ M− are absorbed into the energies of the basic clusters that contain them. The resulting convex upper bound has the same functional form as a CVM free energy restricted to the clusters in B and M+, and can be minimized using the single loop algorithm described in [13].
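The linearization bound is easy to verify numerically: for any distributions Q and Q', the quantity −Σ Q log Q' is an upper bound on the entropy of Q, with equality at Q' = Q (a direct restatement of KL(Q, Q') ≥ 0). A small check of ours:

```python
import math

def entropy(Q):
    """S(Q) = -sum_x Q(x) log Q(x)."""
    return -sum(q * math.log(q) for q in Q if q > 0)

def linearized_entropy(Q, Q_prime):
    """The bound S(Q, Q') = -sum_x Q(x) log Q'(x); linear in Q."""
    return -sum(q * math.log(p) for q, p in zip(Q, Q_prime))

Q, Qp = [0.3, 0.7], [0.5, 0.5]
print(entropy(Q) <= linearized_entropy(Q, Qp))             # True: upper bound
print(abs(entropy(Q) - linearized_entropy(Q, Q)) < 1e-12)  # True: touching
```

These two properties are exactly the "upper bound" and "touching" conditions required of F_conv above.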
Thus, the double loop algorithm consists of an outer loop and an inner loop:

Outer loop: given the current estimate Q^n, construct the convex upper bound F_conv(Q, Q^n).
Inner loop: minimize F_conv(Q, Q^n) using single loop fixed point iterations, yielding Q^{n+1}.
The procedure is illustrated in figure 12.
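The outer/inner-loop structure can be illustrated on a one-dimensional analogue (a toy problem of ours, not the CVM free energy): f(x) = x⁴ − x², whose convex part x⁴ is kept and whose concave part −x² is replaced by its tangent upper bound −2x'x + x'², touching at x = x'. Here the inner minimization is solvable in closed form.

```python
def double_loop_1d(x0: float = 1.5, iterations: int = 60) -> float:
    """Minimize f(x) = x**4 - x**2 for x > 0 by iterating:
    outer loop: bound the concave part, -x**2 <= -2*x_n*x + x_n**2;
    inner loop: minimize x**4 - 2*x_n*x + x_n**2 exactly,
    i.e. solve 4*x**3 = 2*x_n, giving x = (x_n / 2) ** (1/3)."""
    x = x0
    for _ in range(iterations):
        x = (x / 2.0) ** (1.0 / 3.0)
    return x

# Converges to the local minimum of f at x = 1/sqrt(2) ≈ 0.7071,
# and each iteration can only decrease f, as in the CVM double loop.
print(double_loop_1d())
```

Each bound touches f at the current iterate and lies above it everywhere else, so the sequence of inner minimizations drives f monotonically downward, which is the essence of the convergence guarantee.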
We have described the double loop algorithm for the case where all subclusters with negative Moebius numbers are bounded. However, it is possible to bound convex entropy contributions of subclusters with positive Moebius numbers as well. This tends to sharpen the bound, because the bounding of the convex entropy terms then counters the effect of bounding the concave entropy terms. An advantage of this bound is that the inner loop iteration scheme becomes much simpler; this is the bound that we have used for the simulations in this article. Even tighter bounds can be obtained by not bounding all concave entropy terms, such that F_conv remains convex on the constraint subset. We refer to [28] for more details on the specific conditions.

Applying the Cluster Variation Method to linkage analysis
In this section we describe how we apply the Cluster Variation Method to estimate LOD scores. We outline the algorithm for the case that the pedigree is not inbred, which is the case for which we have performed simulations.
From the definition of the Bayesian network in equation 3 it follows that the exact likelihood of the phenotypes T conditional on the marker genotypes M can always be rewritten as:

P(T | M, f, t, m, θ, λ_T) = Σ_{v_T, v_M} P(T | v_T, f, t) P(v_T | v_M, θ, λ_T) P(v_M | M, m, θ).

The first factor on the right hand side concerns the likelihood of the trait data given an inheritance vector v_T on the trait locus. The second factor is a distribution over trait locus inheritance vectors, conditional on marker loci inheritance vectors v_M. The last term is the distribution over marker loci inheritance vectors conditional on the marker data M, the marker allele frequencies and the recombination frequencies θ specified by the marker map.
Essentially this decomposition is possible because the marker and trait loci are connected only through the meiosis indicators; the model can be viewed as a Hidden Markov Model where the meiosis indicators v are the hidden variables.

Step 1
Simulations indicate that approximating the full distribution P(T, M | f, t, m, θ, λ_T) with the Cluster Variation Method can give bad results when the inheritance implied by the trait data T is very different from the inheritance implied by the marker data M for a given location of the trait locus. Therefore we choose to approximate the distribution over marker loci inheritance vectors P(v_M, G_M | M, m, θ) independently of the trait data T.
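The HMM view means the likelihood can be accumulated locus by locus with a standard forward pass. A generic sketch over a tiny state space (illustrative parameters of ours; in the real problem the states are the inheritance vectors and the transitions are determined by the recombination frequencies):

```python
def forward_likelihood(init, trans, emit, observations):
    """Standard HMM forward algorithm: init[s] is the prior of hidden
    state s, trans[t][s] the transition probability from t to s
    (recombination between adjacent loci), emit[s][o] the probability
    of the observed data o at a locus given state s. Returns the total
    likelihood of the observation sequence."""
    n = len(init)
    alpha = [init[s] * emit[s][observations[0]] for s in range(n)]
    for o in observations[1:]:
        alpha = [emit[s][o] * sum(alpha[t] * trans[t][s] for t in range(n))
                 for s in range(n)]
    return sum(alpha)

# Two hidden states, sticky transitions, perfectly informative emissions:
init = [0.5, 0.5]
trans = [[0.9, 0.1], [0.1, 0.9]]
emit = [[1.0, 0.0], [0.0, 1.0]]
print(forward_likelihood(init, trans, emit, [0, 0]))  # 0.5 * 0.9 = 0.45
```

The forward pass is linear in the number of loci; it is the exponential number of hidden states on large pedigrees, not the chain over loci, that the CVM approximation addresses.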

In terms of the conditional probability tables of the Bayesian network, we have

P(v_M, G_M | M, m, θ) ∝ ∏_l [ ∏_{i∈F} P(G_{i,l} | m) ∏_{i∈NF} P(G_{i,l} | G_{father(i),l}, G_{mother(i),l}, v_{i,l}) ] ∏_l P(v_{l+1} | v_l, θ_{l,l+1}),   (16)

where the subscript l runs only over marker loci. This is the multi-locus Bayesian network described previously, but without the trait locus. We make the CVM approximation

Q(v_M, G_M) = ∏_{γ∈U} Q_γ(x_γ)^{a_γ}.

Now consider for the remainder of this section the case where the set of basic clusters B consists of the clusters

B = { (G_{c_i,(l,l+1)}, G_{π_i,(l,l+1)}, v_{c_i,(l,l+1)}) },   (17)

where c_i are the children in nuclear family i in the pedigree, π_i are the parents in nuclear family i, G and v are the corresponding genotype and meiosis nodes, and (l, l + 1) represents two adjacent marker loci in the marker map. The subscript i runs over all nuclear families in the pedigree, and the subscript l = (1, ..., L_M − 1), with L_M the number of marker loci. This is exactly cluster choice C1 of figure 5A.
Given the conditional probability tables that define the Bayesian network in equation 16, the approximate marginals can be obtained by minimizing the CVM free energy F_CVM(Q) of equation 9 corresponding to the clusters defined in expression 17. The minimization is done using the double loop algorithm.
Step 2
In this step the likelihood of the trait data is computed using the approximate distribution over inheritance vectors v_M on the marker loci. This computation entails the summation

P(T | M, f, t, m, θ, λ_T) ≈ Σ_{v_T, v_M} P(T | v_T, f, t) P(v_T | v_M, θ, λ_T) Q(v_M).

Because of the factorization assumed by the CVM approximation with the clusters defined in expression 17, this computation can be done efficiently. Suppose we would like to calculate the likelihood of the trait data T for the case where the trait locus is located between the fourth and fifth marker: