- Methodology article
- Open access
- Published:
Estimating genealogies from linked marker data: a Bayesian approach
BMC Bioinformatics volume 8, Article number: 411 (2007)
Abstract
Background
Answers to several fundamental questions in statistical genetics would ideally require knowledge of the ancestral pedigree and of the gene flow therein. A few examples of such questions are haplotype estimation, relatedness and relationship estimation, gene mapping by combining pedigree and linkage disequilibrium information, and estimation of population structure.
Results
We present a probabilistic method for genealogy reconstruction. Starting with a group of genotyped individuals from some population isolate, we explore the state space of their possible ancestral histories under our Bayesian model by using Markov chain Monte Carlo (MCMC) sampling techniques. The main contribution of our work is the development of sampling algorithms in the resulting vast state space with highly dependent variables. The main drawback is the computational complexity that limits the time horizon within which explicit reconstructions can be carried out in practice.
Conclusion
The estimates for IBD (identity-by-descent) and haplotype distributions are tested in several settings using simulated data. The results appear to be promising for a further development of the method.
Background
There are several fundamental questions in statistical genetics for which the answer would ideally require knowledge of the ancestral pedigree and of the gene flow therein. In practice, however, one will usually have available only partial information from the pedigree, and hardly any information on the accompanying ancestral allelic histories. Examples of intrinsic questions of this kind are: haplotype estimation from pedigree data [1] or from general population samples [2], pairwise estimation of the degree of relatedness between individuals in natural populations [3] or for forensic purposes [4], study of allele-sharing among affected individuals [5], generation of simulated data in a way which is compatible with observed marker genotypes (genotypic elimination; see [6, 7]), estimation of population structure using multilocus genotype data [8–10], estimation of the number of founder chromosomes for given loci [11], gene mapping by combining pedigree and linkage disequilibrium information [12, 13], and tracing genotyping errors in pedigrees [14].
Due to the shared generating process of inheritance pattern on the pedigree, it is not surprising that the current methods for addressing the above questions also have some similarities. One of the similarities is that several methods first attempt to generate an identity-by-descent (IBD) distribution between the individuals in the study sample, although the particular solutions for carrying this out then differ significantly from each other. There are also papers focusing only on the estimation of the IBD-distribution based on known pedigree information [1, 15–17], known haplotypes [13] or known population history [18]. For population based data, there are some approaches that approximate and model the genealogical history of a sample of chromosomes from a population by using ideas from the coalescent theory [19] and its extension incorporating recombinations, expressed in terms of an ancestral recombination graph [20, 21]. Applications of these ideas include haplotyping [22] gene mapping [23–25], estimating population parameters [26], and recombination rates [27]. However, in spite of the progress (see also [28, 29]), the development of effective MCMC sampling methods for ancestral recombination graphs in general has been relatively slow.
This paper extends the genealogy estimation method of Gasbarra et al. [10] to the case of linked markers. Our starting point is a sample of individuals from a natural population, each being genotyped at certain marker loci, but without any direct information on their pedigree or interrelations. Our target is to provide an explicit reconstruction, in terms of a probability model, of the recent history of the genealogy connecting the sampled individuals, conditionally on the observed genotype data and available information on the demography of the population. The model is specified by the following parameters: (1) time in generations since the founding of the population, (2) the marker allele frequencies in the founder population, (3) two mating parameters α to the case of linked markers. Our starting point is a sample of individuals from a natural population, each being genotyped at certain marker loci, but without any direct information on their pedigree or interrelations. Our target is to provide an explicit reconstruction, in terms of a probability model, of the recent history of the genealogy connecting the sampled individuals, conditionally on the observed genotype data and available information on the demography of the population. The model is specified by the following parameters: (1) time in generations since the founding of the population, (2) the marker allele frequencies in the founder population, (3) two mating parameters a and β controlling the mating behaviour [30], (4) the number of males and females in each generation, and (5) the genetic distances on the marker map. Combined with an algorithm for drawing Monte Carlo samples from the conditional distribution of genealogies, this modelling framework can be applied to address, within limits of computation, all the "intrinsic questions" mentioned above.
Methods
Prior distribution on the configuration space
The configuration space of possible ancestral histories (Ω) has three components: the ancestral graph (or pedigree) specifying the relationships between individuals, the paths of alleles of these individuals at the marker loci, and the types of the founder alleles introduced into the ancestral graph via the founder individuals. The probability model on the configuration space is similar to the one described by Gasbarra et al. [10] except that in this study the linkage between marker loci is allowed. Due to the similarities we give here only a brief summary of the model.
Ancestral graph
For pedigrees we use the probability model introduced by Gasbarra et al. [30]. The model considers an isolated population with non-overlapping generations indexed backwards in time by t = 0, 1, ..., T with t = 0 referring to the present and t = T to the founder generation. The population is characterized by four sets of parameters: , , α t and β t , for t = 1, ..., T. The parameters and describe respectively the number of males and females belonging to generation t of the population. Parameter α t controls the differences of reproductive success between males in generation t: large values of α t imply nearly equal numbers of children for each male whereas with small values of α t there will be a few dominant males who are mainly responsible for the reproduction. Parameter β t tunes the degree of monogamy (of males) in generation t: large values of β t lead to random mating and small values of β t introduce more permanent family structures into the pedigree. Naturally the roles of males and females can be changed in the model. We denote this probability measure on pedigree graphs by (·).
Flow of alleles through the ancestral graph
We assume a fixed marker map with L loci, and denote the recombination fractions between loci by ρ = (ρ(l, l') : 1 ≤ l <l' ≤ L). Note that several chromosomes can be modelled simultaneously using the recombination fraction p(l, l') = to indicate that markers l and l' lie in different linkage groups.
By definition, the genome of each individual in the ancestral graph consists of a pair of paternal and maternal haplotypes. The flow of alleles through the pedigree is determined by the grandparental origins which for haplotype i are denoted by ψ i = (ψ i (1), ..., ψ i (L)) ∈ {0, 1}L. The convention used here is that ψ i (l) = 0 if the allele at locus l of haplotype i is of grandmaternal origin, and ψ i (l) = 1 in the case of grandpaternal origin.
If an allele carried by an individual in generation t > 0 is transmitted to some individual in the present generation, we say that the allele is ancestral, and otherwise that it is censored. Since we are actually interested only in the paths of the ancestral alleles we set ψ i (l) = ∅ if the allele at locus l of haplotype i is censored.
The probability of a set of grandparental origins of nonfounder haplotypes on the pedigree is given by
where Λ i = {l : ψ i (l) ≠ ∅}, j(ψ i , l) denotes the last uncensored locus of haplotype i before l, with the convention that j(ψ i , l) = -∞, if l is the first uncensored locus of the haplotype i, ρ(-∞, l) = and Δ i (l) = |ψ i (l) - ψ i (j(ψ i , l))| with the convention that ψ i (-∞) = 0.
Types of founder alleles
Denote by g k = (g k (l) : l = 1, ..., L) the unordered genotype of individual k and let A = (g k : k ∈ ) be the vector of founders' genotypes. Assuming linkage equilibrium at the founder generation, the probability of the founder alleles is given by
where the population genotype frequencies fr(·; l) at each marker locus l are assumed given. (If Hardy-Weinberg equilibrium is assumed, we can use the population allele frequencies instead.) The genotype frequencies are extended to partially or totally censored genotypes in the obvious way. Note that the ordered founder alleles together with the grandparental origins of the nonfounder haplotypes determine the flow of alleles in the pedigree.
Prior distribution
Given the pedigree parameters (, , β t , α t , T), the population genotype frequencies and the recombination fractions between the marker loci, a configuration ω consisting of a pedigree G and a geneflow with founder alleles A and grandparental origins Ψ, is assigned (prior) probability
Data and posterior distribution
Suppose that we observe the genotype data D = (g k (l) : l ≤ L, k ≤ n(0)) of n(0) individuals in the current generation. The posterior probability of configuration ω is simply
where ⊆ Ω is the set of configurations that are compatible with the observed genotype data.
As it seems impossible to sample independent realizations from the posterior (see Appendix A.1) we shall use a Markov chain Monte Carlo method to perform the computations.
Markov chain Monte Carlo algorithm
The general idea of MCMC methods and the details of our algorithm are given in the appendices, whence here we only sketch the main ideas of devising an efficient proposal distribution for a Metropolis-Hastings algorithm.
In a typical proposal move of our algorithm, a group of children in the ancestral graph try to change one or both of their parents to other possible parents of the population. This is done either by selecting the children uniformly at random from the ancestral graph, or by considering all children of a randomly chosen parent. It is necessary that the proposed new paths for the ancestral alleles carried by the children are compatible with the genes carried by the new ancestors. In order to obtain such a compatible configuration with a reasonable probability, our proposal distribution given in (8) for choosing the new parents takes into account sequentially the children's genotypes and the transmission probabilities of the alleles of prospective candidate parents at all marker loci. After choosing the new parents, we use the transmission probabilities of the new ancestors to resample the paths of the alleles carried by the children (10). Finally the new configuration is accepted or rejected according to the Metropolis-Hastings rule.
In the simplified setting of our earlier work [10] the transmission probabilities were calculated by assuming free recombinations between marker loci. When the markers are linked, the situation becomes much more difficult. Although the proposal based on independent transmission probabilities produces configurations compatible with the genetic data with high probability, a drawback is that typically these configurations contain unrealistically many recombinations. In the case of several tightly linked markers this leads to low recombination likelihood scores, and consequently low acceptance rates and poor mixing of the Markov chain. In order to avoid that, we have to include at least partially the recombination likelihood into the proposal distribution for the allelic paths. This is computationally demanding but possible through the Viterbi-Baum algorithm for hidden Markov models. In Appendix B we describe how to sample the allelic phases of an individual, jointly at all marker loci, by taking into account his/her genotypes, his/her parents' transmission probabilities, and the likelihood of the recombination pattern on the haplotypes of his/her children. This step is used sequentially to generate the new allelic paths.
Using the algorithm of Kruglyak and Lander [31], we also construct a joint sampling distribution for the allelic phases of a group of siblings and of their parents at all marker loci, combining the parental transmission probabilities with the recombination likelihood on the haplotypes of the children (see Appendix A.7.2).
The computational complexity of these sampling steps grows linearly in the number of markers, suggesting that it is not an unrealistic task to handle hundreds of linked markers.
Results
The performance of the method was tested on two simulated data sets. The first data set was designed specifically to give information on the method's performance in the problems of haplotyping and relatedness estimation. For the former, comparisons were done with corresponding results obtained with PHASE [32] and for the latter with three existing moment estimators [33–36]. The second data set was generated using concepts from gene mapping. The purpose was to evaluate the potential advantages of using the IBD-information produced by our method over simple IBS-sharing statistics.
Haplotyping and relatedness estimation
Simulated data
We considered a simulated pedigree that extended for 10 generations and contained 439 individuals (Figure 1). This pedigree was also used by Gasbarra et al. [10], as their Example III, and the details of the simulation procedure are given there.
The gene flow on the pedigree was simulated at 20 linked marker loci. All markers were polymorphic with 10 equally frequent alleles at the population level and the neighbouring loci were separated by the recombination fraction of 0.05. The simulation of genetic data was accomplished by sampling the founder alleles from the population allele frequencies and dropping them down through the pedigree in accordance with the recombination model.
Reconstruction
Our task was to provide possible reconstructions of the simulated pedigree and corresponding gene flow using only the unordered genotype data on the youngest generation but having no information on the underlying pedigree structure. The marker map, the population allele frequencies and the size of the population were assumed known in the reconstruction i.e. they were given the same values as in the data simulation. The mating parameters α and β can be estimated by a maximum-likelihood estimator [30] from observed family structures if the size of the base population is known. We used this approach to estimate β on the basis of the family structures between generations 1 and 2 in the simulated pedigree (here 0 refers to the youngest generation), resulting in the value β = 4 × 10-4. For α we used generation dependent values α t = β, where was the number of females belonging to generation t = 1, ..., 9. We ran five independent sample chains, each with a different seed for the random number generator. One of the chains was extended to 1, 000, 000 iterations whereas the other four were stopped after 500, 000 iterations. The longer run took about 10 days on a Pentium-4 2.8 GHz processor. The results for haplotyping and IBD-analyses were saved from every tenth iteration. The monitored statistics behaved very similarly across the different runs suggesting that with these data the method performs consistently regardless of the initial state. To compare the performance of the present model with our earlier model that assumed unlinked markers [10] the corresponding runs were also conducted with the simpler model.
Haplotyping
If there are h ≥ 1 heterozygous loci in the multilocus genotype, then there are 2h-1different ways to do the haplotype assignment. Note that here we do not distinguish between the parental (paternal/maternal) origins of the haplotypes, which would further increase the number of different assignments to 2h. There is only a single correct haplotype configuration, and to measure how much our estimates deviate from it we use the concept of switch distance [2]. We say that two adjacent heterozygous loci are correctly (incorrectly) phased if the corresponding two locus haplotypes are correct (incorrect). The switch distance of the haplotype assignment is defined as the number of incorrectly phased adjacent heterozygous loci. The maximum switch distance of a haplotype configuration is one less than the number of heterozygous loci, and it is zero only for the correct configuration. For example, if the true haplotypes are (111111, 222222) then the switch distance of pair (112222, 221111) is 1 and that of pair (121212, 212121) is 5.
In Figure 2 a path of the sum of switch distances of haplotypes belonging to the current generation is shown. If alleles (at heterozygous loci) were assigned to the two haplotypes randomly, then the switch distance of the haplotype pair of individual i would be distributed according to Binomial(h i - 1, ), where h i > 0 is the number of i's heterozygous marker loci. The sum of switch distances would then be distributed as Binomial(h - n, ), where and n is the number of individuals in the sample. In our simulated data, n = 39 and h = 675; thus, under the null model of random haplotype assignment, the expected sum of switch distances would be 318 and the corresponding standard deviation 12.6. In our five test runs the average initial value of the sum of switch distances was 295 from where it decreased during the iterations to an average value (calculated over the iterations 250,000, ..., 500,000 of the five runs) of 82.8. The corresponding average value over the iterations 500,000, ..., 1,000,000 of the longer run was 61.9. It can also be seen from Figure 2 that our recombination model significantly enhances the results from the ones attained by assuming free recombinations.
There are two kinds of haplotyping software currently available, namely pedigree-based and population-based methods. In actual fact, our data contain some close relatives, but since no relationship information is assumed to be available we are working with a general population sample. For a comparison we analysed the data with the software PHASE (v.2.1) [32]. Usually PHASE and other population-based haplotyping methods are applied to much denser marker maps where recombinations (per meiosis) are rare and the haplotypes are very likely to be inherited as single units through tens of generations. This is not the case here, but since the current version of PHASE is widely used and takes into account also recombinations, we chose it as the reference method among the existing algorithms.
We considered three different chain lengths for PHASE (100 (default), 500, and 1000 iterations) and then ran ten independent test runs for each with different choices of the seed of the random number generator. The burn-in part and the level of thinning were kept at their default values of 100 and 1 iterations, respectively.
PHASE can be requested to give the posterior probability distribution of all possible haplotypes for each individual. We measured the accuracy of these estimates by calculating the expected switch distance for each individual with respect to his/her posterior distribution of the haplotype configurations. The sums of the switch distances over all 39 individuals, averaged over ten independent runs, were 80.1, 76.6 and 76.7 for the chains of length 100, 500 and 1000 iterations, respectively. Since increasing the number of iterations from 500 to 1000 did not seem to enhance the results, we did not try to run PHASE longer. The best value PHASE gave (76.6) is shown with a dotted line in Figure 2. It seems that, at least on these data, PHASE and our algorithm have quite comparable accuracy in estimating the haplotype configurations, even though the underlying models are very different.
A clear advantage of PHASE over our method is in its speed: a single run takes only a couple of minutes whereas our algorithm was run for several days. One of the reasons is that we are modelling explicitly the whole genealogy, not just the haplotypes, and are thus able to address some other questions with the same effort. In the future we could also try using PHASE to give an initial haplotype configuration for our method.
Relatedness estimation
Here we consider relatedness estimation in a similar way as Gasbarra et al. [10]. Two alleles are said to be identical-by-descent (IBD) if they descend from the same ancestral allele within the pedigree. Note that two alleles may be identical-by-state (IBS), i.e. represent the same allelic form, without being IBD if two or more founder alleles happen to be of the same type. The concept of IBD thus indicates whether two contemporary alleles descended from a common ancestor that had existed since the founder generation, but it does not estimate their possible coalescent times more accurately. However, if needed, we could also capture the exact time (in generations) of these coalescing events.
In order to quantify the relatedness we denote by r ij (l) the probability that a randomly chosen allele from locus l of individual i has an IBD-copy in individual j. For individuals i and j we define the locus-specific relatedness coefficients R ij (l) = (r ij (l) + r ji (l)) and the genome-level relatedness coefficient . Note that in the presence of inbreeding we can have r ij (l) ≠ r ji (l). However, always R ij (l) = R ji (l) and R ij = R ji .
As our input data contain no pedigree information, it seems that other methods available for IBD-estimation from such data are based on different formulas that combine the IBS-status of the markers and the known population allele frequencies to an estimate of the IBD-probability (usually R ij ). We have compared the estimates given by our algorithm with three such moment estimators described by Lynch [33] and Li et al. [34] (LL), Lynch and Ritland [35] (LR) and Wang [36] (W).
These three methods assume unlinked loci and then combine the locus-specific results according to some weighting schemes in order to obtain estimates of the genome-level relatedness coefficients. The derivations of both LR and W are also based on the assumption of no inbreeding. Since our data violate these assumptions some additional error may be caused to the moment estimators. Moreover, it is questionable whether these moment estimators actually answer the exact question of IBD-sharing when restricted to the latest ten generations (see also [37]) as their estimates are relative to the base population defined by the allele frequencies and no exact reference point of IBD-sharing can be specified (like the founder generation in our example). On the other hand, polymorphic data sets like the one used here are advantageous for the moment estimators since in these cases IBS-sharing gives already a fairly good approximation of the actual IBD-sharing.
The accuracy of the relatedness estimates was measured by squared error. Namely, we computed the true values of the coefficients R ij for each pair of individuals from the original genealogy and compared then the distribution of quantities (R ij - )2 between our method and the three above mentioned moment estimators. We also included results obtained with our method without modelling the linkage in order to illustrate again that the linkage model enhances the results. The distributions of the errors are shown with boxplots in Figure 3, where the letter G refers to our method. The sums of squared errors over all 741 pairs of individuals were 1.89, 2.43, 3.25, 3.27 and 3.51 for G, G(unlinked), LL, LR and W, respectively. The results for our method were calculated as the average values over the five runs of length 500,000 iterations (burn-in parts were 250,000 iterations).
In gene mapping it is of interest to know the exact locations in the genome in which some group of individuals share alleles IBD (see the next example). In Figure 4 we have chosen six pairs of individuals belonging to the current generation and illustrated both the true IBD-sharing fractions R ij (l) (dotted lines) and the estimated IBD-probabilities (l) (solid lines). It seems that our results do not significantly underestimate the true IBD-profiles. As a very large pedigree would generally result in too low IBD-estimates we may conclude that the reconstruction algorithm has not introduced many extra parents to the pedigree. In those parts of the chromosomes where our estimates exceed the true values, there is a certain amount of additional IBS-sharing for which our algorithm has not been able to completely rule out the possibility of it actually being IBD. Note that the lack of IBS-sharing already implies that there can be no IBD-sharing.
Gene mapping
In this example we applied concepts from gene mapping to further monitor the accuracy of our reconstruction algorithm. We simulated a monogenic trait with a dominant mode of inheritance, and then investigated whether we can trace the position of a trait locus, relative to a set of marker loci, by considering suitable allele or haplotype sharing statistics. The purpose of the example is to compare the estimated IBD and haplotyping results to the simulated "true" ones and to the plain IBS data, whereas a proper extension of the model to gene mapping will be considered elsewhere (see also Discussion).
Simulated data
We considered a population that has grown exponentially by a factor of 1.2 during the 9 most recent generations. The founder level (assumed to be in Hardy-Weinberg and linkage equilibrium) was taken to be the 19th generation (backwards in time). The population was postulated to have maintained a constant size of 200 individuals between the 9th and the 19th generations. The method of Gasbarra et al. [30] was used to simulate a 20-generation pedigree from this population, with 400 individuals at the current (0th) generation. The parameter α was set to 10.0 in order to decrease the relatedness of the individuals belonging to the current generation. The monogamy parameter β was set to 0.001. As a result the pedigree contained 4815 individuals of whom 120 were founders.
A gene flow was simulated on the marker map containing 14 microsatellite markers separated by the recombination fraction of 0.10, and 26 SNP markers located in such a way that between any two adjacent microsatellites there were two SNPs evenly spaced with respect to genetic distance. The allele frequencies were sampled from the Dirichlet distribution with all parameters equal to 1. For microsatellites there were 10 different alleles and the SNPs were biallelic at the founder level.
Having simulated the pedigree and the gene flow we fixed an additional trait locus half way between (SNP) markers 20 and 21 and simulated the segregation of the founder alleles at that locus in accordance with the inheritance patterns of the flanking markers and the Haldane recombination model. We chose one particular founder allele at the trait locus and collected all of its 44 carriers from the current generation to form our sample.
Reconstruction
The question was whether we could spot the trait locus by comparing the values of an IBD-sharing statistic of the sampled individuals in different marker loci. The idea is that the carriers should share more alleles IBD near the trait locus than elsewhere on the chromosome. For computational reasons, and also in order to violate the assumption of the founder generation being in exact Hardy-Weinberg and linkage equilibria, the time horizon was in the reconstruction set to 9 generations instead of 19 that was used in the simulation. Considering the history only 9 generations backwards is likely to produce challenges for the reconstruction, since in the simulated data there were altogether six different copies of the trait allele at the 9th generation, all of whom had descendants among our sample. On the other hand, when extending the analysis for tens of generations backwards in time the exact number of generations being considered is likely to become less important, and also it is less likely that we could find out the exact generation in which each coalescing event actually occurred.
The population allelic frequencies and the population size were considered known. For β the previously estimated value 4 × 10-4 was used and again α t = β where was the number of females belonging to generation t, t = 1, ..., 9 (the current generation has index 0). The algorithm was run for 100, 000 iterations, which took about two days on a Pentium-4 2.8 GHz processor. The results were saved from every tenth iteration and averaged over two independent runs.
Results
Let us denote by the set of founders of a pedigree and by the set of individuals in our sample. For each marker l and each individual i denote by the marker genotype of i at locus l . Enumerate all 2|| founder haplotypes and let be the founder alleles of i at locus l, where (l) is the label of the founder haplotype from which the allele (l) originates (k = 1, 2). We consider the following allele sharing statistic. Let Min(; l) be the size of a smallest set V ⊆ {1, ..., 2||} of founder alleles for which
Note that such a minimal V does not have to be unique, but our interest lies in the unique size of these sets. Finding such a minimal set V is an instance of a well-known NP-complete problem of finding a minimum vertex cover for a given graph. This can be seen by considering the graph where the founder alleles are vertices and for each i ∈ there is an undirected edge between and . Graph theoretic formulation of the problem made it possible to compute the exact values of Min(; l). (A brute-force search over 244 different sets at each locus and on each iteration would not be feasible.) The same statistic (with a different sign) is called in [5] where it was reported to work well for extended pedigrees and dominant traits. In Figure 5 we have plotted the values of Min(; l) at each marker locus, both for the reconstruction (averaged over iterations) and for the true situation with two choices for the founder generation (9th and 19th generations). All three plots have their minimum at the marker locus 20, in good agreement with the trait locus lying half way between the loci 20 and 21.
It seems that generally the number of different founder alleles in the reconstruction is more similar to the original situation in the 19th generation than to that in the 9th generation, even though the reconstruction was actually done for 9 generations. This suggests that, as we reconstruct more generations backwards in time, the assumption of the founder generation being in Hardy-Weinberg and linkage equilibria may become more important than the actual number of generations considered. In other words, the algorithm may try to squeeze the original pedigree and gene flow to the given time horizon. On the other hand, it is unlikely that the postulated genetic equilibria would hold to a very close approximation among a set of founders of a population isolate.
No similar drop in allele numbers can be seen near the trait locus in mere IBS-sharing statistics among the sampled individuals. In Figure 6 we have plotted the IBS-based Min(; l) statistics that were calculated by replacing the founder labels (l) in the definition above with the corresponding allele types (l). Since allele frequencies may have a strong effect on the expected number of different IBS-alleles, we have chosen a control group of 44 individuals randomly among the 356 non-carriers belonging to the current generation of the original pedigree. The lower curve in Figure 6 illustrates the differences Min(; l) - Min(; l). Staying non-negative near locus 20, it does not seem to give a signal of the trait locus. We also compared the entropies of the (IBS) allele distributions of carriers and controls (results not shown) but did not find any excess allele sharing near the trait locus.
As an alternative measure of genetic similarity we can also monitor how long haplotype segments the sampled individuals share in different parts of the genome. For a given locus l the haplotype sharing statistic HSS(l) is computed as the average sharing between all distinct haplotype pairs (h and k) at that locus in the sample. We say that the sharing s hk (l) is zero if the corresponding alleles at locus l are inherited from different founder haplotypes, otherwise s hk (l) is the length (in genetic distance) of the corresponding overlapping IBD-segments. Following [30] the value of the haplotype sharing statistic at locus l in the sampled group of individuals was evaluated using the formula
where n = || (see also the statistics in [38–40]).
In Figure 7 we show the values of the haplotype sharing statistics for both the reconstruction and the true situation, the latter with two different choices of the founder level (the 9th and the 19th generation). The curves calculated from the true allele paths show clearly that the sampled individuals share longer haplotypes near the trait locus than elsewhere in the chromosome, especially in the direction of the 20th marker locus. These facts are also present in the reconstruction, but the signal is much weaker.
The questions that remain can now be stated as follows: (i) How much of the complete information do the mere genotype data on the youngest generation contain, and (ii) how strong a signal one can expect to get from this kind of data with any computational method? We use a long chromosomal segment (1.45 Morgans) with a quite sparse marker map (over 3 cM between adjacent loci), as we are interested in the recombination process. Difficulties with these data may arise because recombinations are frequent and mix up shorter segments in various ways. In addition, the sampled individuals are more distantly related to each other in this example than in the previous one, as can be seen from the average values of R ij with respect to the 9th generation, (0.05 and 0.19, respectively, in this and in the previous example).
Haplotyping is easier for our method when there are siblings or other close relatives in the sample, since their data tend to cluster to the same families also in the reconstruction and thus increase locally the amount of information on those parts of the pedigree. This can be seen, for example, in Figure 4 where the estimates of IBD-sharing are very close to the true values for siblings (the leftmost pictures) and somewhat less accurate for more distant relatives (the rightmost pictures). Some further enhancements in the estimates might also be achieved if we knew some parts of the pedigree and were able to utilize this information in our algorithm. Such possibilities will be considered in our future work.
On the other hand, we cannot completely rule out the possibility that the relatively weak signals in haplotype sharing would be influenced by the slow mixing of the MCMC sampler.
Discussion
The main motivation leading to this study was to provide a common methodological basis for considering a number of inter-related fundamental questions in statistical genetics, by extending our earlier work [10] to linked marker data. Our approach can be seen to complement the methods that are used in the reconstruction of coalescents. In particular, both approaches start from DNA samples taken from present day individuals and then make an attempt to trace back their common genetic origins. In the original coalescent analysis the main focus is on attempts to reconstruct the underlying evolutionary history, driven by mutations [19]. To do this, one proceeds backwards along germ lines looking for Most Recent Common Ancestors (MRCA's), every time coalescing two lines when a common ancestor for them is postulated, and finally ending with the root of the resulting tree structure. Later the coalescent theory has been extended in several directions [41], most notably to ancestral recombination graphs (ARG's) [20, 21] that include the recombination process. Common to the coalescent theory and its extensions is that the relevant time scale for them is usually of the order of thousands of generations or even much longer. On balance, generally one then considers only relatively short aligned sequences of DNA at a time. The present method is analogous to the search for MRCA's in that it, too, can be seen as a search for chromosomal areas shared by some individuals in the sample, and is carried out by sampling explicit hypothetical reconstructions of the past. Here, however, the genealogies are assumed to be driven by mating and meioses and the effect of mutations is ignored. Also the time scale in which these reconstructions are carried out is very much shorter, of the order of tens of generations. But then they are carried out by jointly considering marker loci that cover much wider chromosomal areas such as whole chromosomes, or even the entire genome. Note also that our model builds on an explicit consideration of diploid chromosomes, which is a source of substantial technical complications in the computations. Thus, in trying to resolve questions concerning shared ancestral origins of the marker alleles, we also allow for the possibility of inbreeding, that is, loops in the ancestral graph. The original coalescent process considers the genealogy of a random sample of genes from a very large and randomly mating population whence the sample size has a large effect on the coalescent time. In contrast, if we consider closely related or ascertained (for some phenotype) individuals in an effectively smaller population, then even a small sample is likely to find common ancestry within our framework. Thus the main factor in determining the rate of coalescences in our method is the relatedness structure that the data contain, not necessarily the number of sampled individuals in the data. Additionally, we emphasize that in our model (as in ARG's) the coalescences may involve only small parts of the haplotypes and thus the coalecences are not tied to the number of individuals in the first place. Namely, after a few generations backwards in time, the haplotypes of any sampled individual may have split into tens of parts, each having its own genealogical tree.
Reconstructing plausible pedigrees of the sampled individuals conditionally on the observed marker data always requires some knowledge about the recent history of the population. Here such information is provided in the form of postulated parameter values for controlling population growth and mating, as well as assuming that the members of the founder generation are in linkage and Hardy-Weinberg equilibrium. Unrelated founders in linkage equilibrium seems to be a common assumption in pedigree analyses. This assumption is likely to be unrealistic especially when it is applied to small pedigrees whose founders lie only a few generations backwards in time (as in traditional linkage analysis). On the other hand, extending the founder level to some tens of generations farther back in time, as is done here, allows more realistic modelling of the relatedness structure within a few of the most recent generations (i.e. within those generations that contain the sampled individuals).
When assessing the usefulness of our approach in practical applications based on real data it will be important to assess the sensitivity of the results to variations in the tuning parameters. In the gene mapping example one may also enquire how strongly the results would depend on the number of generations that are considered in the reconstruction. As noted in [10] the relation between the concept of generation in the model and its counterpart in the real population is not straightforward and varies as a function of marker data and parameter values.
Our numerical examples illustrated how one can usefully summarize the relevant posterior information contained in an MCMC sample of ancestral graphs by considering certain statistics of interest, such as those describing the relatedness between a pair of individuals at different marker loci. One can then think of the sampled pedigrees as being merely vehicles that alleles need to find their way through the pedigree from the founders to the study sample, or as nuisance parameters that will ultimately be integrated out from the results. In view of the enormous size of the sample space of ancestral graphs, it is the relative robustness of these summary statistics to the exact pedigree and gene flow information which makes our approach based on MCMC sampling at all feasible.
Our approach becomes soon computationally infeasible as the number of generations in the reconstruction increases since the task of finding suitable proposals requires computations whose complexity grows rapidly with an increasing depth of the pedigree. In our numerical examples we used a single desk computer. However, more computational power is needed to handle larger data sets and/or denser marker panels (e.g. SNP data) in a reasonable time. We have sketched a tempering version of the algorithm in which several chains run in parallel (see also [42]). The idea there is to improve the mixing of the sampler by giving each chain its own "temperature" that then controls the weight of each recombination likelihood in the acceptance ratios of the Metropolis-Hastings updates. The higher the temperature, the more easily the proposals are accepted, but the results are monitored only for the chain at the lowest temperature where the recombination likelihood is not relaxed at all. After a certain prespecified number of iterations the chains at neighbouring temperatures compare their configurations in terms of their respective likelihood values and then may swap the temperatures according to a suitable rule. The idea is that if some chain finds a good configuration, this configuration can move gradually towards the lowest temperature from which the results are then collected.
Another method, also using parallel Markov chains, is provided by the Feynman-Kac-Metropolis algorithm [43]. Given a Metropolis-Hastings algorithm with target distribution π(x), proposal distribution q(x → y) and acceptance probability a(x → y), one can construct a system of N interacting Markov chains as follows. At time t, given the previous state of the system , we sample independently proposal states from the respective proposal distributions q, i = 1, ..., N. Then, for i = 1, ..., N, with probability a( we take , and otherwise sample from the set {} by assigning the probability
to the choice . In this way the N chains interact and the particle system explores the target distribution's landscape more efficiently than a system of N independent Metropolis chains based on the same proposal kernel q(x → y). Although the N-fold product distribution π⊗Nis not the invariant distribution of the N-particle system, it has been shown that as N → ∞ and t → ∞, the empirical distribution converges to the target distribution π.
We conclude by some remarks concerning gene mapping. Often data used in genetic mapping studies consist of a number of nuclear or extended families, each formed by first ascertaining an affected individual (proband) and then collecting marker and/or phenotype data on close relatives of the proband. If this is the case, it is natural to make use of the known family structure and of the marker data that may be available, for example, from the siblings and the parents of the proband. Considering the gene flow within each such small known pedigree, and making use of this information in a phenotype or penetrance model, will then correspond to "ordinary" linkage or QTL-analysis. However, particularly in data sets in which all probands are collected from a genetic isolate, as is often the case, also these small pedigrees can be assumed to share, in the sense of IBD, some part of their ancestry. In such cases, there is growing interest in the current literature in modelling the relatedness between founder individuals (e.g., [12, 13, 44]). In the currently existing approaches the recent shared ancestry is modelled only at the putative QTL-position or estimated separately for each marker/QTL (based on flanking marker loci), whereas here we consider this question jointly at all marker loci. Note also that, by making use of the Haldane map function, this gives us a handle for doing the same, in terms of probabilities, on the intervals between flanking markers. In a near future our plan is to modify the present reconstruction method in a way which allows us to fix the known parts of the pedigree and the corresponding marker information to the extent in which it is known, and then apply the reconstruction algorithm for building "bridges between these islands".
It seems likely that no single method can perform equally well on the whole spectrum of different types of genetic data currently available. Indeed, interplay of several methodological approaches will be crucial during the future gene mapping studies. The main role of our method may be in the initial stage of a genome-wide mapping project when interesting regions are sought using a marker spacing that is measured in centimorgans and the pedigree records are not complete. In the study of complex disease traits our method can be applied to estimate the haplotypes and/or relatedness structure which can then be used as input parameters for subsequent QTL or association mapping (e.g. [45]). In order to provide a more systematic approach to this problem, we are currently planning to build an even larger Bayesian model which would allow us to combine these two stages of analysis. This would involve expanding the present method further by adding to the model one more layer of hierarchy corresponding to the underlying genetic architecture of the trait [45–48]. In such a large integrated method, generation of IBD and haplotype distributions as well as screening of QTL-positions could all be performed as parts of a joint analysis. At least in principle, the number of contributing loci and their positions, the size of their effects, interactions within and between genes and environmental factors, as well as the mode of a gene action, could be analysed by such a method. Of course, the computational requirements for this kind of gigantic model are even larger than for the current method, and the practical implementation will be a major computational challenge.
Conclusion
We have implemented an algorithm for analyzing recent history of linked multilocus genotype data sampled from an isolated diploid population. We model the paths of the observed alleles through tens of generations by explicitly including all ancestral individuals and corresponding meioses into the possible ancestral configurations. Thus we are extending the methods that estimate gene flows on fixed pedigrees to the case where also the pedigrees need to be estimated.
We have tested the method on the problems of haplotyping and IBD-estimation. In both cases the method performs well compared to some widely used existing methods. We have also illustrated how our estimates for IBD-sharing are more informative than a simple IBS-sharing statistic on a tentative example on gene mapping.
Our experiences reported here and in [10] encourage us to develop the method further. Indeed, the current version of the method can be seen as a general tool for estimating genealogical relationships between sample units. In more complex applications, such as gene mapping, it can serve as a basis for extended models.
Appendices
In the following two appendices we describe the MCMC algorithm in more detail.
Appendix A first discusses why we are not able to sample from the posterior without MCMC. Then we briefly summarize the theoretical background of the MCMC methods, describe a way to generate an initial configuration for the chain, and finally explain the block-updates that are used to propose new states for the chain.
Appendix B introduces an algorithm to sample the phases of the parent's genotype given partially observed haplotypes of his/her children. It is used in our block-updates described in Appendix A, but it may also turn out to be useful in other settings.
We use the notation introduced in the Methods section. In addition we index the individuals in the pedigree starting from the present generation and assign labels 2k - 1 and 2k respectively to the paternal and maternal haplotype of individual k. If the allele at locus l of haplotype i is ancestral, we denote its type by h i (l) ∈ E l , otherwise setting h i (l) = ∅, where E l is the set of alleles at locus l. Thus for each individual k and locus l, g k (l) = {h2k-1(l), h2k(l)}.
A Sampling from the posterior distribution
A.1 Possibility of sampling without Markov chain Monte Carlo
We want to study the posterior distribution (1) by using a Monte Carlo method, so we need an algorithm for sampling random realizations from the posterior. In a way, constructing such an algorithm corresponds to time-reversing a Markov process in a discrete state space, where ideally we could write down the generator of the reversed dynamics jointly for the ancestral graphs and the allelic paths, and the posterior distribution of the haplotypes in the present generation given the data, and then sample directly independent realizations of the backward process. However, the number of terms involved in such summations grows extremely rapidly with the sample size n(0), the time horizon T and the number of loci, whence the computation of the reverse generator is not feasible in the real-life data problems we have in mind. Next we briefly examine some alternative ideas.
First, we could proceed naively, by sampling repeatedly from the prior until we obtain a realization ω ∈ . This is problematic when π() is very small, as is the case here, and on average it takes far too many attempts to obtain even one realization from .
Alternatively, since Ω is a finite set for any given (n(0), T, L), we could compute and draw samples directly from the posterior just by summing π(ω)1(ω ∈ ) over ω ∈ Ω. Again, this is not feasible in practice when |Ω| is very large.
As a third possibility, we could hope to have better chances for direct simulation by using hidden Markov model techniques. We discuss briefly this idea, since it is used later in the construction of the MCMC algorithm. Given the ancestral graph G with |G| individuals, we construct, according to the recombination model, a Markov chain (ψ(1), ..., (ψ(L)), where ψ(l) = (ψ2k-1, ψ2k: k ≤ |G| - ||), is the set of founders and ψ i (l) denotes the grandparental origin of the allele (ancestral or censored) at locus l of haplotype i. In order to preserve the Markov property across consecutive loci, the configuration space has to contain also the grandparental origins of the censored alleles. For the individuals in the present generation (t = 0) we also need a random parental phase matrix (ϕ k (l) : k = 1, ..., n(0), l = 1, ..., L) with entries ϕ k (l) ∈ {0, 1} to determine the haplotypes. The random variables (ϕ k (l)) are a priori i.i.d. with P(ϕ k (l) = 0) = P(ϕ k (l) = 1) = and together the pairs (ψ(l), ϕ(l)) form a Markov chain on the finite state space {0, 1}d, where d = 2(|G| - ||) + n(0). At each locus l, the vector Y(l) = (g i (l) : i = 1, ..., n(0)) is observed. The corresponding likelihood contribution from locus l is given by
where the genotypes of the founders are determined by the triple (Y(l), ϕ(l), ψ(l)).
By using the Viterbi algorithm, it is possible to sample directly the random vector ((ψ(l), ϕ(l)) : l = 1, ..., L) conditionally on the data Y. Kruglyak and Lander [31] proposed an efficient implementation of the Viterbi algorithm using Fourier transforms on the commutative group {0, 1}d(see also section A.7.2). The Viterbi algorithm could also be used to integrate out the allelic paths and to obtain the marginal likelihood P(Y(1), ..., Y(L)|G) of the data set Y given the ancestral graph G. We would then be left with the problem of sampling from a posterior distribution on the finite space of ancestral graphs with n(0) roots spanning T generations backwards in time, with probabilities proportional to
P(G) × P(Y(1), ..., Y(L)|G).
There are two problems in this approach, however: Firstly, the Kruglyak-Lander algorithm can be implemented only for small values of d, say d ≃ 20 at most. Secondly, the number of possible ancestral graphs grows extremely rapidly with n(0) and T.
In summary, the direct sampling methods described above will work only for small ancestral graphs, and next we shall describe the general ideas of Markov chain Monte Carlo methods that can yield approximate results also in more complex settings.
A.2 Metropolis-Hastings algorithm
The Metropolis-Hastings algorithm is a recipe to construct a reversible and ergodic Markov chain (Z n : n ∈ IN) with a given invariant probability distribution μ(z) on a state space [49]. The method can be used for general state spaces but we shall restrict our considerations to the case where is finite. We choose a proposal transition kernel Q(z → ) and define the corresponding acceptance probability
Note that in order to compute a(z → ) we need to know the target measure μ(z) only up to a normalizing constant. The corresponding transition kernel of the Markov chain (Z n : n ∈ IN) is then given by
In other words, given the previous state zn-1, we draw a random sample from the proposal distribution Q(zn-1→ ) and then let Z n = with probability a(zn-1→ ), otherwise setting Z n = zn-1. The distribution of the initial state Z0 together with the transition kernel specifies the distribution of the Markov chain (Z n : n ∈ IN). If the chain is irreducible, the construction results in an ergodic Markov chain for which the law of large numbers holds meaning that for any integrable function f(z),
with probability one. This is a useful result when we would like to approximate numerically the expectation on the right hand side, but there is no practical algorithm producing i.i.d. realizations from μ(z). The choice of the proposal distribution Q(z → ) determines the mixing properties of the Metropolis chain (Z n ). If mixing is too slow, the Metropolis chain is useless for Monte Carlo computations.
It is also possible to combine different Metropolis kernels K i (z → ), i ∈ IN, with the same invariant distribution μ and still obtain a Markov chain (Z n ) such that (4) holds. One way is to consider adistribution (p i ) on the non-negative integers, and define a new transition kernel as the mixture
Another possibility would be to combine the proposals into the new proposal
where the mixing distribution is allowed to depend on the current state z, and the corresponding transition kernel is computed by using (2) and (3).
A.2.1 Gibbs' updates
It is sometimes possible to represent the state space as a Cartesian product = . For z = (z', z") ∈ , the proposal kernel of a Gibbs update is given by
and we obtain another Gibbs' update by inverting the roles of z' and z". The corresponding acceptance probability satisfies a((z', z") → ()) ≡ 1. However, it is not always the case that we can use the Gibbs update for a given decomposition of the state space, since this requires direct sampling from the conditional distribution μ(z"|Z' = z').
A.3 MCMC with auxiliary variables
Here we explain a procedure which is used frequently in the MCMC-literature (see Appendix 2 in [50]). Suppose we have constructed a Markov chain (Z n ) on the (finite) state space with a given equilibrium distribution π(z). Consider an enlarged state space , together with a stochastic kernel p(y|z) : → [0, 1]. Define the probability measure (z, y) = π(z)p(y|z). The idea is to explore the marginal distribution by constructing an ergodic Markov chain ( n ) on the enlarged state space with equilibrium distribution (z, y). In what follows, we assume that for every z ∈ we have an algorithm to sample a random realization of Y from p(y|z) and a numerical procedure to compute this conditional distribution.
If K(z → ) is a transition kernel which is reversible with respect to π(z), we can always consider on the enlarged state space the transition kernel
which will automatically be reversible w.r.t (z, y). In particular, the transition kernel is reversible w.r.t (z, y).
Next consider the Metropolis transition kernel with joint proposal distribution on the state space . The corresponding acceptance probability for transition (z, y) → is then given by
Using this Metropolis-Kernel defined on the enlarged state space , we construct a Markov chain ( n ) with invariant distribution π(z) and an initial state z0 on the original state space , as follows:
-
(i)
given the previous state z n , sample y n ~ p(y n |z n );
-
(ii)
sample (z', y') ~ ((z n , y n ) → (z', y'));
-
(iii)
with probability a((z n , y n ) → (z', y')) take Zn+1= z', otherwise Zn+1= z n . Forget y n and y'.
Alternatively one could define a proposal kernel Q(z → ) directly on the original state space by summing out y, i.e.,
However, the computation of this transition distribution requires an extra summation step and there are situations in which we cannot afford using this direct proposal distribution in the Metropolis step.
A.4 Constructing an initial configuration for the Markov chain
Now we return to our application. By using the Metropolis-Hastings algorithm, we shall construct a Markov chain on the configuration space Ω with invariant distribution equal to the constrained distribution (1).
Before entering that topic, however, we need to construct a configuration ω0 ∈ serving as an initial state for the Metropolis algorithm. In other words, we need to find an ancestral graph and corresponding gene flow variables that are logically consistent with the data. Since this is a nontrivial problem, we describe the procedure in detail. Later we use a similar construction to obtain a proposal distribution for the Metropolis algorithm.
We start from generation 0 with n(0) sampled individuals and their (partially) observed genotypes (g k (l) : k = 1, ..., n(0), l = 1, ..., L). Sequentially, following a uniformly distributed permutation, these individuals will choose their parents from generation 1 and transmit their genes to the chosen parents. Denote by {X k = (f, m)} the event that child k has chosen f as the father and m as the mother from amongst possible fathers and possible mothers in the population. To start the construction, suppose that the first child chooses the first father f and the first mother m from the population, and transmits his/her alleles to these parents. For a generic locus l, let g1(l) = {a, b} be the genotype of the first child. With probability we set h1(l) = a, h2(l) = b, g f (l) = {a, ∅} and g m (l) = {b, ∅}, and otherwise we set h1(l) = b, h2(l) = a, g f (l) = {b, ∅} and g m (l) = {a, ∅}. Note that this determines only the haplotypes of the first child, and partially the genotypes of his/her parents, but does not give any information about the pattern of meioises that led to the haplotype of the child.
Proceeding recursively, assume that the first (k - 1) children in the present generation have already chosen altogether F(k - 1) fathers and M(k - 1) mothers, and have transmitted their alleles to the chosen parents. Child k can then choose parents from among these F(k - 1) fathers and M(k - 1) mothers or from the ( - F(k - 1)) fathers and ( - M(k - 1)) mothers who had not yet been chosen by any child. In doing so the child must take into account his/her own genotype and the possibly censored genotypes of the candidate parents, which, at this stage, contain the alleles transmitted by the preceding (k - 1) children. Indeed child k chooses father f and mother m from generation 1 by sampling from a distribution proportional to
Here the term P(X k = (f, m)|X1, ..., Xk-1) is the contribution from the prior distribution of the ancestral graph, g k (l) is the observed genotype (at locus l) of child k, and g f (l) and g m (l) are the current values of the partially determined genotypes of the parents. Finally, the transmission likelihood P(g k (l)|g f (l), g m (l)) is defined as follows under the assumption of free recombination.
If both parental genotypes g f (l) = {a, b} and g m (l) ={c, d}, with a, b, c, d ∈ E l , are already fully determined,
If some of the parental alleles are only partially determined by the previous transmission events, we integrate the missing alleles out:
where for a, b, c, d ∈ E l , we define the conditional population genotype frequencies as
After choosing parents f and m, child k transmits his/her alleles to them. For that, let us consider the assignment of phase. Let g f (l) = {a, b} and g m (l) = {c, d} be the current values of the genotypes of the chosen parents with a, b, c, d ∈ E l ∪ {∅}, and let {x, y} be the genotype of the child. When a, b, c, d ∈ E l are already determined, we have
where
and then set, according to this probability, h2k-1(l) = x and h2k(l) = y, and otherwise h2k-1(l) = y and h2k(l) = x.
If there are undetermined alleles among a, b, c and d, we integrate them out with respect to the conditional genotype frequencies. Thus the parental origins of the alleles x and y are sampled according to the probability (10), where formula (11) is extended to the case of partially censored parental genotypes {a, b}, {c, d}, with a, b, c, d ∈ E l ∪ {∅} by
Then, given the phase of the alleles of the child at locus l, we update independently the genotypes of the parents. To do so, consider the case where h2k-1(l) = x, that is, x came from the father, and let g f (l) = {a, b} be the current value of the father's genotype. There are three cases to consider: (i) If a and b are both determined, there is nothing to do, (ii) If a = b = ∅, we set g f (l) = {x, ∅}, and (iii) If a ∈ E l and b = ∅, and if a ≠ x, then the genotype of the father must be g f (l) = {a, x}, whereas if a = x, we set g f (l) = {a, a} with probability
and otherwise leave g f (l) = {a, ∅}. After having completed this step for all the individuals in generation 0, we have determined their haplotypes and also partially the genotypes of their parents.
For the induction step, we assume that we have followed the procedure for t - 1 generations. and we now describe the procedure for generation t <T. The individuals in generation t have to choose parents from generation (t + 1) and transmit their ancestral alleles to these parents. Note that the situation in generation t > 0 differs from the situation in generation 0, since not all alleles are necessarily ancestral. Let g k (l) = {a, b} be the genotype of individual k in generation t. He or she will choose parents according to the distribution given in (8). In case a, b ∈ E l , that is, both alleles are ancestral, we proceed as in the case t = 0. Otherwise, however, we need to specify the probabilities P(g k (l)|g f (l), g m (l)) also in cases in which g k (l) is censored or partially censored.
If g k (l) is completely censored, we make the convention that P(g k (l) = {∅, ∅}|g f (l), g m (l)) ≡ 1. If g k (l) is partially censored, we define for x ∈ E l and a, b, ∈ E l ∪ {∅},
which is the conditional probability of picking the allele x from the partially observed genotype pair {a, b} that was sampled from the population. Then we define, for x, a, b, c, d ∈ E l ,
which is the probability that, given the information on the genotypes of the parents, a randomly chosen allele from the genotype of child k at locus l is of type x. Having chosen the parents, we decide the parental origin of the allele x according to the probability
After this, it remains to update the genotype of the chosen parent by transmitting the allele x. This is done in exactly the same way as for the alleles in generation 0.
Having followed this procedure for all children in generation t, we have determined their ancestral haplotypes, the censoring pattern on these haplotypes, and the ancestral genotypes of the parents. Moreover, when t > 0, this determines partially the meiosis pattern of the ancestral haplotypes of the individuals in generation (t - 1). For example, let m be the mother (in generation t) of child k (in generation (t - 1)). This child has inherited from his/her mother the haplotype h2k, and the corresponding meiosis pattern ψ2kis determined, for every locus l, as follows:
-
(i)
If h2k(l) = ∅, we set ψ2k(l) = ∅.
-
(ii)
Let h2k(l) = x ∈ E l , and let the alleles of the mother, respectively of grandpaternal and grandmaternal origin, be h2m-1(l) = a and h2m(l) = b, a, b ∈ E l ∪ {∅}. Note that (a, b) ≠ (∅, ∅), since either (a = x) or (b = x). If a ≠ b, the grandparental origin of the ancestral allele x is determined as ψ2k(l) = 1(x = a). The grandparental origin of x remains undetermined only when a = b = x ∈ E l . In such a case we have to sample simultaneously the grandparental origins of all ancestral alleles at locus l in generation t - 1 that are inherited from the mother m. In order to do so, suppose that k1, ..., k n are the children of mother m and h2m-1(l) = h2m(l) = = x ∈ E l . Note that since the mother has two ancestral alleles, necessarily n ≥ 2, and we may exclude the event .
Conditioning on the complementary event Icis equivalent to the following procedure: We sample without replacement two children, say and , among {k1, ..., k n }, and assign (l) = 0 and (l) = 0. Given this the grandparental origins of the remaining children are conditionally independent Bernoulli () random variables.
We iterate the above procedure backwards in time until we reach the founder generation, where we determine, by tossing fair coins, the parental origins of the founders' ancestral genotypes. Given that, we compute the meiosis pattern on the haplotypes in generation (T - 1) as described above.
Remarks
-
(1)
Although at a first reading these sampling formulae may not be completely obvious, the idea behind the sequential scheme is simple: a child chooses his/her parents from the population and transmits his/her alleles to the parents according to Bayes' formula, by conditioning on his/her own alleles as well as on the parental choices and allele transmissions of the previous children. Note, however, that the procedure is not fully Bayesian, since the conditioning does not include at every intermediate stage the information about the genotypes of the children still in the list, and that we are not taking into account the true recombination likelihood. Therefore the resulting configuration is only compatible with the data, but not an exact sample from the posterior distribution.
-
(2)
An alternative way to proceed would be to first let all children from the considered generation choose their parents, and then sample jointly the parental phases of the alleles of the children given the family structure, and finally transmit the alleles to the parents. We develop this idea later in section A.7, as we construct a block-update for the Metropolis-Hastings algorithm.
-
(3)
If the population of candidate parents is small, it is possible that, when a child k is choosing his/her parents, the genotypes of the possible parents are already partially determined by the alleles of the previous (k - 1) children in such a way that every possible parental choice is logically incompatible with the genotype of child k. When such a contradiction is found, we have to restart from the beginning, or at least from one generation back. If the algorithm keeps failing, we may need to increase the size of the population in the prior.
A.4.1 Incorporating the true recombination likelihood
In the construction of an initial configuration we have used the model with free recombination. In the case of closely linked markers, the resulting initial configuration will be compatible with the data, but it will not look realistic, since most likely it will contain too many recombinations.
It is shown in Appendix B, how to apply the Viterbi algorithm to sample the parental phase vector ϕ k = (ϕ k (1), ..., ϕ k (L)) of an individual k, say for a female, in generation t > 0, from a joint conditional distribution, where we condition on her partially observed genotypes (g k (1), ..., g k (L)), on the partially observed genotypes (g f (1), ..., g f (L)) and (g m (1), ..., g m (L)) of her parents f and m in generation (t + 1), and on the partially observed haplotypes {( : j = 1, ..., n} which she has transmitted to her children k(1), ..., k(n) in generation (t - 1). We can then generally improve on the initial configuration by substituting the sampling distribution (10) or (11) by this joint conditional distribution that takes into account the true recombination likelihood of the haplotypes of the children. It is also shown in Appendix B, how to integrate out the phase vector ϕ k , in order to compute the marginal likelihood of the partially observed haplotypes of the children. We could compute this marginal likelihood for all logically compatible choices of pairs of grandparents and include it as a new factor in expression (8) which is proportional to the probability of choosing a pair of grandparents. Alternatively, we could choose the grandparents using expression (8) with free recombination, and then use the true recombination likelihood only for assigning the parental origins to the genes of the parent.
A.5 Block-updates for the Metropolis-Hastings algorithm
Next we discuss the construction of a proposal distribution Q(ω → ) for the Metropolis-Hastings algorithm on the configuration space Ω. Ideally, we would like to propose only configurations that are compatible with the data, or at least we want that Q(ω → ) is not too small when starting from some ω ∈ . The resulting Markov chain should also be irreducible, that is, it should be able to reach every state in with positive probability regardless of where it started from. Single-site updates, like changing the phase of one allele at a time, are not enough to produce an irreducible chain. Larger block-updates are needed also because changes to the ancestral graph usually require simultaneous changes to the allelic paths.
Our aim is to construct a block-proposal distribution by applying locally similar ideas that were used in generating an initial configuration. Starting from some configuration ω ∈ , a selected group of individuals in the ancestral graph try to choose new parents and redirect their ancestral alleles from the "old ancestors" to the new ones. This is done by conditioning on the paths of the ancestral alleles of the other individuals, which are kept fixed during the block-update step. Before making these points more precise, we introduce some more notation.
A.5.1 Conditional probabilities for the types of non-ancestral alleles
Given a configuration ω, we define fr(a|k, ω) for any locus l and individual k as the conditional probability that k transmits an allele of type a ∈ E l at locus l to a "hypothetical new child". When the genotype g k = {} at locus l of individual k is fully ancestral, we have
Otherwise individual k has at least one non-ancestral allele at this locus, and in order to compute fr(a|k, ω), we must take into account all possible ways in which he/she may have inherited the non-ancestral allele(s) from his/her ancestors.
We also need the joint transmission probabilities for pairs of alleles. We denote by fr(a, a'|k, k', ω) the conditional probability that individual k transmits an allele of type a to his/her hypothetical new child and simultaneously individual k' transmits an allele of type a' to his/her hypothetical new child. Here a, a' ∈ E l , and we let the indexes k, k' go through all individuals in a given generation. If at least one of the genotypes g k or g k' is fully ancestral,
fr(a, a'|k, k', ω) = fr(a|k, ω)fr(a'|k', ω).
However, this equation may not hold, if in configuration ω both k and k' carry some non-ancestral allele at locus l, and there is a common ancestor of k and k' who has had a positive probability of transmitting the same non-ancestral allele to both k and k'.
We compute these transmission probabilities recursively, starting from the founders, by setting
fr(a|k, ω) = fr(a|g k ),
where g k = {} ⊆ E l ∪ {∅}, is the possibly censored genotype of founder k in configuration ω, and fr(a|{x, y}) was defined in (12) by using the genotype frequencies of the population. Moreover, if k and k' are distinct founders, we have
fr(a, a'|k, k', ω):= fr(a|k, ω)fr(a'|k', ω),
whereas for k = k'
where the sum is taken over all possible genotypes at locus l. It remains to specify fr(a|k, ω) and fr(a, a'|k, k', ω), when k and k' are not founders and g k or contains non-ancestral alleles. For that let f and m be the father and the mother of k, and let x and y be his/her possibly censored alleles at locus l of paternal and maternal origin, respectively. Since we proceed recursively, we may assume that fr(a|f, ω) and fr(a|m, ω) are already computed. Then
fr(a|k, ω) =
{1(x = a) + 1(x = ∅)fr(a|f, ω) + 1(y = a) + 1(y = ∅)fr(a|m, ω)},
for all a ∈ E l . It remains to speficy fr(a, a'|k, k', ω), when k and k' are not founders and both genotypes g k = {x, y} and = {x', y'} contain some non-ancestral alleles. Let (f, m) and (f', m') be the parents of k and k', and assume that x, x' ∈ E l ∪ {∅} are paternal and y, y' ∈ E l ∪ {∅} are maternal. For a, a' ∈ E l , we obtain, again using recursion, that
where the formula holds also if k = k' or f = f' or m = m'.
These recursive formulae are exact, taking into account all intersections between the possible paths of the two non-ancestral alleles. In principle we could extend these formulae to allele triples, quadruples, and generally, for n-tuples. However, since we will consider only one genotype at a time, we need only the joint transmission probabilities for pairs of alleles.
A.5.2 Conditional genotype probabilities
For any pair of candidate parents f and m, belonging to generation 1 ≤ t ≤ T, we define the conditional genotype probabilities at every locus l of a hypothetical common child k by
P({a, b}|f, m, ω) := fr(a, b|f, m, ω) + 1(a ≠ b)fr(b, a|f, m, ω), a, b ∈ E l .
We also define, for a ∈ E l ,
which corresponds to the conditional probability that an allele at locus l of a hypothetical common child is of type a. We also set, as a convention, P({∅, ∅}|f, m, ω) = 1.
A.6 Block-update I: Children choosing new parents
We consider a randomly selected group of children, indexed by k1, ..., k n , all belonging to the same generation 0 ≤ t <T, who will choose new parents and redirect their ancestral alleles from their "old ancestors" to "new ancestors". Before choosing new parents, these children must withdraw their alleles from their old parents. Starting from configuration ω, we construct a modified configuration in which, ascending from generation (t + 1) to generation T, we delete the paths of the ancestral alleles that go through the children k1, ..., k n . Consequently all the ancestral alleles that were transmitted only by these children become censored.
Next we look at the individuals in generation (t + 1), including the part of the population which was left outside the ancestral graph, and consider all possible parental pairs (f, m). Recall that the individuals outside the ancestral graph have genotype {∅, ∅} by default, and their conditional genotype probabilities coincide with the population genotype frequencies. Following the sequential ordering, each of the children k1, ..., k n will choose randomly a pair of parents from generation (t + 1) and transmit to them (and to their ancestors) his/her ancestral alleles. Locally this is like the construction of the initial configuration explained in section A.4, with the difference that here we must condition on the upper part of the ancestral graph and on the allelic paths that are determined by the modified configuration . Indeed, if child k has genotypes , then he/she chooses parental pair X k = (f, m) with a probability proportional to expression (8), where we extend the definition of P(g k (l)|g f (l), g m (l)) to non-founder parents by means of the conditional genotype probabilities P(g k (l)|f, m, ) given in formulae (13) and (13).
A.6.1 Dropping and adding ancestors
After the children k1, ..., k n have chosen new parents, it is possible that a former parent in the ancestral graph is left without children. In this case he or she will be dropped from the ancestral graph. By induction, the same may apply to the more distant ancestors in the elder generations. In order to obtain a reversible Markov chain, the children are given the possibility of choosing new parents also from the population outside the ancestral graph, whence these new parents together with their ancestors will become a part of the updated ancestral graph.
Note that it is straightforward to use the sequential construction of the prior distribution to sample the ancestry of the population outside the ancestral graph, conditionally on the ancestral graph. In principle such a resampling step must be included in the MCMC algorithm as a pre-move, before updating the ancestral graph. However, that may not be practical if the size of the population is large. In that case, instead of resampling the ancestry of all members of the population, it is enough to sample (conditionally on the current ancestral graph) the ancestry of a limited number of candidate parents outside the ancestral graph.
A.6.2 Resampling the paths of the ancestral alleles
Having chosen a new father f and a new mother m for child k, we sample new parental origins of the ancestral alleles of k by extending formulae (10) and (11) as follows: When both (l) and (l) are ancestral, we have
Otherwise, if (l) is ancestral and (l) is non-ancestral, we have
and symmetrically, if (l) is ancestral and (l) is non-ancestral,
Finally, if both (l) and (l) are non-ancestral, we have that
After this, in case one or two ancestral alleles were transmitted, we must update the genotypes of the new parents f and m, and eventually also the genotypes of their ancestors higher up in the ancestral graph. If both genotypes of the new parents are already fully ancestral at the given locus l, there is nothing to do. If the parents are founders, we proceed as in section A.4. Otherwise, let the alleles in locus l of child k be h2k-1(l) = a (paternal) and h2k(l) = b (maternal), with a, b ∈ E l ∪ {∅}, and at least one of them ancestral. Consider first the case in which only one of a and b is ancestral and, for example, that it is paternal. If the father f has already two ancestral alleles, there is nothing to do. Otherwise, let h2f-1(l) = x and h2f(l) = y, x, y ∈ E l ∪ {∅}, where x or y or both are non-ancestral. We denote by f' and m' the father and the mother of f, respectively. With probability
the ancestral allele a was inherited from the grandfather f', and in case h2f-1(l) was censored, we update it to h2f-1(l) = a, and leave h2f(l) = y. Otherwise a was inherited from the grandmother m', and if h2f(l) was censored, we update it by setting h2f(l) = a, and leave h2f-1(l) = x.
Next we consider the case where both a and b are ancestral. If only one parent has censored alleles, we are back to the previous case, since at most one parental allele will be updated. Otherwise we have to follow the origins of a and b simultaneously. Assume therefore that both f and m have censored alleles, that is, h2f-1(l) = x, h2f(l) = y and h2m-1(l) = x', h2m(l) = y', with x or y or both censored and x' or y' or both censored. Let f' and m' be the father and mother of f, and let f" and m" be the father and mother of m. Then with probability
a was inherited from f' and b was inherited from f", with probability
a was inherited from m' and b was inherited from f", with probability
a was inherited from f' and b was inherited from m", and finally with probability
a was inherited from m' and b was inherited from m". Here C is a normalizing constant. In each of these cases, the corresponding alleles of the parents become ancestral if they were non-ancestral before. This completes the updating procedure for the alleles of the parents.
If some censored allele became ancestral we must update the alleles of the grandparents as well, and possibly continue the procedure further backwards in time, until the alleles coalesce to some ancestral alleles or until the founder generation is reached. This is done in the same way as we updated the alleles of the parents.
We resume the updating procedure as follows. Given the choice of new parents, we sample new allelic paths for the ancestral alleles carried by the child. The path of an allele is a random walk on the ancestral graph, where in each generation the allele is assigned to either the paternal or the maternal origin, conditionally on the paths of the other ancestral alleles determined by the configuration . The new path of an ancestral allele is sampled sequentially until it crosses a path of an ancestral allele of the same type in the configuration , or until the path reaches the founder generation. Note also that, if the child transmits two ancestral alleles to the parents at some locus l, we are coupling the corresponding allelic paths in such a way that the paths are always compatible with each other and with the configuration .
A.6.3 Incorporating the true recombination likelihood
In the update procedure we have described so far, the transmission patterns are resampled by using the model with free recombination. However, we are able to take the true recombination likelihood partially into account. Namely, having assigned the new parents f and m to child k in generation 0 <t <T, we sample jointly the vector of the parental origins (ϕ k (1), ..., ϕ k (L)) of the alleles of k, by conditioning on the ancestral alleles of the ancestors of k and on the ancestral haplotypes that he/she transmits to his/her children in generation (t - 1). This is done by the Viterbi algorithm given in appendix B, where we specify the prior π for the vector of parental origins (ϕ k (1), ..., ϕ k (L)) as
and the right hand side of the last expression was defined in formulas (14–17).
Similarly we improve the procedure which updates the paths of the ancestral alleles (see section A.6.2). When an ancestor receives alleles from his/her descendants, some of his/her censored alleles may become ancestral and we need to sample the phases of these alleles. The Viterbi algorithm can be used to sample all these phases jointly (keeping the phases of the ancestral alleles fixed) by combining the product of the sampling distributions across the marker loci as given in section A.6.2 with the recombination likelihood contribution of the children's haplotypes.
A.6.4 Completing the block-update
Once the procedure is completed, we have updated the modified configuration by assigning new parents to children k1, ..., k n and new paths to their ancestral alleles. The resulting updated configuration is the proposal state in the Metropolis-Hastings algorithm. Note that when creating we are also able to compute sequentially the proposal probability Q(ω → ), and similarly, starting from , we can compute the proposal probability Q( → ω) for the reverse transition. Therefore the corresponding Metropolis-Hastings update can be implemented.
We also use slightly different versions of this block-update. In the first of these the children do not change parents but the paths of their ancestral alleles are updated simultaneously. This is done by simply skipping the sampling of parents described in section A.6. In the other version we let the children involved in the update belong to different generations.
Remarks
-
(1)
If one child is selected to choose new parents under the model with free recombination this proposal distribution is a Gibbs' update (see section A.2.1). The same does not hold more generally, since, when updating the parental choice and the paths of the ancestral alleles of child k1, we take into account neither the ancestral alleles of the other children k2, ..., k n , nor the complete recombination likelihood.
-
(2)
If more than one child are selected for an update, it is possible that a selected child k i does not find any parents compatible with his/her genotypes. This may happen when the children k1, ..., ki-1have already transmitted their alleles to their ancestors in such a way that it is no longer possible to extend the paths of all ancestral alleles of child k i up to the founder generation. In this case the proposed block-update is rejected.
A.7 Block-update II: Half-siblings changing one parent
We take a random father in generation (t + 1) ≤ T and consider all his children belonging to generation t, denoted by k1, ..., k n . These children are going to stay with their original father but will choose new mothers and consequently the paths of their ancestral alleles will be resampled. (To be politically correct, we also use the symmetric update which switches the roles of mothers and fathers.) We could continue as in the previous sections, resampling sequentially, for one child at a time, a new mother and new paths of the ancestral alleles. However, there is a potential problem here: as explained in section A.6.2, after a new mother has been chosen, the parental phases of the alleles of child k1 are resampled without simultaneously considering the ancestral alleles of children k2, ..., k n . When there are many children and many marker loci, it becomes unlikely that this procedure will assign several children to the same mother, and most of the time the algorithm proposes to add more mothers to the ancestral graph than would be necessary. As a consequence, if the true ancestral graph contains couples with many children, the corresponding Metropolis chain is slowly mixing, and the mixing gets even worse as the number of markers increases. To improve on that, we change the order in the resampling procedure. First the children k1, ..., k n choose new mothers, and then we sample jointly the new parental phases of their alleles, by conditioning on the ancestral alleles of the new parents and taking into account the recombination likelihood contribution from the haplotypes of the children k1, ..., k n . This block-update is computationally demanding, but it concerns only a small number of children at a time.
A.7.1 Sampling distribution for choosing the mothers
We start from the modified configuration obtained as in section A.6 from the current configuration ω by withdrawing the ancestral alleles transmitted by the children k1, ..., k n to their ancestors. Let f be the common father of these children. Let k1 choose his/her mother as in section A.6, with the constraint that the father f is fixed. Thus k1 chooses his/her mother m1 from a distribution proportional to
Instead of continuing as in section A.6.2, i.e. by sampling the parental phases of the alleles of k1 and updating the genotypes of the parents f and m1 at each locus l, we compute the joint conditional distribution
under the model with free recombination. Here at every locus l, (l) and (l) are possible values of the updated genotypes of the parents after k1 has transmitted to them his/her ancestral alleles, and (l) is the parental phase of the alleles of k1. The computation of (20) is done by combining the computations from section (A.6.2).
We shall illustrate the procedure with an example. Assume that given the configuration , the ancestral genotypes of the parents and the child are respectively g f (l) = {∅, ∅}, (l) = {a, ∅}, and (l) = {a, b}. In this particular case formula (20) would give the following probabilities
Proceeding then by induction, suppose that children k1, ..., ki-1have chosen mothers m1, ..., mM(i-1), where M(i - 1) ≤ i - 1, and for each locus l we have computed the joint conditional distribution for the phases of the alleles of the children k1, ..., ki-1and of the genotypes of the parents, denoted by
Given that, we must specify how k i chooses his/her mother, and consequently, how the joint conditional distribution for the phases of the children's alleles and of the genotypes of the parents is updated.
Child k i will choose either one of the previously chosen mothers m1, ..., mM(i-1), or a mother who was not yet chosen by any of the first (i - 1) children, with probability proportional to
where X j = (f j , m j ) denotes the parental choice of child j, and
In the last expression the ancestral genotypes of the father and the mother are integrated out with respect to the conditional probability (20), after it is extended to the case in which a mother m had not yet been chosen by any of the first (i - 1) children, by the formula
Continuing with the example, consider the situation in which the second child has ancestral genotype (l) = {b, c} at locus l, and he/she has already chosen mother m1. Then the possible outcomes are
Let the third child with ancestral genotype (l) = {b, d} choose mother m2, who currently has ancestral genotype (l) = {d, ∅}. Then the possible outcomes are
Note that now the event {(l) = 0} has zero probability.
At the end of this sequential step children k1, ..., k n have chosen new mothers in a way which is compatible with their ancestral genotypes and the information about the genotypes of the parents carried by the modified configuration . We have also produced a joint sampling distribution for the phases of the children and the ancestral genotypes of the parents. Therefore we could complete the block-update by sampling, independently at each marker locus, the parental phases of the children's alleles and the ancestral genotypes of the parents from the joint sampling distribution above, and by transmitting recursively the new ancestral alleles upwards to the ancestors as in section A.6.2. However, this strategy would not take into account linkage between the marker loci, which is considered next.
A.7.2 Sampling the phases of the children's and parents' alleles jointly across the marker loci
We continue the block-update by computing for each parent j involved in the update and for each locus l the conditional probability of the parental phase ϕ j (l) given the transmitted genotype g j (l) and the genotypes of his/her ancestors under the modified configuration . Namely, as in equations (14–17),
where f' and m' are the father and the mother of j.
We then combine, at every locus l, the product of the conditional phase distribution of the parents together with the joint distribution of the phases of the children and of the ancestral genotypes of the parents, as given in the previous section. By doing this, we obtain a joint distribution for the parental phases of the children and of the parents at locus l. In turn, the phases of the children together with the phases of the parents determine the grandparental origins of the alleles of the children. This gives a joint distribution for the grandparental origins of the children at locus l, which will be denoted by
We now include in the sampling distribution the likelihood contribution from the consecutive marker loci. Recall that the recombination likelihood from the children's haplotypes is given by
where
Here ψ j (l) ∈ {0, 1} is the grandparental origin of the haplotype j at locus l, ρ(l, l + 1) is the recombination fraction between marker loci l and l + 1, and J = {2k1 - 1, 2k1, ..., 2k n - 1, 2k n }. We use the forward-backward Baum-Viterbi algorithm to sample (and to compute the sampling distribution of) the grandparental origins (ψ j (l) : j ∈ J, l = 1, ..., L) from the joint distribution proportional to
Note that
where rl, l+1(x) = ρ(l, l + 1)x(1 - ρ(l, l + 1))1-x.
In the forward part of the algorithm, starting with 1 (ψ.(l)) = Q1(ψ.(l)) we compute recursively for l = 1, ..., L - 1 the updated probability distributions
On the right hand side appears a convolution in the commutative group ({0, 1}2n, +) which is computed efficiently by discrete Fourier transforms [31]. In the backward part of the algorithm we sample first ψ.(L) ~ L , and then iteratively ψ.(l) given ψ.(l + 1) from the distribution
with normalizing constant C l = Ql+1(ψ.(l + 1))/l+1(ψ.(l + 1)).
Note that at any marker locus l different combinations of the children's phases and parents' phases may lead to the same vector ψ.(l) of children's grandparental origins. Therefore an additional step is required where, independently across the marker loci, the children's and parents' phases are sampled, conditionally on sampled grandparental origins of the children. This will determine also the ancestral genotypes transmitted to the parents. The block proposal is completed by transmitting the new ancestral alleles from the parents to their ancestors in the upper part of the ancestral graph. This is done exactly as in sections A.6.2 and A.6.3.
Remarks
-
(1)
The computational complexity of this block update grows linearly with L, the number of markers, and exponentially with n, the number of children involved in the block update. In practice we restrict our sampling algorithm to values n ≤ 7.
-
(2)
In this block update we are sampling a full meiosis vector (ψ j (l) : j ∈ J, l = 1, ..., L) which contains also the recombination pattern in the non-ancestral part of the haplotype. We have to proceed in this way, since it is not straightforward to sample the meiosis pattern only at the loci carrying ancestral alleles (the first-order Markov property across loci is lost). This means that we are temporarily extending the state space of the Markov chain algorithm with auxiliary variables. A theoretical justification is given in A.3. Note that, in order to compute the acceptance probability of this block update, we need to sample the meiosis pattern on the non-ancestral part of the haplotypes of the children as specified in the old configuration ω. This can be done directly by using the Kruglyak and Lander algorithm [31]. Once we have sampled the block update and computed the Hastings ratio, we erase the recombination pattern in the non-ancestral part of the haplotype and keep only the information about the paths of the ancestral alleles.
-
(3)
In the construction of this block proposal we have included the recombination likelihood from the haplotypes of the children but not the recombination likelihood from the haplotypes of the grandchildren and of the ancestors. Also these recombination likelihood terms contribute to the acceptance probability of the proposal.
A.8 Block-update III: Sex-switching
We introduce one more update step into the algorithm. Consider the bipartite graph formed by the individuals in generation t > 0, where two nodes are connected by an edge if and only if the corresponding individuals have at least one common child. This bipartite graph is decomposable into connected components. We select a random connected component and obtain the proposal configuration by switching the sexes of all the individuals in the selected component. Note that the prior distribution of the ancestral graph is not invariant under sex-switching (except for particular choices of the parameters α and β), but the distribution of the geneflow on the ancestral graph is, since our model for recombination does not depend on sex. Therefore, only the prior of the ancestral graph contributes to the Hastings' ratio which is given by min(1, P(G())/P(G(ω)). This simple update is important for the mixing of the sampler, since a fixed sex assignement would unnecessarily restrict the mating possibilities.
B Sampling the parental phases conditionally on the partially observed haplotypes of the children: a Viterbi algorithm
Consider a parent whose (fully observed) genotype at locus l is , where the two alleles are arbitrarily ordered. Let ϕ = (ϕ(1), ..., ϕ(L)) ∈ {0, 1}Lbe the phase vector of the parent, i.e. if allele was inherited from the (grand)father, then ϕ(l) = 0, whereas if it came from the (grand)mother, then ϕ(l) = 1.
For each of the n children of the parent we introduce an origin vector , which together with the genotype of the parent determines the haplotype h i = (h i (1), ..., h i (L)) inherited by the i-th child from the parent as follows:
Note that the origin vector together with the phase vector of the parent ϕ determines the grandparental origins and the recombination pattern for the haplotype h i .
Next we introduce a censoring mechanism δ i ∈ {0, 1}Lon haplotype i that is independent of ϕ and h i . We define the partially censored origin vector ζ i = (ζ i (1), ..., ζ i (L)) by setting
Given ζ i for all i = 1, ..., n, we define the partially censored parental genotype g(l) = {g0(l), g1(l)} ⊆ E l ∪ {∅} as follows:
Now the problem is to sample the parental phase vector ϕ conditionally on the children's partially censored origin vectors ζ i , i = 1, ..., n, the partially censored parent's genotype vector g, and the information available on the genotypes of the grandparents.
Without loss of generality we assume that g(l) ≠ {∅, ∅} for all l, since we can skip the loci where the genotype of the parent is completely censored. We assume that, a priori, the phases ϕ(l), l = 1, ..., L, are independent with respective distributions π(0, l) = (1 - π(1, l)) ∈ [0, 1]. The prior distributions π(ϕ(l), l) can be specified by using the information available on the genotypes of the grandparents as explained in section A.4.
Given the censoring pattern δ i on haplotype i, we denote the last uncensored locus up to l by j(δ i , l) := max{k ≤ l : δ i (k) = 1} if such a locus exists, and otherwise set j(δ i , l) := 0. Let and j(δ i , l) > 0}. Note that and since by assumption the parent transmits at least one allele to at least one child at each locus, l ∈ = {l} ⊆ {1, ..., l} for all l. Note also that without censoring, we would have simply for all l.
We then consider the process for l = 1, ..., L. This is a Markov chain whose transition law is simply described: the coordinates ϕ(k) do not change for , whereas ϕ(l) is sampled independently from the parental phase prior π(ϕ(l); l). At each locus l > 1 we observe ζ(l) = (ζ1(l), ..., ζ n (l)) where only the components in
K(l) = {i : ζ i (l) ≠ ∅ and j(δ i , l - 1) > 0}
contribute to the likelihood of the phase vector (ϕ(1), ..., ϕ(l)) by a factor
Here R(0, 0, ρ) = R(1, 1, ρ) = 1 - ρ, R(1, 0, ρ) = R(0, 1, ρ) = ρ, ρ (k, l) is the recombination fraction between the loci k and l and := {ζ(k) : k ≤ l}.
Summarizing: If we do not have censoring, (ϕ(l), ζ(l)) is a Markov chain with hidden state ϕ(l) ∈ {0, 1}, and given the censoring pattern (δ i : i = 1, ..., n), we have constructed a Markov process with an enlarged state space depending on the locus l, such that the conditional distribution of the observation process ζ(l), given and (ϕ(k), ζ(k) : k <l), depends only on , and the previous observations (ζ(j(δ i , l - 1)) : i ≤ n) at the last uncensored loci, i.e. given the censoring pattern δ i , (, (ζ i (j(δ i , l)) : i ≤ n)) is a Markov process.
After having enlarged the state space, we are back to the framework of hidden Markov models, and we can use the Viterbi algorithm to sample from the joint posterior distribution the vector (which contains the same information as the vector (ϕ(1), ..., ϕ(L)), conditionally on the observations (ζ i (1), ..., ζ i (L) : i = 1, ..., n). Next we sketch the algorithm.
At the first locus we have always = (ϕ(1)) and = π(ϕ(1); 1). For the following loci we compute sequentially the distribution of the phase vector conditionally on the data up to locus l:
where
Next, we sample from , and continue sequentially backwards from locus L to locus 1. Having sampled , we sample from the conditional distribution
By the same method one computes the posterior probability of a given sample (ϕ(1), ..., ϕ(L)), and it would also be possible to find the phase vector of maximal posterior probability by using dynamic programming.
In a similar fashion, we can compute recursively the marginal distribution of the data. For the given sampling pattern, at the first locus we have simply
Then we compute recursively, for 1 <l ≤ L,
and finally
It is quite remarkable that this algorithm does not need to sample the missing data. The possiblity of sampling the parental phases, by conditioning on the haplotypes of the children, is crucial for the mixing of the MCMC algorithm. Sampling the phases from the prior only is unlikely to produce a reasonable recombination pattern on the haplotypes of the children, with the consequence that the acceptance probability of the whole move will be very low.
References
Gao G, Hoeschele I, Sorensen P, Du FX: Conditional probability methods for haplotyping in pedigrees. Genetics. 2004, 167: 2055-2065. 10.1534/genetics.103.021055.
Lin S, Cutler DJ, Zwick ME, Chakravarti A: Haplotype inference in random population samples. Am J Hum Genet. 2002, 71: 1129-1137. 10.1086/344347.
Blouin MS: DNA-based methods for pedigree reconstruction and kinship analysis in natural populations. Trends Ecol Evol. 2003, 18: 503-511. 10.1016/S0169-5347(03)00225-8.
Cowell RG, Mostad P: A clustering algorithm using DNA marker information for sub-pedigree reconstruction. J Forensic Sci. 2003, 48: 1239-1248.
Lange EM, Lange K: Powerful allele sharing statistics for nonparametric linkage analysis. Hum Hered. 2004, 57: 49-58. 10.1159/000077389.
Du FX, Hoeschele I: A note on algorithms for genotype and allele elimination in complex pedigrees with incomplete genotype data. Genetics. 2000, 156: 2051-2062.
Pirinen M, Gasbarra D: Finding consistent gene transmission patterns on large and complex pedigrees. IEEE/ACM TCBB. 2006, 3: 252-262.
Falush D, Stephens M, Pritchard J: Inference of population structure using multilocus genotype data: linked loci and correlated allele frequencies. Genetics. 2003, 164: 1567-1587.
Corander J, Waldmann P, Sillanpää MJ: Bayesian analysis of genetic differentiation between populations. Genetics. 2003, 163: 367-374.
Gasbarra D, Pirinen M, Sillanpää MJ, Salmela E, Arjas E: Estimating genealogies from unlinked marker data: a Bayesian approach. Theor Pop Biol. 2007, 72: 305-322. 10.1016/j.tpb.2007.06.004.
Jannink JL, Wu XL: Estimating allelic number and identity in state of QTLs in interconnected families. Genet Res. 2003, 81: 133-144. 10.1017/S0016672303006153.
Pérez-Enciso M: Fine mapping of complex trait genes combining pedigree and linkage disequilibrium information: a Bayesian unified framework. Genetics. 2003, 163: 1497-1510.
Meuwissen THE, Goddard ME: Mapping multiple QTL using linkage disequilibrium and linkage analysis information and multitrait data. Genet Sel Evol. 2004, 36: 261-279. 10.1051/gse:2004001.
Sobel E, Papp JC, Lange K: Detection and integration of genotyping errors in statistical genetics. Am J Hum Genet. 2002, 70: 496-508. 10.1086/338920.
Sobel E, Sengul H, Weeks DE: Multipoint estimation of identity-by-descent probabilities at arbitrary positions among marker loci on general pedigrees. Hum Hered. 2001, 52: 121-131. 10.1159/000053366.
Thompson EA, Heath SC: Estimation of conditional multilocus gene identity among relatives. Statistics in Molecular Biology and Genetics: selected proceedings of a 1997 joint AMS-IMS-SIAM summer conference on statistics in molecular biology, IMS Lecture Note-Monograph Series. Edited by: Seillier-Moseiwitch F. 1999, 33: 95-113.
Mao Y, Xu S: A Monte Carlo algorithm for computing IBD matrices using incomplete marker information. Heredity. 2005, 94: 305-315. 10.1038/sj.hdy.6800564.
Hernández-Sánchez J, Haley CS, Wooliams JA: Prediction of IBD based on population history for fine gene mapping. Genet Sel Evol. 2006, 38: 231-252. 10.1051/gse:2006001.
Kingman JFC: The coalescent. Stochastic Proc Appl. 1982, 13: 235-248. 10.1016/0304-4149(82)90011-4.
Hudson RR: Properties of a neutral allele model with intragenic recombination. Theor Pop Biol. 1983, 23: 183-201. 10.1016/0040-5809(83)90013-8.
Griffths RC, Marjoram P: Ancestral inference from samples of DNA sequences with recombination. J Comp Biol. 1996, 3: 479-502.
Stephens M, Smith NJ, Donnelly P: A new statistical method for haplotype reconstruction from population data. Am J Hum Genet. 2001, 68: 978-989. 10.1086/319501.
Larribe F, Lessard S, Schork NJ: Gene mapping via the ancestral recombination graph. Theor Pop Biol. 2002, 62: 215-229. 10.1006/tpbi.2002.1601.
Morris AP, Whittaker JC, Balding DJ: Fine-scale mapping of disease loci via shattered coalescent modeling of genealogies. Am J Hum Genet. 2002, 70: 686-707. 10.1086/339271.
Zöllner S, Pritchard JK: Coalescent-based association mapping and fine mapping of complex trait loci. Genetics. 2005, 169: 1071-1092. 10.1534/genetics.104.031799.
Kuhner M, Felsenstein J: Sampling among haplotype resolutions in a coalescent-based genealogy sampler. Genet Epidemiol. 2000, 19: 515-521. 10.1002/1098-2272(2000)19:1+<::AID-GEPI3>3.0.CO;2-V.
Fearnhead P, Donnelly P: Estimating recombination rates from population genetic data. Genetics. 2001, 159: 1299-1318.
Beaumont MA, Zhang W, Balding DJ: Approximate Bayesian computation in population genetics. Genetics. 2002, 162: 2025-2035.
Marjoram P, Molitor J, Plagnov V, Tavaré S: Markov chain Monte Carlo without likelihoods. Proc Natl Acad Sci USA. 2003, 100: 15324-15328. 10.1073/pnas.0306899100.
Gasbarra D, Sillanpää MJ, Arjas E: Backward simulation of ancestors of sampled individuals. Theor Pop Biol. 2005, 67: 75-83. 10.1016/j.tpb.2004.08.003.
Kruglyak L, Lander ES: Faster multipoint linkage analysis using Fourier transforms. J Comput Biol. 1998, 5: 1-7.
Stephens M, Scheet P: Accounting for decay of linkage disequilibrium in haplotype inference and missing-data imputation. Am J Hum Genet. 2005, 76: 449-462. 10.1086/428594.
Lynch M: Estimation of relatedness by DNA fingerprinting. Mol Biol Evol. 1988, 5: 584-599.
Li CC, Weeks DE, Chakravarti A: Similarity of DNA fingerprints due to chance and relatedness. Hum Hered. 1993, 43: 45-52.
Lynch M, Ritland K: Estimation of pairwise relatedness with molecular markers. Genetics. 1999, 152: 1753-1766.
Wang J: An estimator for pairwise relatedness using molecular markers. Genetics. 2002, 160: 1203-1215.
Rousset F: Inbreeding and relatedness coefficients: what do they measure?. Heredity. 2002, 88: 371-380. 10.1038/sj.hdy.6800065.
van der Meulen M, te Meerman GJ: Association and haplotype sharing due to identity by descent, with an application to genetic mapping. Genetic Mapping of Disease Genes. Edited by: Pawlowitzki IH, Edwards JH, Thompson EA. 1997, Academic Press, 115-136.
te Meerman GJ, van der Meulen MA: Genomic sharing surrounding alleles identical by descent: effects of genetic drift and population growth. Genet Epidemiol. 1997, 14: 1125-1130. 10.1002/(SICI)1098-2272(1997)14:6<1125::AID-GEPI94>3.0.CO;2-I.
Beckmann L, Thomas DC, Fischer C, Chang-Claude J: Haplotype sharing analysis using Mantel statistics. Hum Hered. 2005, 59: 67-78. 10.1159/000085221.
Hein J, Schierup MH, Wiuf C: Gene Genealogies, Variation and Evolution: A Primer in Coalescent Theory. 2005, Oxford University Press
Heath SC: Markov chain Monte Carlo methods for radiation hybrid mapping. J Comp Biol. 1997, 4: 505-515.