Estimating genealogies from linked marker data: a Bayesian approach
- Dario Gasbarra^{1},
- Matti Pirinen^{1},
- Mikko J Sillanpää^{1}Email author and
- Elja Arjas^{1, 2}
https://doi.org/10.1186/1471-2105-8-411
© Gasbarra et al; licensee BioMed Central Ltd. 2007
Received: 20 March 2007
Accepted: 25 October 2007
Published: 25 October 2007
Abstract
Background
Answers to several fundamental questions in statistical genetics would ideally require knowledge of the ancestral pedigree and of the gene flow therein. A few examples of such questions are haplotype estimation, relatedness and relationship estimation, gene mapping by combining pedigree and linkage disequilibrium information, and estimation of population structure.
Results
We present a probabilistic method for genealogy reconstruction. Starting with a group of genotyped individuals from some population isolate, we explore the state space of their possible ancestral histories under our Bayesian model by using Markov chain Monte Carlo (MCMC) sampling techniques. The main contribution of our work is the development of sampling algorithms in the resulting vast state space with highly dependent variables. The main drawback is the computational complexity that limits the time horizon within which explicit reconstructions can be carried out in practice.
Conclusion
The estimates for IBD (identity-by-descent) and haplotype distributions are tested in several settings using simulated data. The results appear to be promising for a further development of the method.
Background
There are several fundamental questions in statistical genetics for which the answer would ideally require knowledge of the ancestral pedigree and of the gene flow therein. In practice, however, one will usually have available only partial information from the pedigree, and hardly any information on the accompanying ancestral allelic histories. Examples of intrinsic questions of this kind are: haplotype estimation from pedigree data [1] or from general population samples [2], pairwise estimation of the degree of relatedness between individuals in natural populations [3] or for forensic purposes [4], study of allele-sharing among affected individuals [5], generation of simulated data in a way which is compatible with observed marker genotypes (genotypic elimination; see [6, 7]), estimation of population structure using multilocus genotype data [8–10], estimation of the number of founder chromosomes for given loci [11], gene mapping by combining pedigree and linkage disequilibrium information [12, 13], and tracing genotyping errors in pedigrees [14].
Due to the shared generating process of inheritance pattern on the pedigree, it is not surprising that the current methods for addressing the above questions also have some similarities. One of the similarities is that several methods first attempt to generate an identity-by-descent (IBD) distribution between the individuals in the study sample, although the particular solutions for carrying this out then differ significantly from each other. There are also papers focusing only on the estimation of the IBD-distribution based on known pedigree information [1, 15–17], known haplotypes [13] or known population history [18]. For population based data, there are some approaches that approximate and model the genealogical history of a sample of chromosomes from a population by using ideas from the coalescent theory [19] and its extension incorporating recombinations, expressed in terms of an ancestral recombination graph [20, 21]. Applications of these ideas include haplotyping [22] gene mapping [23–25], estimating population parameters [26], and recombination rates [27]. However, in spite of the progress (see also [28, 29]), the development of effective MCMC sampling methods for ancestral recombination graphs in general has been relatively slow.
This paper extends the genealogy estimation method of Gasbarra et al. [10] to the case of linked markers. Our starting point is a sample of individuals from a natural population, each being genotyped at certain marker loci, but without any direct information on their pedigree or interrelations. Our target is to provide an explicit reconstruction, in terms of a probability model, of the recent history of the genealogy connecting the sampled individuals, conditionally on the observed genotype data and available information on the demography of the population. The model is specified by the following parameters: (1) time in generations since the founding of the population, (2) the marker allele frequencies in the founder population, (3) two mating parameters α to the case of linked markers. Our starting point is a sample of individuals from a natural population, each being genotyped at certain marker loci, but without any direct information on their pedigree or interrelations. Our target is to provide an explicit reconstruction, in terms of a probability model, of the recent history of the genealogy connecting the sampled individuals, conditionally on the observed genotype data and available information on the demography of the population. The model is specified by the following parameters: (1) time in generations since the founding of the population, (2) the marker allele frequencies in the founder population, (3) two mating parameters a and β controlling the mating behaviour [30], (4) the number of males and females in each generation, and (5) the genetic distances on the marker map. Combined with an algorithm for drawing Monte Carlo samples from the conditional distribution of genealogies, this modelling framework can be applied to address, within limits of computation, all the "intrinsic questions" mentioned above.
Methods
Prior distribution on the configuration space
The configuration space of possible ancestral histories (Ω) has three components: the ancestral graph (or pedigree) specifying the relationships between individuals, the paths of alleles of these individuals at the marker loci, and the types of the founder alleles introduced into the ancestral graph via the founder individuals. The probability model on the configuration space is similar to the one described by Gasbarra et al. [10] except that in this study the linkage between marker loci is allowed. Due to the similarities we give here only a brief summary of the model.
Ancestral graph
For pedigrees we use the probability model introduced by Gasbarra et al. [30]. The model considers an isolated population with non-overlapping generations indexed backwards in time by t = 0, 1, ..., T with t = 0 referring to the present and t = T to the founder generation. The population is characterized by four sets of parameters: ${{N}^{\prime}}_{t}$, ${{N}^{\u2033}}_{t}$, α_{ t }and β_{ t }, for t = 1, ..., T. The parameters ${{N}^{\prime}}_{t}$ and ${{N}^{\u2033}}_{t}$ describe respectively the number of males and females belonging to generation t of the population. Parameter α_{ t }controls the differences of reproductive success between males in generation t: large values of α_{ t }imply nearly equal numbers of children for each male whereas with small values of α_{ t }there will be a few dominant males who are mainly responsible for the reproduction. Parameter β_{ t }tunes the degree of monogamy (of males) in generation t: large values of β_{ t }lead to random mating and small values of β_{ t }introduce more permanent family structures into the pedigree. Naturally the roles of males and females can be changed in the model. We denote this probability measure on pedigree graphs by ${P}_{\mathcal{G}}$(·).
Flow of alleles through the ancestral graph
We assume a fixed marker map with L loci, and denote the recombination fractions between loci by ρ = (ρ(l, l') : 1 ≤ l <l' ≤ L). Note that several chromosomes can be modelled simultaneously using the recombination fraction p(l, l') = $\frac{1}{2}$ to indicate that markers l and l' lie in different linkage groups.
By definition, the genome of each individual in the ancestral graph consists of a pair of paternal and maternal haplotypes. The flow of alleles through the pedigree is determined by the grandparental origins which for haplotype i are denoted by ψ_{ i }= (ψ_{ i }(1), ..., ψ_{ i }(L)) ∈ {0, 1}^{ L }. The convention used here is that ψ_{ i }(l) = 0 if the allele at locus l of haplotype i is of grandmaternal origin, and ψ_{ i }(l) = 1 in the case of grandpaternal origin.
If an allele carried by an individual in generation t > 0 is transmitted to some individual in the present generation, we say that the allele is ancestral, and otherwise that it is censored. Since we are actually interested only in the paths of the ancestral alleles we set ψ_{ i }(l) = ∅ if the allele at locus l of haplotype i is censored.
where Λ_{ i }= {l : ψ_{ i }(l) ≠ ∅}, j(ψ_{ i }, l) denotes the last uncensored locus of haplotype i before l, with the convention that j(ψ_{ i }, l) = -∞, if l is the first uncensored locus of the haplotype i, ρ(-∞, l) = $\frac{1}{2}$ and Δ_{ i }(l) = |ψ_{ i }(l) - ψ_{ i }(j(ψ_{ i }, l))| with the convention that ψ_{ i }(-∞) = 0.
Types of founder alleles
where the population genotype frequencies fr(·; l) at each marker locus l are assumed given. (If Hardy-Weinberg equilibrium is assumed, we can use the population allele frequencies instead.) The genotype frequencies are extended to partially or totally censored genotypes in the obvious way. Note that the ordered founder alleles together with the grandparental origins of the nonfounder haplotypes determine the flow of alleles in the pedigree.
Prior distribution
Data and posterior distribution
where $\mathcal{C}$ ⊆ Ω is the set of configurations that are compatible with the observed genotype data.
As it seems impossible to sample independent realizations from the posterior (see Appendix A.1) we shall use a Markov chain Monte Carlo method to perform the computations.
Markov chain Monte Carlo algorithm
The general idea of MCMC methods and the details of our algorithm are given in the appendices, whence here we only sketch the main ideas of devising an efficient proposal distribution for a Metropolis-Hastings algorithm.
In a typical proposal move of our algorithm, a group of children in the ancestral graph try to change one or both of their parents to other possible parents of the population. This is done either by selecting the children uniformly at random from the ancestral graph, or by considering all children of a randomly chosen parent. It is necessary that the proposed new paths for the ancestral alleles carried by the children are compatible with the genes carried by the new ancestors. In order to obtain such a compatible configuration with a reasonable probability, our proposal distribution given in (8) for choosing the new parents takes into account sequentially the children's genotypes and the transmission probabilities of the alleles of prospective candidate parents at all marker loci. After choosing the new parents, we use the transmission probabilities of the new ancestors to resample the paths of the alleles carried by the children (10). Finally the new configuration is accepted or rejected according to the Metropolis-Hastings rule.
In the simplified setting of our earlier work [10] the transmission probabilities were calculated by assuming free recombinations between marker loci. When the markers are linked, the situation becomes much more difficult. Although the proposal based on independent transmission probabilities produces configurations compatible with the genetic data with high probability, a drawback is that typically these configurations contain unrealistically many recombinations. In the case of several tightly linked markers this leads to low recombination likelihood scores, and consequently low acceptance rates and poor mixing of the Markov chain. In order to avoid that, we have to include at least partially the recombination likelihood into the proposal distribution for the allelic paths. This is computationally demanding but possible through the Viterbi-Baum algorithm for hidden Markov models. In Appendix B we describe how to sample the allelic phases of an individual, jointly at all marker loci, by taking into account his/her genotypes, his/her parents' transmission probabilities, and the likelihood of the recombination pattern on the haplotypes of his/her children. This step is used sequentially to generate the new allelic paths.
Using the algorithm of Kruglyak and Lander [31], we also construct a joint sampling distribution for the allelic phases of a group of siblings and of their parents at all marker loci, combining the parental transmission probabilities with the recombination likelihood on the haplotypes of the children (see Appendix A.7.2).
The computational complexity of these sampling steps grows linearly in the number of markers, suggesting that it is not an unrealistic task to handle hundreds of linked markers.
Results
The performance of the method was tested on two simulated data sets. The first data set was designed specifically to give information on the method's performance in the problems of haplotyping and relatedness estimation. For the former, comparisons were done with corresponding results obtained with PHASE [32] and for the latter with three existing moment estimators [33–36]. The second data set was generated using concepts from gene mapping. The purpose was to evaluate the potential advantages of using the IBD-information produced by our method over simple IBS-sharing statistics.
Haplotyping and relatedness estimation
Simulated data
The gene flow on the pedigree was simulated at 20 linked marker loci. All markers were polymorphic with 10 equally frequent alleles at the population level and the neighbouring loci were separated by the recombination fraction of 0.05. The simulation of genetic data was accomplished by sampling the founder alleles from the population allele frequencies and dropping them down through the pedigree in accordance with the recombination model.
Reconstruction
Our task was to provide possible reconstructions of the simulated pedigree and corresponding gene flow using only the unordered genotype data on the youngest generation but having no information on the underlying pedigree structure. The marker map, the population allele frequencies and the size of the population were assumed known in the reconstruction i.e. they were given the same values as in the data simulation. The mating parameters α and β can be estimated by a maximum-likelihood estimator [30] from observed family structures if the size of the base population is known. We used this approach to estimate β on the basis of the family structures between generations 1 and 2 in the simulated pedigree (here 0 refers to the youngest generation), resulting in the value β = 4 × 10^{-4}. For α we used generation dependent values α_{ t }= β${{N}^{\u2033}}_{t}$, where ${{N}^{\u2033}}_{t}$ was the number of females belonging to generation t = 1, ..., 9. We ran five independent sample chains, each with a different seed for the random number generator. One of the chains was extended to 1, 000, 000 iterations whereas the other four were stopped after 500, 000 iterations. The longer run took about 10 days on a Pentium-4 2.8 GHz processor. The results for haplotyping and IBD-analyses were saved from every tenth iteration. The monitored statistics behaved very similarly across the different runs suggesting that with these data the method performs consistently regardless of the initial state. To compare the performance of the present model with our earlier model that assumed unlinked markers [10] the corresponding runs were also conducted with the simpler model.
Haplotyping
If there are h ≥ 1 heterozygous loci in the multilocus genotype, then there are 2^{h-1}different ways to do the haplotype assignment. Note that here we do not distinguish between the parental (paternal/maternal) origins of the haplotypes, which would further increase the number of different assignments to 2^{ h }. There is only a single correct haplotype configuration, and to measure how much our estimates deviate from it we use the concept of switch distance [2]. We say that two adjacent heterozygous loci are correctly (incorrectly) phased if the corresponding two locus haplotypes are correct (incorrect). The switch distance of the haplotype assignment is defined as the number of incorrectly phased adjacent heterozygous loci. The maximum switch distance of a haplotype configuration is one less than the number of heterozygous loci, and it is zero only for the correct configuration. For example, if the true haplotypes are (111111, 222222) then the switch distance of pair (112222, 221111) is 1 and that of pair (121212, 212121) is 5.
There are two kinds of haplotyping software currently available, namely pedigree-based and population-based methods. In actual fact, our data contain some close relatives, but since no relationship information is assumed to be available we are working with a general population sample. For a comparison we analysed the data with the software PHASE (v.2.1) [32]. Usually PHASE and other population-based haplotyping methods are applied to much denser marker maps where recombinations (per meiosis) are rare and the haplotypes are very likely to be inherited as single units through tens of generations. This is not the case here, but since the current version of PHASE is widely used and takes into account also recombinations, we chose it as the reference method among the existing algorithms.
We considered three different chain lengths for PHASE (100 (default), 500, and 1000 iterations) and then ran ten independent test runs for each with different choices of the seed of the random number generator. The burn-in part and the level of thinning were kept at their default values of 100 and 1 iterations, respectively.
PHASE can be requested to give the posterior probability distribution of all possible haplotypes for each individual. We measured the accuracy of these estimates by calculating the expected switch distance for each individual with respect to his/her posterior distribution of the haplotype configurations. The sums of the switch distances over all 39 individuals, averaged over ten independent runs, were 80.1, 76.6 and 76.7 for the chains of length 100, 500 and 1000 iterations, respectively. Since increasing the number of iterations from 500 to 1000 did not seem to enhance the results, we did not try to run PHASE longer. The best value PHASE gave (76.6) is shown with a dotted line in Figure 2. It seems that, at least on these data, PHASE and our algorithm have quite comparable accuracy in estimating the haplotype configurations, even though the underlying models are very different.
A clear advantage of PHASE over our method is in its speed: a single run takes only a couple of minutes whereas our algorithm was run for several days. One of the reasons is that we are modelling explicitly the whole genealogy, not just the haplotypes, and are thus able to address some other questions with the same effort. In the future we could also try using PHASE to give an initial haplotype configuration for our method.
Relatedness estimation
Here we consider relatedness estimation in a similar way as Gasbarra et al. [10]. Two alleles are said to be identical-by-descent (IBD) if they descend from the same ancestral allele within the pedigree. Note that two alleles may be identical-by-state (IBS), i.e. represent the same allelic form, without being IBD if two or more founder alleles happen to be of the same type. The concept of IBD thus indicates whether two contemporary alleles descended from a common ancestor that had existed since the founder generation, but it does not estimate their possible coalescent times more accurately. However, if needed, we could also capture the exact time (in generations) of these coalescing events.
In order to quantify the relatedness we denote by r_{ ij }(l) the probability that a randomly chosen allele from locus l of individual i has an IBD-copy in individual j. For individuals i and j we define the locus-specific relatedness coefficients R_{ ij }(l) = $\frac{1}{2}$(r_{ ij }(l) + r_{ ji }(l)) and the genome-level relatedness coefficient ${R}_{ij}=\frac{1}{L}{\displaystyle {\sum}_{l=1}^{L}{R}_{ij}(l)}$. Note that in the presence of inbreeding we can have r_{ ij }(l) ≠ r_{ ji }(l). However, always R_{ ij }(l) = R_{ ji }(l) and R_{ ij }= R_{ ji }.
As our input data contain no pedigree information, it seems that other methods available for IBD-estimation from such data are based on different formulas that combine the IBS-status of the markers and the known population allele frequencies to an estimate of the IBD-probability (usually R_{ ij }). We have compared the estimates given by our algorithm with three such moment estimators described by Lynch [33] and Li et al. [34] (LL), Lynch and Ritland [35] (LR) and Wang [36] (W).
These three methods assume unlinked loci and then combine the locus-specific results according to some weighting schemes in order to obtain estimates of the genome-level relatedness coefficients. The derivations of both LR and W are also based on the assumption of no inbreeding. Since our data violate these assumptions some additional error may be caused to the moment estimators. Moreover, it is questionable whether these moment estimators actually answer the exact question of IBD-sharing when restricted to the latest ten generations (see also [37]) as their estimates are relative to the base population defined by the allele frequencies and no exact reference point of IBD-sharing can be specified (like the founder generation in our example). On the other hand, polymorphic data sets like the one used here are advantageous for the moment estimators since in these cases IBS-sharing gives already a fairly good approximation of the actual IBD-sharing.
Gene mapping
In this example we applied concepts from gene mapping to further monitor the accuracy of our reconstruction algorithm. We simulated a monogenic trait with a dominant mode of inheritance, and then investigated whether we can trace the position of a trait locus, relative to a set of marker loci, by considering suitable allele or haplotype sharing statistics. The purpose of the example is to compare the estimated IBD and haplotyping results to the simulated "true" ones and to the plain IBS data, whereas a proper extension of the model to gene mapping will be considered elsewhere (see also Discussion).
Simulated data
We considered a population that has grown exponentially by a factor of 1.2 during the 9 most recent generations. The founder level (assumed to be in Hardy-Weinberg and linkage equilibrium) was taken to be the 19th generation (backwards in time). The population was postulated to have maintained a constant size of 200 individuals between the 9th and the 19th generations. The method of Gasbarra et al. [30] was used to simulate a 20-generation pedigree from this population, with 400 individuals at the current (0th) generation. The parameter α was set to 10.0 in order to decrease the relatedness of the individuals belonging to the current generation. The monogamy parameter β was set to 0.001. As a result the pedigree contained 4815 individuals of whom 120 were founders.
A gene flow was simulated on the marker map containing 14 microsatellite markers separated by the recombination fraction of 0.10, and 26 SNP markers located in such a way that between any two adjacent microsatellites there were two SNPs evenly spaced with respect to genetic distance. The allele frequencies were sampled from the Dirichlet distribution with all parameters equal to 1. For microsatellites there were 10 different alleles and the SNPs were biallelic at the founder level.
Having simulated the pedigree and the gene flow we fixed an additional trait locus half way between (SNP) markers 20 and 21 and simulated the segregation of the founder alleles at that locus in accordance with the inheritance patterns of the flanking markers and the Haldane recombination model. We chose one particular founder allele at the trait locus and collected all of its 44 carriers from the current generation to form our sample.
Reconstruction
The question was whether we could spot the trait locus by comparing the values of an IBD-sharing statistic of the sampled individuals in different marker loci. The idea is that the carriers should share more alleles IBD near the trait locus than elsewhere on the chromosome. For computational reasons, and also in order to violate the assumption of the founder generation being in exact Hardy-Weinberg and linkage equilibria, the time horizon was in the reconstruction set to 9 generations instead of 19 that was used in the simulation. Considering the history only 9 generations backwards is likely to produce challenges for the reconstruction, since in the simulated data there were altogether six different copies of the trait allele at the 9th generation, all of whom had descendants among our sample. On the other hand, when extending the analysis for tens of generations backwards in time the exact number of generations being considered is likely to become less important, and also it is less likely that we could find out the exact generation in which each coalescing event actually occurred.
The population allelic frequencies and the population size were considered known. For β the previously estimated value 4 × 10^{-4} was used and again α_{ t }= β${{N}^{\u2033}}_{t}$ where ${{N}^{\u2033}}_{t}$ was the number of females belonging to generation t, t = 1, ..., 9 (the current generation has index 0). The algorithm was run for 100, 000 iterations, which took about two days on a Pentium-4 2.8 GHz processor. The results were saved from every tenth iteration and averaged over two independent runs.
Results
It seems that generally the number of different founder alleles in the reconstruction is more similar to the original situation in the 19th generation than to that in the 9th generation, even though the reconstruction was actually done for 9 generations. This suggests that, as we reconstruct more generations backwards in time, the assumption of the founder generation being in Hardy-Weinberg and linkage equilibria may become more important than the actual number of generations considered. In other words, the algorithm may try to squeeze the original pedigree and gene flow to the given time horizon. On the other hand, it is unlikely that the postulated genetic equilibria would hold to a very close approximation among a set of founders of a population isolate.
where n = |$\mathcal{S}$| (see also the statistics in [38–40]).
The questions that remain can now be stated as follows: (i) How much of the complete information do the mere genotype data on the youngest generation contain, and (ii) how strong a signal one can expect to get from this kind of data with any computational method? We use a long chromosomal segment (1.45 Morgans) with a quite sparse marker map (over 3 cM between adjacent loci), as we are interested in the recombination process. Difficulties with these data may arise because recombinations are frequent and mix up shorter segments in various ways. In addition, the sampled individuals are more distantly related to each other in this example than in the previous one, as can be seen from the average values of R_{ ij }with respect to the 9th generation, (0.05 and 0.19, respectively, in this and in the previous example).
Haplotyping is easier for our method when there are siblings or other close relatives in the sample, since their data tend to cluster to the same families also in the reconstruction and thus increase locally the amount of information on those parts of the pedigree. This can be seen, for example, in Figure 4 where the estimates of IBD-sharing are very close to the true values for siblings (the leftmost pictures) and somewhat less accurate for more distant relatives (the rightmost pictures). Some further enhancements in the estimates might also be achieved if we knew some parts of the pedigree and were able to utilize this information in our algorithm. Such possibilities will be considered in our future work.
On the other hand, we cannot completely rule out the possibility that the relatively weak signals in haplotype sharing would be influenced by the slow mixing of the MCMC sampler.
Discussion
The main motivation leading to this study was to provide a common methodological basis for considering a number of inter-related fundamental questions in statistical genetics, by extending our earlier work [10] to linked marker data. Our approach can be seen to complement the methods that are used in the reconstruction of coalescents. In particular, both approaches start from DNA samples taken from present day individuals and then make an attempt to trace back their common genetic origins. In the original coalescent analysis the main focus is on attempts to reconstruct the underlying evolutionary history, driven by mutations [19]. To do this, one proceeds backwards along germ lines looking for Most Recent Common Ancestors (MRCA's), every time coalescing two lines when a common ancestor for them is postulated, and finally ending with the root of the resulting tree structure. Later the coalescent theory has been extended in several directions [41], most notably to ancestral recombination graphs (ARG's) [20, 21] that include the recombination process. Common to the coalescent theory and its extensions is that the relevant time scale for them is usually of the order of thousands of generations or even much longer. On balance, generally one then considers only relatively short aligned sequences of DNA at a time. The present method is analogous to the search for MRCA's in that it, too, can be seen as a search for chromosomal areas shared by some individuals in the sample, and is carried out by sampling explicit hypothetical reconstructions of the past. Here, however, the genealogies are assumed to be driven by mating and meioses and the effect of mutations is ignored. Also the time scale in which these reconstructions are carried out is very much shorter, of the order of tens of generations. But then they are carried out by jointly considering marker loci that cover much wider chromosomal areas such as whole chromosomes, or even the entire genome. Note also that our model builds on an explicit consideration of diploid chromosomes, which is a source of substantial technical complications in the computations. Thus, in trying to resolve questions concerning shared ancestral origins of the marker alleles, we also allow for the possibility of inbreeding, that is, loops in the ancestral graph. The original coalescent process considers the genealogy of a random sample of genes from a very large and randomly mating population whence the sample size has a large effect on the coalescent time. In contrast, if we consider closely related or ascertained (for some phenotype) individuals in an effectively smaller population, then even a small sample is likely to find common ancestry within our framework. Thus the main factor in determining the rate of coalescences in our method is the relatedness structure that the data contain, not necessarily the number of sampled individuals in the data. Additionally, we emphasize that in our model (as in ARG's) the coalescences may involve only small parts of the haplotypes and thus the coalecences are not tied to the number of individuals in the first place. Namely, after a few generations backwards in time, the haplotypes of any sampled individual may have split into tens of parts, each having its own genealogical tree.
Reconstructing plausible pedigrees of the sampled individuals conditionally on the observed marker data always requires some knowledge about the recent history of the population. Here such information is provided in the form of postulated parameter values for controlling population growth and mating, as well as assuming that the members of the founder generation are in linkage and Hardy-Weinberg equilibrium. Unrelated founders in linkage equilibrium seems to be a common assumption in pedigree analyses. This assumption is likely to be unrealistic especially when it is applied to small pedigrees whose founders lie only a few generations backwards in time (as in traditional linkage analysis). On the other hand, extending the founder level to some tens of generations farther back in time, as is done here, allows more realistic modelling of the relatedness structure within a few of the most recent generations (i.e. within those generations that contain the sampled individuals).
When assessing the usefulness of our approach in practical applications based on real data it will be important to assess the sensitivity of the results to variations in the tuning parameters. In the gene mapping example one may also enquire how strongly the results would depend on the number of generations that are considered in the reconstruction. As noted in [10] the relation between the concept of generation in the model and its counterpart in the real population is not straightforward and varies as a function of marker data and parameter values.
Our numerical examples illustrated how one can usefully summarize the relevant posterior information contained in an MCMC sample of ancestral graphs by considering certain statistics of interest, such as those describing the relatedness between a pair of individuals at different marker loci. One can then think of the sampled pedigrees as being merely vehicles that alleles need to find their way through the pedigree from the founders to the study sample, or as nuisance parameters that will ultimately be integrated out from the results. In view of the enormous size of the sample space of ancestral graphs, it is the relative robustness of these summary statistics to the exact pedigree and gene flow information which makes our approach based on MCMC sampling at all feasible.
Our approach becomes soon computationally infeasible as the number of generations in the reconstruction increases since the task of finding suitable proposals requires computations whose complexity grows rapidly with an increasing depth of the pedigree. In our numerical examples we used a single desk computer. However, more computational power is needed to handle larger data sets and/or denser marker panels (e.g. SNP data) in a reasonable time. We have sketched a tempering version of the algorithm in which several chains run in parallel (see also [42]). The idea there is to improve the mixing of the sampler by giving each chain its own "temperature" that then controls the weight of each recombination likelihood in the acceptance ratios of the Metropolis-Hastings updates. The higher the temperature, the more easily the proposals are accepted, but the results are monitored only for the chain at the lowest temperature where the recombination likelihood is not relaxed at all. After a certain prespecified number of iterations the chains at neighbouring temperatures compare their configurations in terms of their respective likelihood values and then may swap the temperatures according to a suitable rule. The idea is that if some chain finds a good configuration, this configuration can move gradually towards the lowest temperature from which the results are then collected.
to the choice ${X}_{t}^{(i)}={\widehat{X}}_{i}^{(j)}$. In this way the N chains interact and the particle system explores the target distribution's landscape more efficiently than a system of N independent Metropolis chains based on the same proposal kernel q(x → y). Although the N-fold product distribution π^{⊗N}is not the invariant distribution of the N-particle system, it has been shown that as N → ∞ and t → ∞, the empirical distribution $\left({N}^{-1}{\displaystyle \sum _{i=1}^{N}{\delta}_{{X}_{t}^{(i)}}}\right)$ converges to the target distribution π.
We conclude by some remarks concerning gene mapping. Often data used in genetic mapping studies consist of a number of nuclear or extended families, each formed by first ascertaining an affected individual (proband) and then collecting marker and/or phenotype data on close relatives of the proband. If this is the case, it is natural to make use of the known family structure and of the marker data that may be available, for example, from the siblings and the parents of the proband. Considering the gene flow within each such small known pedigree, and making use of this information in a phenotype or penetrance model, will then correspond to "ordinary" linkage or QTL-analysis. However, particularly in data sets in which all probands are collected from a genetic isolate, as is often the case, also these small pedigrees can be assumed to share, in the sense of IBD, some part of their ancestry. In such cases, there is growing interest in the current literature in modelling the relatedness between founder individuals (e.g., [12, 13, 44]). In the currently existing approaches the recent shared ancestry is modelled only at the putative QTL-position or estimated separately for each marker/QTL (based on flanking marker loci), whereas here we consider this question jointly at all marker loci. Note also that, by making use of the Haldane map function, this gives us a handle for doing the same, in terms of probabilities, on the intervals between flanking markers. In a near future our plan is to modify the present reconstruction method in a way which allows us to fix the known parts of the pedigree and the corresponding marker information to the extent in which it is known, and then apply the reconstruction algorithm for building "bridges between these islands".
It seems likely that no single method can perform equally well on the whole spectrum of different types of genetic data currently available. Indeed, interplay of several methodological approaches will be crucial during the future gene mapping studies. The main role of our method may be in the initial stage of a genome-wide mapping project when interesting regions are sought using a marker spacing that is measured in centimorgans and the pedigree records are not complete. In the study of complex disease traits our method can be applied to estimate the haplotypes and/or relatedness structure which can then be used as input parameters for subsequent QTL or association mapping (e.g. [45]). In order to provide a more systematic approach to this problem, we are currently planning to build an even larger Bayesian model which would allow us to combine these two stages of analysis. This would involve expanding the present method further by adding to the model one more layer of hierarchy corresponding to the underlying genetic architecture of the trait [45–48]. In such a large integrated method, generation of IBD and haplotype distributions as well as screening of QTL-positions could all be performed as parts of a joint analysis. At least in principle, the number of contributing loci and their positions, the size of their effects, interactions within and between genes and environmental factors, as well as the mode of a gene action, could be analysed by such a method. Of course, the computational requirements for this kind of gigantic model are even larger than for the current method, and the practical implementation will be a major computational challenge.
Conclusion
We have implemented an algorithm for analyzing recent history of linked multilocus genotype data sampled from an isolated diploid population. We model the paths of the observed alleles through tens of generations by explicitly including all ancestral individuals and corresponding meioses into the possible ancestral configurations. Thus we are extending the methods that estimate gene flows on fixed pedigrees to the case where also the pedigrees need to be estimated.
We have tested the method on the problems of haplotyping and IBD-estimation. In both cases the method performs well compared to some widely used existing methods. We have also illustrated how our estimates for IBD-sharing are more informative than a simple IBS-sharing statistic on a tentative example on gene mapping.
Our experiences reported here and in [10] encourage us to develop the method further. Indeed, the current version of the method can be seen as a general tool for estimating genealogical relationships between sample units. In more complex applications, such as gene mapping, it can serve as a basis for extended models.
Appendices
In the following two appendices we describe the MCMC algorithm in more detail.
Appendix A first discusses why we are not able to sample from the posterior without MCMC. Then we briefly summarize the theoretical background of the MCMC methods, describe a way to generate an initial configuration for the chain, and finally explain the block-updates that are used to propose new states for the chain.
Appendix B introduces an algorithm to sample the phases of the parent's genotype given partially observed haplotypes of his/her children. It is used in our block-updates described in Appendix A, but it may also turn out to be useful in other settings.
We use the notation introduced in the Methods section. In addition we index the individuals in the pedigree starting from the present generation and assign labels 2k - 1 and 2k respectively to the paternal and maternal haplotype of individual k. If the allele at locus l of haplotype i is ancestral, we denote its type by h_{ i }(l) ∈ E_{ l }, otherwise setting h_{ i }(l) = ∅, where E_{ l }is the set of alleles at locus l. Thus for each individual k and locus l, g_{ k }(l) = {h_{2k-1}(l), h_{2k}(l)}.
A Sampling from the posterior distribution
A.1 Possibility of sampling without Markov chain Monte Carlo
We want to study the posterior distribution (1) by using a Monte Carlo method, so we need an algorithm for sampling random realizations from the posterior. In a way, constructing such an algorithm corresponds to time-reversing a Markov process in a discrete state space, where ideally we could write down the generator of the reversed dynamics jointly for the ancestral graphs and the allelic paths, and the posterior distribution of the haplotypes in the present generation given the data, and then sample directly independent realizations of the backward process. However, the number of terms involved in such summations grows extremely rapidly with the sample size n(0), the time horizon T and the number of loci, whence the computation of the reverse generator is not feasible in the real-life data problems we have in mind. Next we briefly examine some alternative ideas.
First, we could proceed naively, by sampling repeatedly from the prior until we obtain a realization ω ∈ $\mathcal{C}$. This is problematic when π($\mathcal{C}$) is very small, as is the case here, and on average it takes far too many attempts to obtain even one realization from $\mathcal{C}$.
Alternatively, since Ω is a finite set for any given (n(0), T, L), we could compute and draw samples directly from the posterior just by summing π(ω)1(ω ∈ $\mathcal{C}$) over ω ∈ Ω. Again, this is not feasible in practice when |Ω| is very large.
where the genotypes of the founders are determined by the triple (Y(l), ϕ(l), ψ(l)).
By using the Viterbi algorithm, it is possible to sample directly the random vector ((ψ(l), ϕ(l)) : l = 1, ..., L) conditionally on the data Y. Kruglyak and Lander [31] proposed an efficient implementation of the Viterbi algorithm using Fourier transforms on the commutative group {0, 1}^{ d }(see also section A.7.2). The Viterbi algorithm could also be used to integrate out the allelic paths and to obtain the marginal likelihood P(Y(1), ..., Y(L)|G) of the data set Y given the ancestral graph G. We would then be left with the problem of sampling from a posterior distribution on the finite space $\mathcal{G}$ of ancestral graphs with n(0) roots spanning T generations backwards in time, with probabilities proportional to
P(G) × P(Y(1), ..., Y(L)|G).
There are two problems in this approach, however: Firstly, the Kruglyak-Lander algorithm can be implemented only for small values of d, say d ≃ 20 at most. Secondly, the number of possible ancestral graphs grows extremely rapidly with n(0) and T.
In summary, the direct sampling methods described above will work only for small ancestral graphs, and next we shall describe the general ideas of Markov chain Monte Carlo methods that can yield approximate results also in more complex settings.
A.2 Metropolis-Hastings algorithm
with probability one. This is a useful result when we would like to approximate numerically the expectation on the right hand side, but there is no practical algorithm producing i.i.d. realizations from μ(z). The choice of the proposal distribution Q(z → $\overline{z}$) determines the mixing properties of the Metropolis chain (Z_{ n }). If mixing is too slow, the Metropolis chain is useless for Monte Carlo computations.
where the mixing distribution is allowed to depend on the current state z, and the corresponding transition kernel is computed by using (2) and (3).
A.2.1 Gibbs' updates
and we obtain another Gibbs' update by inverting the roles of z' and z". The corresponding acceptance probability satisfies a((z', z") → (${\overline{z}}^{\prime},{\overline{z}}^{\u2033}$)) ≡ 1. However, it is not always the case that we can use the Gibbs update for a given decomposition of the state space, since this requires direct sampling from the conditional distribution μ(z"|Z' = z').
A.3 MCMC with auxiliary variables
Here we explain a procedure which is used frequently in the MCMC-literature (see Appendix 2 in [50]). Suppose we have constructed a Markov chain (Z_{ n }) on the (finite) state space $\mathcal{Z}$ with a given equilibrium distribution π(z). Consider an enlarged state space $\tilde{\mathcal{Z}}=(\mathcal{Z}\times \mathcal{Y})$, together with a stochastic kernel p(y|z) : $\mathcal{Y}\times \mathcal{Z}$ → [0, 1]. Define the probability measure $\tilde{\pi}$ (z, y) = π(z)p(y|z). The idea is to explore the marginal distribution $\pi (z)={\displaystyle {\sum}_{y\in \mathcal{Y}}\tilde{\pi}(z,y)}$ by constructing an ergodic Markov chain ($\tilde{Z}$_{ n }) on the enlarged state space $\tilde{\mathcal{Z}}$ with equilibrium distribution $\tilde{\pi}$(z, y). In what follows, we assume that for every z ∈ $\mathcal{Z}$ we have an algorithm to sample a random realization of Y from p(y|z) and a numerical procedure to compute this conditional distribution.
which will automatically be reversible w.r.t $\tilde{\pi}$(z, y). In particular, the transition kernel $K((z,y)\to (\overline{z},\overline{y}))=1(\overline{z}=z)p(\overline{y}|\overline{z})$ is reversible w.r.t $\tilde{\pi}$(z, y).
- (i)
given the previous state z_{ n }, sample y_{ n }~ p(y_{ n }|z_{ n });
- (ii)
sample (z', y') ~ $\tilde{Q}$((z_{ n }, y_{ n }) → (z', y'));
- (iii)
with probability a((z_{ n }, y_{ n }) → (z', y')) take Z_{n+1}= z', otherwise Z_{n+1}= z_{ n }. Forget y_{ n }and y'.
However, the computation of this transition distribution requires an extra summation step and there are situations in which we cannot afford using this direct proposal distribution in the Metropolis step.
A.4 Constructing an initial configuration for the Markov chain
Now we return to our application. By using the Metropolis-Hastings algorithm, we shall construct a Markov chain on the configuration space Ω with invariant distribution equal to the constrained distribution (1).
Before entering that topic, however, we need to construct a configuration ω_{0} ∈ $\mathcal{C}$ serving as an initial state for the Metropolis algorithm. In other words, we need to find an ancestral graph and corresponding gene flow variables that are logically consistent with the data. Since this is a nontrivial problem, we describe the procedure in detail. Later we use a similar construction to obtain a proposal distribution for the Metropolis algorithm.
We start from generation 0 with n(0) sampled individuals and their (partially) observed genotypes (g_{ k }(l) : k = 1, ..., n(0), l = 1, ..., L). Sequentially, following a uniformly distributed permutation, these individuals will choose their parents from generation 1 and transmit their genes to the chosen parents. Denote by {X_{ k }= (f, m)} the event that child k has chosen f as the father and m as the mother from amongst ${{N}^{\prime}}_{1}$ possible fathers and ${{N}^{\u2033}}_{1}$ possible mothers in the population. To start the construction, suppose that the first child chooses the first father f and the first mother m from the population, and transmits his/her alleles to these parents. For a generic locus l, let g_{1}(l) = {a, b} be the genotype of the first child. With probability $\frac{1}{2}$ we set h_{1}(l) = a, h_{2}(l) = b, g_{ f }(l) = {a, ∅} and g_{ m }(l) = {b, ∅}, and otherwise we set h_{1}(l) = b, h_{2}(l) = a, g_{ f }(l) = {b, ∅} and g_{ m }(l) = {a, ∅}. Note that this determines only the haplotypes of the first child, and partially the genotypes of his/her parents, but does not give any information about the pattern of meioises that led to the haplotype of the child.
Here the term P(X_{ k }= (f, m)|X_{1}, ..., X_{k-1}) is the contribution from the prior distribution of the ancestral graph, g_{ k }(l) is the observed genotype (at locus l) of child k, and g_{ f }(l) and g_{ m }(l) are the current values of the partially determined genotypes of the parents. Finally, the transmission likelihood P(g_{ k }(l)|g_{ f }(l), g_{ m }(l)) is defined as follows under the assumption of free recombination.
and then set, according to this probability, h_{2k-1}(l) = x and h_{2k}(l) = y, and otherwise h_{2k-1}(l) = y and h_{2k}(l) = x.
and otherwise leave g_{ f }(l) = {a, ∅}. After having completed this step for all the individuals in generation 0, we have determined their haplotypes and also partially the genotypes of their parents.
For the induction step, we assume that we have followed the procedure for t - 1 generations. and we now describe the procedure for generation t <T. The individuals in generation t have to choose parents from generation (t + 1) and transmit their ancestral alleles to these parents. Note that the situation in generation t > 0 differs from the situation in generation 0, since not all alleles are necessarily ancestral. Let g_{ k }(l) = {a, b} be the genotype of individual k in generation t. He or she will choose parents according to the distribution given in (8). In case a, b ∈ E_{ l }, that is, both alleles are ancestral, we proceed as in the case t = 0. Otherwise, however, we need to specify the probabilities P(g_{ k }(l)|g_{ f }(l), g_{ m }(l)) also in cases in which g_{ k }(l) is censored or partially censored.
After this, it remains to update the genotype of the chosen parent by transmitting the allele x. This is done in exactly the same way as for the alleles in generation 0.
- (i)
If h_{2k}(l) = ∅, we set ψ_{2k}(l) = ∅.
- (ii)
Let h_{2k}(l) = x ∈ E_{ l }, and let the alleles of the mother, respectively of grandpaternal and grandmaternal origin, be h_{2m-1}(l) = a and h_{2m}(l) = b, a, b ∈ E_{ l }∪ {∅}. Note that (a, b) ≠ (∅, ∅), since either (a = x) or (b = x). If a ≠ b, the grandparental origin of the ancestral allele x is determined as ψ_{2k}(l) = 1(x = a). The grandparental origin of x remains undetermined only when a = b = x ∈ E_{ l }. In such a case we have to sample simultaneously the grandparental origins of all ancestral alleles at locus l in generation t - 1 that are inherited from the mother m. In order to do so, suppose that k_{1}, ..., k_{ n }are the children of mother m and h_{2m-1}(l) = h_{2m}(l) = ${h}_{2{k}_{1}}(l)=\mathrm{...}={h}_{2{k}_{n}}(l)$ = x ∈ E_{ l }. Note that since the mother has two ancestral alleles, necessarily n ≥ 2, and we may exclude the event $I=\{{\psi}_{2{k}_{1}}(l)={\psi}_{2{k}_{2}}(l)=\mathrm{...}={\psi}_{2{k}_{n}}(l)\}$.
Conditioning on the complementary event I^{ c }is equivalent to the following procedure: We sample without replacement two children, say $\widehat{k}$ and $\overline{k}$, among {k_{1}, ..., k_{ n }}, and assign ${\psi}_{2\widehat{k}}$(l) = 0 and ${\psi}_{2\overline{k}}$(l) = 0. Given this the grandparental origins of the remaining children are conditionally independent Bernoulli ($\frac{1}{2}$) random variables.
We iterate the above procedure backwards in time until we reach the founder generation, where we determine, by tossing fair coins, the parental origins of the founders' ancestral genotypes. Given that, we compute the meiosis pattern on the haplotypes in generation (T - 1) as described above.
Remarks
- (1)
Although at a first reading these sampling formulae may not be completely obvious, the idea behind the sequential scheme is simple: a child chooses his/her parents from the population and transmits his/her alleles to the parents according to Bayes' formula, by conditioning on his/her own alleles as well as on the parental choices and allele transmissions of the previous children. Note, however, that the procedure is not fully Bayesian, since the conditioning does not include at every intermediate stage the information about the genotypes of the children still in the list, and that we are not taking into account the true recombination likelihood. Therefore the resulting configuration is only compatible with the data, but not an exact sample from the posterior distribution.
- (2)
An alternative way to proceed would be to first let all children from the considered generation choose their parents, and then sample jointly the parental phases of the alleles of the children given the family structure, and finally transmit the alleles to the parents. We develop this idea later in section A.7, as we construct a block-update for the Metropolis-Hastings algorithm.
- (3)
If the population of candidate parents is small, it is possible that, when a child k is choosing his/her parents, the genotypes of the possible parents are already partially determined by the alleles of the previous (k - 1) children in such a way that every possible parental choice is logically incompatible with the genotype of child k. When such a contradiction is found, we have to restart from the beginning, or at least from one generation back. If the algorithm keeps failing, we may need to increase the size of the population in the prior.
A.4.1 Incorporating the true recombination likelihood
In the construction of an initial configuration we have used the model with free recombination. In the case of closely linked markers, the resulting initial configuration will be compatible with the data, but it will not look realistic, since most likely it will contain too many recombinations.
It is shown in Appendix B, how to apply the Viterbi algorithm to sample the parental phase vector ϕ_{ k }= (ϕ_{ k }(1), ..., ϕ_{ k }(L)) of an individual k, say for a female, in generation t > 0, from a joint conditional distribution, where we condition on her partially observed genotypes (g_{ k }(1), ..., g_{ k }(L)), on the partially observed genotypes (g_{ f }(1), ..., g_{ f }(L)) and (g_{ m }(1), ..., g_{ m }(L)) of her parents f and m in generation (t + 1), and on the partially observed haplotypes {(${h}_{2k(j)}(1),\mathrm{...},{h}_{2k(j)}(L))$ : j = 1, ..., n} which she has transmitted to her children k(1), ..., k(n) in generation (t - 1). We can then generally improve on the initial configuration by substituting the sampling distribution (10) or (11) by this joint conditional distribution that takes into account the true recombination likelihood of the haplotypes of the children. It is also shown in Appendix B, how to integrate out the phase vector ϕ_{ k }, in order to compute the marginal likelihood of the partially observed haplotypes of the children. We could compute this marginal likelihood for all logically compatible choices of pairs of grandparents and include it as a new factor in expression (8) which is proportional to the probability of choosing a pair of grandparents. Alternatively, we could choose the grandparents using expression (8) with free recombination, and then use the true recombination likelihood only for assigning the parental origins to the genes of the parent.
A.5 Block-updates for the Metropolis-Hastings algorithm
Next we discuss the construction of a proposal distribution Q(ω → $\overline{\omega}$) for the Metropolis-Hastings algorithm on the configuration space Ω. Ideally, we would like to propose only configurations that are compatible with the data, or at least we want that Q(ω → $\mathcal{C}$) is not too small when starting from some ω ∈ $\mathcal{C}$. The resulting Markov chain should also be irreducible, that is, it should be able to reach every state in $\mathcal{C}$ with positive probability regardless of where it started from. Single-site updates, like changing the phase of one allele at a time, are not enough to produce an irreducible chain. Larger block-updates are needed also because changes to the ancestral graph usually require simultaneous changes to the allelic paths.
Our aim is to construct a block-proposal distribution by applying locally similar ideas that were used in generating an initial configuration. Starting from some configuration ω ∈ $\mathcal{C}$, a selected group of individuals in the ancestral graph try to choose new parents and redirect their ancestral alleles from the "old ancestors" to the new ones. This is done by conditioning on the paths of the ancestral alleles of the other individuals, which are kept fixed during the block-update step. Before making these points more precise, we introduce some more notation.
A.5.1 Conditional probabilities for the types of non-ancestral alleles
Otherwise individual k has at least one non-ancestral allele at this locus, and in order to compute fr(a|k, ω), we must take into account all possible ways in which he/she may have inherited the non-ancestral allele(s) from his/her ancestors.
We also need the joint transmission probabilities for pairs of alleles. We denote by fr(a, a'|k, k', ω) the conditional probability that individual k transmits an allele of type a to his/her hypothetical new child and simultaneously individual k' transmits an allele of type a' to his/her hypothetical new child. Here a, a' ∈ E_{ l }, and we let the indexes k, k' go through all individuals in a given generation. If at least one of the genotypes g_{ k }or g_{ k' }is fully ancestral,
fr(a, a'|k, k', ω) = fr(a|k, ω)fr(a'|k', ω).
However, this equation may not hold, if in configuration ω both k and k' carry some non-ancestral allele at locus l, and there is a common ancestor of k and k' who has had a positive probability of transmitting the same non-ancestral allele to both k and k'.
We compute these transmission probabilities recursively, starting from the founders, by setting
fr(a|k, ω) = fr(a|g_{ k }),
where g_{ k }= {${g}_{k}^{0},{g}_{k}^{1}$} ⊆ E_{ l }∪ {∅}, is the possibly censored genotype of founder k in configuration ω, and fr(a|{x, y}) was defined in (12) by using the genotype frequencies of the population. Moreover, if k and k' are distinct founders, we have
fr(a, a'|k, k', ω):= fr(a|k, ω)fr(a'|k', ω),
where the sum is taken over all possible genotypes at locus l. It remains to specify fr(a|k, ω) and fr(a, a'|k, k', ω), when k and k' are not founders and g_{ k }or ${{g}^{\prime}}_{k}$ contains non-ancestral alleles. For that let f and m be the father and the mother of k, and let x and y be his/her possibly censored alleles at locus l of paternal and maternal origin, respectively. Since we proceed recursively, we may assume that fr(a|f, ω) and fr(a|m, ω) are already computed. Then
where the formula holds also if k = k' or f = f' or m = m'.
These recursive formulae are exact, taking into account all intersections between the possible paths of the two non-ancestral alleles. In principle we could extend these formulae to allele triples, quadruples, and generally, for n-tuples. However, since we will consider only one genotype at a time, we need only the joint transmission probabilities for pairs of alleles.
A.5.2 Conditional genotype probabilities
For any pair of candidate parents f and m, belonging to generation 1 ≤ t ≤ T, we define the conditional genotype probabilities at every locus l of a hypothetical common child k by
P({a, b}|f, m, ω) := fr(a, b|f, m, ω) + 1(a ≠ b)fr(b, a|f, m, ω), a, b ∈ E_{ l }.
which corresponds to the conditional probability that an allele at locus l of a hypothetical common child is of type a. We also set, as a convention, P({∅, ∅}|f, m, ω) = 1.
A.6 Block-update I: Children choosing new parents
We consider a randomly selected group of children, indexed by k_{1}, ..., k_{ n }, all belonging to the same generation 0 ≤ t <T, who will choose new parents and redirect their ancestral alleles from their "old ancestors" to "new ancestors". Before choosing new parents, these children must withdraw their alleles from their old parents. Starting from configuration ω, we construct a modified configuration $\overline{\omega}$ in which, ascending from generation (t + 1) to generation T, we delete the paths of the ancestral alleles that go through the children k_{1}, ..., k_{ n }. Consequently all the ancestral alleles that were transmitted only by these children become censored.
Next we look at the individuals in generation (t + 1), including the part of the population which was left outside the ancestral graph, and consider all possible parental pairs (f, m). Recall that the individuals outside the ancestral graph have genotype {∅, ∅} by default, and their conditional genotype probabilities coincide with the population genotype frequencies. Following the sequential ordering, each of the children k_{1}, ..., k_{ n }will choose randomly a pair of parents from generation (t + 1) and transmit to them (and to their ancestors) his/her ancestral alleles. Locally this is like the construction of the initial configuration explained in section A.4, with the difference that here we must condition on the upper part of the ancestral graph and on the allelic paths that are determined by the modified configuration $\overline{\omega}$. Indeed, if child k has genotypes $\{{g}_{k}^{0}(l),{g}_{k}^{1}(l)\}$, then he/she chooses parental pair X_{ k }= (f, m) with a probability proportional to expression (8), where we extend the definition of P(g_{ k }(l)|g_{ f }(l), g_{ m }(l)) to non-founder parents by means of the conditional genotype probabilities P(g_{ k }(l)|f, m, $\overline{\omega}$) given in formulae (13) and (13).
A.6.1 Dropping and adding ancestors
After the children k_{1}, ..., k_{ n }have chosen new parents, it is possible that a former parent in the ancestral graph is left without children. In this case he or she will be dropped from the ancestral graph. By induction, the same may apply to the more distant ancestors in the elder generations. In order to obtain a reversible Markov chain, the children are given the possibility of choosing new parents also from the population outside the ancestral graph, whence these new parents together with their ancestors will become a part of the updated ancestral graph.
Note that it is straightforward to use the sequential construction of the prior distribution to sample the ancestry of the population outside the ancestral graph, conditionally on the ancestral graph. In principle such a resampling step must be included in the MCMC algorithm as a pre-move, before updating the ancestral graph. However, that may not be practical if the size of the population is large. In that case, instead of resampling the ancestry of all members of the population, it is enough to sample (conditionally on the current ancestral graph) the ancestry of a limited number of candidate parents outside the ancestral graph.
A.6.2 Resampling the paths of the ancestral alleles
the ancestral allele a was inherited from the grandfather f', and in case h_{2f-1}(l) was censored, we update it to h_{2f-1}(l) = a, and leave h_{2f}(l) = y. Otherwise a was inherited from the grandmother m', and if h_{2f}(l) was censored, we update it by setting h_{2f}(l) = a, and leave h_{2f-1}(l) = x.
a was inherited from m' and b was inherited from m". Here C is a normalizing constant. In each of these cases, the corresponding alleles of the parents become ancestral if they were non-ancestral before. This completes the updating procedure for the alleles of the parents.
If some censored allele became ancestral we must update the alleles of the grandparents as well, and possibly continue the procedure further backwards in time, until the alleles coalesce to some ancestral alleles or until the founder generation is reached. This is done in the same way as we updated the alleles of the parents.
We resume the updating procedure as follows. Given the choice of new parents, we sample new allelic paths for the ancestral alleles carried by the child. The path of an allele is a random walk on the ancestral graph, where in each generation the allele is assigned to either the paternal or the maternal origin, conditionally on the paths of the other ancestral alleles determined by the configuration $\overline{\omega}$. The new path of an ancestral allele is sampled sequentially until it crosses a path of an ancestral allele of the same type in the configuration $\overline{\omega}$, or until the path reaches the founder generation. Note also that, if the child transmits two ancestral alleles to the parents at some locus l, we are coupling the corresponding allelic paths in such a way that the paths are always compatible with each other and with the configuration $\overline{\omega}$.
A.6.3 Incorporating the true recombination likelihood
and the right hand side of the last expression was defined in formulas (14–17).
Similarly we improve the procedure which updates the paths of the ancestral alleles (see section A.6.2). When an ancestor receives alleles from his/her descendants, some of his/her censored alleles may become ancestral and we need to sample the phases of these alleles. The Viterbi algorithm can be used to sample all these phases jointly (keeping the phases of the ancestral alleles fixed) by combining the product of the sampling distributions across the marker loci as given in section A.6.2 with the recombination likelihood contribution of the children's haplotypes.
A.6.4 Completing the block-update
Once the procedure is completed, we have updated the modified configuration $\overline{\omega}$ by assigning new parents to children k_{1}, ..., k_{ n }and new paths to their ancestral alleles. The resulting updated configuration $\overline{\omega}$ is the proposal state in the Metropolis-Hastings algorithm. Note that when creating $\overline{\omega}$ we are also able to compute sequentially the proposal probability Q(ω → $\overline{\omega}$), and similarly, starting from $\overline{\omega}$, we can compute the proposal probability Q($\overline{\omega}$ → ω) for the reverse transition. Therefore the corresponding Metropolis-Hastings update can be implemented.
We also use slightly different versions of this block-update. In the first of these the children do not change parents but the paths of their ancestral alleles are updated simultaneously. This is done by simply skipping the sampling of parents described in section A.6. In the other version we let the children involved in the update belong to different generations.
Remarks
- (1)
If one child is selected to choose new parents under the model with free recombination this proposal distribution is a Gibbs' update (see section A.2.1). The same does not hold more generally, since, when updating the parental choice and the paths of the ancestral alleles of child k_{1}, we take into account neither the ancestral alleles of the other children k_{2}, ..., k_{ n }, nor the complete recombination likelihood.
- (2)
If more than one child are selected for an update, it is possible that a selected child k_{ i }does not find any parents compatible with his/her genotypes. This may happen when the children k_{1}, ..., k_{i-1}have already transmitted their alleles to their ancestors in such a way that it is no longer possible to extend the paths of all ancestral alleles of child k_{ i }up to the founder generation. In this case the proposed block-update is rejected.
A.7 Block-update II: Half-siblings changing one parent
We take a random father in generation (t + 1) ≤ T and consider all his children belonging to generation t, denoted by k_{1}, ..., k_{ n }. These children are going to stay with their original father but will choose new mothers and consequently the paths of their ancestral alleles will be resampled. (To be politically correct, we also use the symmetric update which switches the roles of mothers and fathers.) We could continue as in the previous sections, resampling sequentially, for one child at a time, a new mother and new paths of the ancestral alleles. However, there is a potential problem here: as explained in section A.6.2, after a new mother has been chosen, the parental phases of the alleles of child k_{1} are resampled without simultaneously considering the ancestral alleles of children k_{2}, ..., k_{ n }. When there are many children and many marker loci, it becomes unlikely that this procedure will assign several children to the same mother, and most of the time the algorithm proposes to add more mothers to the ancestral graph than would be necessary. As a consequence, if the true ancestral graph contains couples with many children, the corresponding Metropolis chain is slowly mixing, and the mixing gets even worse as the number of markers increases. To improve on that, we change the order in the resampling procedure. First the children k_{1}, ..., k_{ n }choose new mothers, and then we sample jointly the new parental phases of their alleles, by conditioning on the ancestral alleles of the new parents and taking into account the recombination likelihood contribution from the haplotypes of the children k_{1}, ..., k_{ n }. This block-update is computationally demanding, but it concerns only a small number of children at a time.
A.7.1 Sampling distribution for choosing the mothers
under the model with free recombination. Here at every locus l, ${\overline{g}}_{f}$(l) and ${\overline{g}}_{{m}_{1}}$(l) are possible values of the updated genotypes of the parents after k_{1} has transmitted to them his/her ancestral alleles, and ${\varphi}_{{k}_{1}}$(l) is the parental phase of the alleles of k_{1}. The computation of (20) is done by combining the computations from section (A.6.2).
Table 1
${\varphi}_{{k}_{1}}$(l) | ${\overline{g}}_{f}$(l) | ${\overline{g}}_{{m}_{1}}$(l) | P |
---|---|---|---|
0 | {a, ∅} | {a, b} | p _{1} |
1 | {b, ∅} | {a, ∅} | p _{2} |
1 | {b, ∅} | {a, a} | p _{3} |
Given that, we must specify how k_{ i }chooses his/her mother, and consequently, how the joint conditional distribution for the phases of the children's alleles and of the genotypes of the parents is updated.
Table 2
${\varphi}_{{k}_{1}}$(l) | ${\varphi}_{{k}_{2}}$(l) | ${\overline{g}}_{f}$(l) | ${\overline{g}}_{{m}_{1}}$(l) | P |
---|---|---|---|---|
0 | 1 | {a, c} | {a, b} | ${{p}^{\prime}}_{1}$ |
1 | 0 | {b, ∅} | {a, c} | ${{p}^{\prime}}_{2}$ |
1 | 0 | {b, b} | {a, c} | ${{p}^{\prime}}_{3}$ |
1 | 1 | {b, c} | {a, b} | ${{p}^{\prime}}_{4}$ |
Table 3
${\varphi}_{{k}_{1}}$(l) | ${\varphi}_{{k}_{2}}$(l) | ${\varphi}_{{k}_{3}}$(l) | ${\overline{g}}_{f}$(l) | ${\overline{g}}_{{m}_{1}}$(l) | ${g}_{{m}_{2}}$(l) | P |
---|---|---|---|---|---|---|
1 | 0 | 0 | {b, ∅} | {a, c} | {d, ∅} | ${{p}^{\u2033}}_{1}$ |
1 | 0 | 0 | {b, b} | {a, c} | {d, ∅} | ${{p}^{\u2033}}_{2}$ |
1 | 0 | 0 | {b, ∅} | {a, c} | {d, d} | ${{p}^{\u2033}}_{3}$ |
1 | 0 | 0 | {b, b} | {a, c} | {d, d} | ${{p}^{\u2033}}_{4}$ |
1 | 0 | 1 | {b, d} | {a, c} | {d, b} | ${{p}^{\u2033}}_{5}$ |
1 | 1 | 0 | {b, c} | {a, b} | {d, ∅} | ${{p}^{\u2033}}_{6}$ |
1 | 1 | 0 | {b, c} | {a, b} | {d, d} | ${{p}^{\u2033}}_{7}$ |
Note that now the event {${\varphi}_{{k}_{1}}$(l) = 0} has zero probability.
At the end of this sequential step children k_{1}, ..., k_{ n }have chosen new mothers in a way which is compatible with their ancestral genotypes and the information about the genotypes of the parents carried by the modified configuration $\overline{\omega}$. We have also produced a joint sampling distribution for the phases of the children and the ancestral genotypes of the parents. Therefore we could complete the block-update by sampling, independently at each marker locus, the parental phases of the children's alleles and the ancestral genotypes of the parents from the joint sampling distribution above, and by transmitting recursively the new ancestral alleles upwards to the ancestors as in section A.6.2. However, this strategy would not take into account linkage between the marker loci, which is considered next.
A.7.2 Sampling the phases of the children's and parents' alleles jointly across the marker loci
where f' and m' are the father and the mother of j.
where r_{l, l+1}(x) = ρ(l, l + 1)^{ x }(1 - ρ(l, l + 1))^{1-x}.
with normalizing constant C_{ l }= Q_{l+1}(ψ.(l + 1))/$\tilde{Q}$_{l+1}(ψ.(l + 1)).
Note that at any marker locus l different combinations of the children's phases and parents' phases may lead to the same vector ψ.(l) of children's grandparental origins. Therefore an additional step is required where, independently across the marker loci, the children's and parents' phases are sampled, conditionally on sampled grandparental origins of the children. This will determine also the ancestral genotypes transmitted to the parents. The block proposal is completed by transmitting the new ancestral alleles from the parents to their ancestors in the upper part of the ancestral graph. This is done exactly as in sections A.6.2 and A.6.3.
Remarks
- (1)
The computational complexity of this block update grows linearly with L, the number of markers, and exponentially with n, the number of children involved in the block update. In practice we restrict our sampling algorithm to values n ≤ 7.
- (2)
In this block update we are sampling a full meiosis vector (ψ_{ j }(l) : j ∈ J, l = 1, ..., L) which contains also the recombination pattern in the non-ancestral part of the haplotype. We have to proceed in this way, since it is not straightforward to sample the meiosis pattern only at the loci carrying ancestral alleles (the first-order Markov property across loci is lost). This means that we are temporarily extending the state space of the Markov chain algorithm with auxiliary variables. A theoretical justification is given in A.3. Note that, in order to compute the acceptance probability of this block update, we need to sample the meiosis pattern on the non-ancestral part of the haplotypes of the children as specified in the old configuration ω. This can be done directly by using the Kruglyak and Lander algorithm [31]. Once we have sampled the block update and computed the Hastings ratio, we erase the recombination pattern in the non-ancestral part of the haplotype and keep only the information about the paths of the ancestral alleles.
- (3)
In the construction of this block proposal we have included the recombination likelihood from the haplotypes of the children but not the recombination likelihood from the haplotypes of the grandchildren and of the ancestors. Also these recombination likelihood terms contribute to the acceptance probability of the proposal.
A.8 Block-update III: Sex-switching
We introduce one more update step into the algorithm. Consider the bipartite graph formed by the individuals in generation t > 0, where two nodes are connected by an edge if and only if the corresponding individuals have at least one common child. This bipartite graph is decomposable into connected components. We select a random connected component and obtain the proposal configuration $\overline{\omega}$ by switching the sexes of all the individuals in the selected component. Note that the prior distribution of the ancestral graph is not invariant under sex-switching (except for particular choices of the parameters α and β), but the distribution of the geneflow on the ancestral graph is, since our model for recombination does not depend on sex. Therefore, only the prior of the ancestral graph contributes to the Hastings' ratio which is given by min(1, P(G($\overline{\omega}$))/P(G(ω)). This simple update is important for the mixing of the sampler, since a fixed sex assignement would unnecessarily restrict the mating possibilities.
B Sampling the parental phases conditionally on the partially observed haplotypes of the children: a Viterbi algorithm
Consider a parent whose (fully observed) genotype at locus l is $\tilde{g}(l)=\{{\tilde{g}}^{0}(l),{\tilde{g}}^{1}(l)\}\subseteq {E}_{l}$, where the two alleles are arbitrarily ordered. Let ϕ = (ϕ(1), ..., ϕ(L)) ∈ {0, 1}^{ L }be the phase vector of the parent, i.e. if allele ${\tilde{g}}^{0}(l)$ was inherited from the (grand)father, then ϕ(l) = 0, whereas if it came from the (grand)mother, then ϕ(l) = 1.
Note that the origin vector ${\tilde{\zeta}}_{i}$ together with the phase vector of the parent ϕ determines the grandparental origins and the recombination pattern for the haplotype h_{ i }.
Now the problem is to sample the parental phase vector ϕ conditionally on the children's partially censored origin vectors ζ_{ i }, i = 1, ..., n, the partially censored parent's genotype vector g, and the information available on the genotypes of the grandparents.
Without loss of generality we assume that g(l) ≠ {∅, ∅} for all l, since we can skip the loci where the genotype of the parent is completely censored. We assume that, a priori, the phases ϕ(l), l = 1, ..., L, are independent with respective distributions π(0, l) = (1 - π(1, l)) ∈ [0, 1]. The prior distributions π(ϕ(l), l) can be specified by using the information available on the genotypes of the grandparents as explained in section A.4.
Given the censoring pattern δ_{ i }on haplotype i, we denote the last uncensored locus up to l by j(δ_{ i }, l) := max{k ≤ l : δ_{ i }(k) = 1} if such a locus exists, and otherwise set j(δ_{ i }, l) := 0. Let ${\epsilon}_{l}:=\left\{j\left({\delta}_{i},l\right):1\le i\le n\text{and}j\left({\delta}_{i},l\right)0\right\}$ and j(δ_{ i }, l) > 0}. Note that ${\mathcal{E}}_{l}\subseteq ({\mathcal{E}}_{l-1}\cup \{l\})$ and since by assumption the parent transmits at least one allele to at least one child at each locus, l ∈ ${\mathcal{E}}_{l}$ = {l} ⊆ {1, ..., l} for all l. Note also that without censoring, we would have simply ${\mathcal{E}}_{l}=\left\{l\right\}$ for all l.
We then consider the process ${\overline{\varphi}}_{l}=(\varphi (k):k\in {\mathcal{E}}_{l})\in {\{0,1\}}^{\left|{\mathcal{E}}_{l}\right|}$ for l = 1, ..., L. This is a Markov chain whose transition law $P({\overline{\varphi}}_{l}|{\overline{\varphi}}_{l-1})$ is simply described: the coordinates ϕ(k) do not change for $k\in ({\mathcal{E}}_{l-1}\cap {\mathcal{E}}_{l})$, whereas ϕ(l) is sampled independently from the parental phase prior π(ϕ(l); l). At each locus l > 1 we observe ζ(l) = (ζ_{1}(l), ..., ζ_{ n }(l)) where only the components in
K(l) = {i : ζ_{ i }(l) ≠ ∅ and j(δ_{ i }, l - 1) > 0}
Here R(0, 0, ρ) = R(1, 1, ρ) = 1 - ρ, R(1, 0, ρ) = R(0, 1, ρ) = ρ, ρ (k, l) is the recombination fraction between the loci k and l and ${\mathscr{H}}_{l}$ := {ζ(k) : k ≤ l}.
Summarizing: If we do not have censoring, (ϕ(l), ζ(l)) is a Markov chain with hidden state ϕ(l) ∈ {0, 1}, and given the censoring pattern (δ_{ i }: i = 1, ..., n), we have constructed a Markov process ${\overline{\varphi}}_{l}$ with an enlarged state space ${\mathcal{E}}_{l}$ depending on the locus l, such that the conditional distribution of the observation process ζ(l), given ${\overline{\varphi}}_{l}$ and (ϕ(k), ζ(k) : k <l), depends only on ${\overline{\varphi}}_{l}$, ${\overline{\varphi}}_{l-1}$ and the previous observations (ζ(j(δ_{ i }, l - 1)) : i ≤ n) at the last uncensored loci, i.e. given the censoring pattern δ_{ i }, (${\overline{\varphi}}_{l}$, (ζ_{ i }(j(δ_{ i }, l)) : i ≤ n)) is a Markov process.
After having enlarged the state space, we are back to the framework of hidden Markov models, and we can use the Viterbi algorithm to sample from the joint posterior distribution the vector $({\overline{\varphi}}_{1},\mathrm{...},{\overline{\varphi}}_{L})$ (which contains the same information as the vector (ϕ(1), ..., ϕ(L)), conditionally on the observations (ζ_{ i }(1), ..., ζ_{ i }(L) : i = 1, ..., n). Next we sketch the algorithm.
By the same method one computes the posterior probability of a given sample (ϕ(1), ..., ϕ(L)), and it would also be possible to find the phase vector of maximal posterior probability by using dynamic programming.