Two molecular measures of relatedness based on haplotype sharing

Background Measuring the extent of shared ancestry between individuals or organisms is important in many fields, including forensic science, conservation genetics and animal breeding. The traditional approach is to calculate the expected degree of relatedness between individuals in a pedigree. This assumes that the founders of the pedigree are non-inbred and unrelated to each other, which is rarely the case. In contrast, molecular data allow measurement of actual relatedness without knowledge of a pedigree. Methods to do this have been proposed, but generally do not take the lengths of the genomic regions shared between individuals into account. Results Two measures based on the extent of haplotype sharing between genomes are proposed. The intercept measure B estimates the fraction of shared genome between individuals, and the product measure C is closely related to the numerator relationship matrix A. Both are based on a model for the joint distribution of markers at the haplotype level. The two measures are compared to the pedigree-based measure A and to vanRaden’s G, a frequently used molecular measure, using a set of data comprising 5037 dairy cattle. The comparison criteria include the ability to capture genealogical relatedness and the prediction accuracy obtained when used in genomic prediction. Both B and C explain around 95 % of the variation in A, whereas G explains around 6 %. G captures genealogical relatedness poorly, particularly for distantly related individuals (second cousins or farther). Both B and C tend to be larger than A but this can be ascribed to the assumption of non-inbred unrelated founders. Using C in linear mixed models results in slightly higher prediction accuracy than G, and using B results in slightly lower prediction accuracy. Conclusions The two proposed measures of relatedness capture genealogical relatedness well, outperforming vanRaden’s G in this respect. When used in genomic prediction models, the product measure leads to slightly improved prediction accuracy.


Background
Estimating the extent of shared ancestry between individuals or organisms is central to many fields. Examples range from forensic science [1], studies of population structure [2,3], and conservation genetics [4], to the mixed linear models used in genomic prediction and genome-wide association studies [5,6].
Following Malêcot [7], measures of relatedness between two individuals are generally formulated in terms of the coefficient of coancestry, which is the probability that for a randomly selected gene, two alleles, one taken at random from each individual, descend from a single ancestral gene -that is, the probability that the alleles are identicalby-descent (IBD). Similarly, a measure of inbreeding for an individual is defined as the probability that the two alleles of a randomly selected gene are IBD [7]. When a pedigree is available, probabilities that two alleles are IBD from common ancestors within the pedigree can be calculated, and from these the classical measures of relatedness and inbreeding can be derived. The calculations typically assume that the founders are unrelated and not inbred, which is rarely the case. The measures depend on the choice of pedigree, and represent expected rather than realized relatedness and inbreeding, since they cannot incorporate the randomness inherent in meiosis.
With the advent of high-scale genotyping technologies such as single nucleotide polymorphism (SNP) arrays, it is possible to estimate realized relatedness directly from molecular data without knowledge of genealogy, and a variety of ways to do this are available [8,9]. These generally take the form of genome-wise averages of single-SNP statistics, which have the disadvantage of not taking the lengths of genomic regions shared between two individuals into account [8]. In Section "Methods, " two novel measures based on the extent of haplotype sharing are described, and their properties studied. In Section "Results" the methods are applied to a set of data from dairy cattle, and compared to the classical pedigreebased measure and a measure due to vanRaden [10] that is widely used in animal breeding. Section "Software" describes the software used and Section "Discussion" gives a brief discussion.

Methods
The methods developed in this paper are based on the following conceptual framework. The genome is divided into a series of physical intervals, and the variant DNA strings that may occur in the intervals are denoted segments. To each interval corresponds a set of segments that may occur in the interval, and these sets do not overlap: a segment that may occur in one interval may not occur in another. With some abuse of terminology one may identify the intervals with genes, and the possible segments with alleles. A specific genome is regarded as a collection of segments, and measures of relatedness between genomes are constructed in terms of similarity between such collections. For this the concept of a multiset is needed.

Multisets
A multiset [11] is a generalization of the concept of a set that, unlike a set, allows multiple instances of its elements. The multiplicity of an element is the number of instances of the element in the multiset. For example, [2 figs, 5 pears, 3 plums] is a multiset in which the element fig has multiplicity two. Note the use of square brackets [ ] to distinguish multisets from ordinary sets using curly brackets {}. A multiset corresponds to an ordinary set if the multiplicity of every element is one or zero.
Multiset intersection is a generalization of set intersection. The intersection of two multisets is formed by taking the minima of the multiplicities of the corresponding elements in the two multisets. For example: The sum and product multiset operators, represented by + and × respectively, use the straightforward elementwise operations, for example: The cardinality of (number of elements in) a multiset A is written |A|. Some useful properties of the operators include the equations |A + B| = |A| + |B|, that hold for any multisets A, B and C [11].

Two measures of genomic relatedness
As described above a genome is taken to be composed of a collection of segments, taken from a larger pool of segments S. Let G i represent the genome of individual i, regarded as a collection of segments s in S. Since there may be duplicates, G i is a multiset. Let x is represent the multiplicity of segment s in G i .
Let there be p intervals (loci). For s ∈ S, let l(s) ∈ {1, . . . , p} indicate the interval associated with segment s. At each interval, an individual genome has two segments, corresponding to its two haplotypes. Thus each genome has in all 2p segments.
A natural definition of the similarity of two individuals is the fraction of genome that they share. So for individuals i and j the intersect measure of their similarity is defined as the cardinality of the intersection of the two genomes, divided by the total number of segments: where x ∧ y is the minimum of x and y. Since |G i | = 2p, we have b ii = 1 for all i. For all i and j, 0 ≤ b ij ≤ 1. Also b ij = 0 iff G i and G j have no common segments, and b ij = 1 iff G i and G j are identical. The product measure is defined similarly using the product operator: We have 0 ≤ c ij ≤ 2 for all i and j, with c ij = 0 iff G i and G j have no common segments. In matrix terms C = (c ij ) can be written as To relate the two measures, note that when x and y take values in {0, 1, 2}, xy is given by where κ ij , a measure of shared heterozygosity, is defined as To relate C to the numerator relationship matrix [12], let θ ij be the coefficient of coancestry between individuals i and j, that is, the probability that for a randomly selected gene, two alleles, one taken at random from each individual, are IBD [7]. Define ϑ ij similarly as the probability that for a randomly selected gene, two alleles, one taken at random from each individual, are identical, that is, identical-by-state (IBS). For k = 1, . . . , p, write the two segments of individual i at interval k as (s k i1 , s k i2 ), and similarly (s k j1 , s k j2 ) for individual j. Then for i = j, s∈S:l(s)=k Note that ϑ ij is the probability of an event randomly chosen from the 4p identities on the right-hand side of Eq. (4). Hence ϑ ij = 2pc ij /4p, and so c ij = 2ϑ ij . Thus c ij is twice the IBS-sense coefficient of coancestry ϑ ij .
Similarly, let θ i be the coefficient of inbreeding for individual i, that is, the probability that for a randomly selected gene, the two alleles are IBD [7], and let ϑ i be the corresponding IBS-sense quantity. When i = j, we obtain and as just shown hence C and A are conceptually closely related. An assumption behind this assertion is discussed in Section "Discussion".

Defining the segmentation
To define the segmentation a statistical model in the form of an acyclic probabilistic finite automaton (APFA) [13] is used. Such models allow the extent of haplotype sharing within and between genomes to be quantified, and underlie the Beagle program [14,15] that is widely used for processing high-dimensional SNP data. Phase estimation, imputation and model selection are performed simultaneously, using the algorithm described in [15]. Beagle is highly efficient, taking only a few minutes to process each chromosome for the data described in Section "Data and computations", and performs well: for example, imputation accuracy rates generally exceed 97 % in cattle data [16].
An APFA is represented as a directed multigraph A = (V , E), where V is a vertex set and E an edge set: a small example is shown in Fig. 1. This is a model for the joint distribution of p = 20 markers at the haplotype level. To each edge in E is attached a probability, such that the sum of probabilities of the outgoing edges from each vertex in V is one. Each haplotype corresponds to a path through the graph from the root (the leftmost vertex) to the sink (the rightmost vertex). The probability of a haplotype is the product of the probabilities of the edges in its root-tosink path. See further [17,18].
In [14,15] the haplotypes that traverse a given edge are known as a haplotype cluster. Here a different perspective is adopted. The edges of the APFA are taken to represent chromosomal segments, that is, we set S = E. So if two haplotypes traverse the same edge in an interval they are taken to share the same DNA in that interval, and if they traverse different edges, they are taken not to share DNA in that interval. The data may be represented as an N × |E| matrix X taking values in {0, 1, 2}, whose (i, j)th element specifies the multiplicity of segment j in individual i. The variables corresponding to the columns of X are called haplomarkers, and X is called the haplomarker design matrix. Figure 2 illustrates recombination under the model of Fig. 1. The two haplotypes of an individual correspond to two root-to-sink paths in the graph, and recombination is seen as crossing-over between the paths. If the individuals represented in Fig. 2 are taken in the order mother, father and offspring, we find the relationship matrices

Expected relatedness
Consider first three individuals, i, j, and k, where i and j are the parents of k. We examine how the parentoffspring relatedness measures depend on the relatedness of the parents. Specifically, expressions for the expectations of the parent-offspring relatedness conditional on the parental relatedness will be derived.
During meiosis the genomes G i and G j are first partitioned into two gametes, say The partitioning process (segregation) is complex and stochastic, but only properties that are invariant to this are considered here. Then the genome or H 2 j , and the four combinations are equiprobable. Thus Similarly and Now consider four individuals, h, i, j and k, where again i and j are the parents of k, and where the relatedness between h, i and j is known.
An expression for the expectation of the intersect measure may be derived in a similar fashion: but I have not been able to derive an expression for E(b kh ) corresponding to (12).
Expressions (8), (10) and (12) are identical to those used in the calculation of the numerator relationship matrix using the algorithm of [19]. When only a subset of individuals in a pedigree are genotyped, a hybrid expected/realized relationship matrix R = (r ij ) exploiting both pedigree and genomic information can be obtained using the following simple modification to the algorithm.
Order the individuals so that parents precede their offspring and label them 1, . . . , N. Write the subset of genotyped individuals as S, and for all i, j ∈ S, set r ij = c ij , the realized genomic relatedness described above. For k = 1, . . . N, derive the relatedness between an individual k and the preceding individuals as follows. When k ∈ S, set where i and j are the parents of k. If either or both i and j are unknown (i.e., not in the pedigree) assume they are unrelated, that is, use r ij = 0 in this calculation. This algorithm adjusts the expected relationships downstream of S in the pedigree. An alternative method that adjusts all the relationships outside of S is sketched in the following subsection.

Modifying A
This subsection describes an established technique that is useful in various contexts. The individuals are partitioned into two groups, so that A = A 11 A 12 A 21 A 22 and We regard A as a covariance matrix and wish to construct a newÃ for which the marginal covariance of group one is set to A * 11 , and the conditional (co)variance of group two, given group one, is kept the same. That is to say, such that A 12 , A 21 = (A 12 ) T and A 22 are retained. So we require that where * denotes unspecified and E is an increment matrix to be found. Using standard results on inverses of partitioned matrices we obtain is the required increment. Write the matrix obtained in this wayÃ = A|A * 11 . This technique is used when a subset of individuals in a pedigree are genotyped, to compute a hybrid expected/realized relationship matrix exploiting both pedigree and genomic information [20,21].

Results
In this section empirical comparisons are made between the proposed relationship matrices B and C, the numerator relationship matrix A derived from the pedigree, and the matrix G of vanRaden [10].

Data and computations
The data used in this analysis are genotypes and complex traits for 5037 Nordic Holstein bulls. The 5037 bulls were genotyped using a 50K chip and then imputed with Impute2 [22] to 777K (HD), using a reference panel of 1197 HD genotyped bulls. The pedigree-based relationship matrix (A) for the 5037 bulls was derived from the Nordic Holstein pedigree of year 2013 (which contains a total of 134832 animals) in the standard way.
The genomic relationship matrix (G) following [10] was calculated from the marker data, using where M = (m ik ) is the N ×p marker design matrix whose elements take values in {0, 1, 2}, andm k is the mean allele frequency of the relevant allele of the kth marker, that is, Thus the allele frequencies are set to those in the current sample. Beagle version 3.3.2 was applied to the unphased marker data from the Holstein bulls, and the B and C matrices were derived from the Beagle output files. Beagle uses two tuning parameters, m and b. The larger the parameters, the simpler the selected APFA. The settings m = 1 and b = 0, suggested in [15,18], were used below in the following sections, except Section "Prediction": here the settings m = 4 and b = 0.2 suggested in [23] were used, since they result in slightly better prediction accuracy.

Relatedness and pedigree distance
In Sections "Two measures of genomic relatedness" and "Expected relatedness" it was seen that there is a close conceptual relationship between C and the numerator relationship matrix A. This section examines empirically the extent to which the measures capture the genealogical relationships between individuals. This is done in several ways.
In a crude but informative approach, the distance between each pair of bulls may be calculated as the length of the shortest path between the bulls in the pedigree. For example, full-and half-sibs are at distance two, first cousins are at distance four, and second cousins are at distance six. Figure 3 shows sample densities of the four measures broken down by distance for all pairs of animals. Corresponding summary statistics are shown in Table 1.
The parent-offspring pairs are clearly identified by all four measures, but for the more distant pairs there appears to be least separation for the G measure. For the A measure, distances of four and above are less than 0.2, with a spike close to zero at distance 9, reflecting the assumption of unrelated founders. There are distinct peaks for most distances, indicating good separation between these. a b c g Fig. 3 Sample densities of relationship measures by pedigree distance. Sample densities broken down by pedigree distance for four measures: a the numerator relationship matrix; b the intersect measure; c the product measure and g the genomic relationship measure of [10]  The B and C measures at distance four and above are larger, being less than 0.3, also with good separation. The G measures for distance five and above are centered around zero, so about half of the values are negative, and there is poor separation. This suggests that G performs relatively poorly for distantly related individuals, say second cousins or farther.

Comparison with pedigree-based relationships
The distance measure just described is crude since pairs of animals at a given distance may be more or less related, due to varying numbers and lengths of lineage paths between them and common ancestors in the pedigree. The A matrix takes this into account and is the natural pedigree-based measure. The upper three subplots of Fig. 4 show smoothed scatterplots of A versus G, B, and C, for all distinct pairs of animals in the data. It is seen that the bulk of the points lie above the identity line in the A versus G plot, and under the line in the A versus B, and A versus C plots: that is, G tends to underestimate A whereas B and C tend to overestimate A. See also Table 1.
Adjusted R 2 statistics based on simple linear regression models with no intercept, as shown in Fig. 4, indicate that G only explains around 6 % of the variation of A, whereas both B and C explain around 95 %.

Comparison using consistency
The tendency for B and C to be larger than A could be due to the assumption of non-inbred, unrelated founders that underlies A: if this is false, deflated estimates of relatedness and inbreeding would result. To examine this possibility we use the technique described in Section "Modifying A" to examine the consistency of B, C and G with A, in the following way. A random sample of 1000 animals from the genotyped Holstein bulls is taken, and called group one. The matrices A|G 11 , A|B 11 and A|C 11 are derived and compared with the realized relationships, that is, the off-diagonals of (A|G 11 ) 22 are compared with those of G 22 and so forth. The process is repeated for 10 random samples of size 1000. The three lower subplots in Fig. 4

Comparison of inbreeding coefficients
Finally, Fig. 5 shows smoothed scatter plots comparing the inbreeding coefficients obtained from A with those from G and C. It is seen that C explains 89 % of the variation in A whereas G explains only 42 %. The inbreeding coefficients from A are non-negative, but negative values occur in G. The inbreeding coefficients from C are consistently larger than those from A, which may reflect assumptions of non-inbred unrelated founders underlying A. To examine this, the consistency of the inbreeding coefficients from C and G are compared using the method just described: the results are shown in Fig. 5. The coefficients from C 22 are slightly smaller than those from (A|C 11 ) 22 . Thus the difference may at least in part be due to assumptions of non-inbred unrelated founders. The consistency of the inbreeding coefficients from G with those from A is very poor. Estimates of inbreeding coefficients based on G may be sensitive to choice of allele frequencies in the base population [10].

Prediction
To compare the use of the relatedness measures in prediction, breeding values were predicted using a genomic restricted maximum likelihood (G-REML) model of the form where y is the response vector, μ is the overall mean, 1 is a vector of 1's, g is a vector of breeding values, and e is a vector of residuals. It is assumed that g ∼ N(0, V σ 2 g ) and independently e ∼ N(0, Dσ 2 e ), where V is a relationship matrix (i.e. A, B, C, or G), and D is a diagonal matrix with elements d kk = (1 − r 2 k )/r 2 k to account for heterogeneous residual variances due to varying reliability r 2 k of the complex trait y.
The analysis was performed with the package DMU [24] applied to data from the training set, using the four relationship matrices. To examine prediction using less related individuals, a reduced training set was constructed by excluding all sires and grandsires of any animal in the test set from the training set. The prediction accuracy, that is, the correlation between the predicted and observed values in the test set, are shown in Table 2. It is seen that C consistently has the highest prediction accuracy, though the improvement over G is modest, of the order of 0.4 % when the full training set is used, and 0.6 % when closely related animals are excluded from the training set. In contrast, B has slightly less prediction accuracy than C for both the full and the reduced training set.

Software
Beagle version 3.3.2 was used to select APFA on which the measures are based. A C++ program, available from the author, was written to construct the B and C relationship matrices from Beagle output files. DMU [24] was used to perform the REML analyses. The remaining computations were performed using R: in particular, the A matrix was computed using the pedigree package, and Figs. 1 and 2 were constructed using the gRapfa package [18].

Discussion
Two novel measures of relatedness based on shared haplotypes were introduced in Section "Methods". The intersect measure (B) is an estimate of the fraction of shared genome, and the product measure (C) is closely related to the numerator relationship matrix (A) [12].
The framework underlying the measures is that of a diploid genome divided into intervals and segments (or genes and alleles) in which it is assumed that segments are not shared between intervals, so that the multiplicity of each segment in a genome is in {0, 1, 2}. It would be interesting to examine whether the measures can be extended to polyploid genomes, and whether the assumption of no shared segments, which cannot accommodate phenomena such as gene duplication, can be relaxed.
The close conceptual relation between C and A rests implicitly on an assumption that whenever two haplotypes share the same segment (in the APFA context, traverse the same edge) their chromosomal segments are indeed identical (IBS). This is a strong assumption, and most likely only approximately true. Probably the reason that the proposed measures capture genealogical relatedness well here is that the APFA-based segmentation is in reasonable accordance with this assumption. If the sample size were small, a overly simple APFA would be selected, leading to over-estimation of haplotype sharing. Similarly, if segments were defined directly using the marker alleles (say, with two segments per interlocus interval corresponding to the alleles of a flanking marker), the assumption would be violated, and the resulting measures may be expected to capture relatedness poorly. If full-sequence data are available, it would in principle be possible to verify the assumption for the APFA-based segmentation, or perhaps to develop an improved segmentation method.