IBD graph
An IBD graph is constructed over a set of N haplotypes at a genomic locus as follows. Each haplotype is represented by a node and there exists an edge between nodes if the two haplotypes to which the nodes correspond are IBD at the locus. Valid IBD graphs obey a transitivity property such that if individuals 1 and 2 are IBD and individuals 2 and 3 are IBD, then individuals 1 and 3 are IBD [20]. An IBD graph is transitive if the edges obey the transitivity property, otherwise the graph is intransitive and can not represent the true state of IBD at the locus. Due to the transitivity property, all connected components of a valid IBD graph at a locus are cliques. We leverage the clique property of IBD graphs to improve the pairwise probabilities of IBD output by existing software packages such as Refined IBD [14].
Probabilistic IBD graph
Given the probability of IBD between all pairs of N haplotypes at a genomic locus, we construct a probabilistic IBD graph G
P
= (N, P) as follows. Each haplotype i is represented by a node n
i
∈ N. For every pair of haplotypes there is an edge assigned probability p
ij
∈ P where p
ij
is the probability of IBD between haplotypes i and j at that locus.
Given a probabilistic IBD graph G
P
we consider a proposed IBD graph g = (N, E) over the nodes of G
P
. Any proposed IBD graph g represents a different scenario of how individuals in G
P
can be IBD to each other at the given genomic locus. The probability P (g | G
P
) of g conditional on the probabilistic IBD graph G
P
is computed as follows. We define I as the set of all IBD graphs derived from the nodes of G
P
. For each g ∈ I the conditional probability of g on G
P
, P(g | G
P
) is the product of induced edge probabilities. An edge e
ij
= 1 if it is present in g and it has induced probability p
ij
. An edge e
ij
= 0 if it is not in g and has induced probability (1-p
ij
).
(1)
Updating the pairwise IBD graph
Our objective is to update the probability of each pair of individuals being IBD by conditioning on the probabilities of all pairs in the graph. The intuition is best understood with an example. Consider a probabilistic IBD graph with three nodes (i ∈ {1, 2, 3}) and suppose the initial pairwise probabilities have been assigned by a pairwise IBD-calling algorithm such that p12 = 0.9, p13 = 0.9, and p23 = 0.1. The edges e12 and e13 have a higher probability of IBD than e23, but given that true IBD graphs obey transitivity, the probability of e12 and e13 conditioned on edge e23 will be lower. Similarly, the probability of e23 will be higher when conditioned on e12 and e13 as shown in Figure 1b. By the transitive rule when 2 of the 3 edges in a triangle have a high probability of IBD we expect the third edge to have a high probability of IBD as well.
The conditional probability of an edge given the probabilities of the graph, , is the sum of the probabilities of all transitive graphs in which the edge is present, divided by the sum of the probability of all transitive graphs. We compute the conditional probability using only transitive graphs since we know that an intransitive graph is a biologically implausible scenario. We define V as the set of transitive IBD graphs derived from the nodes of G
P
.
(2)
All such transitive graphs and their respective probabilities are shown in Figure 1a. For illustrative purposes, we include an intransitive graph in the bottom right of Figure 1a. We update each edge by using Equation 2 and the resulting graph with updated probabilities is shown in Figure 1b. To further clarify how we compute a conditional edge probability, we compute the conditional probability of edge e23:
Computing exact conditional probabilities requires computing the probability of every transitive IBD graph, which has a sample space of size O(2h(h−1)/2), where h is the number of haplotypes. Unfortunately, this is computationally infeasible to enumerate and so we develop a sampling method that can be used to efficiently approximate conditional edge probabilities.
Efficient computation of conditional IBD probabilities
We start by generating the probabilistic IBD graph for a given genomic location. We only consider the unique positions along the genome where the IBD graph changes, or more specifically, the points where the initial IBD segments begin or end. Analyzing other positions would be redundant because the positions provide no information about how the IBD graph changes. An initial graph is generated by adding in all edges output by Refined IBD [14] that pass a LOD score threshold. Alternative pairwise IBD probability methods may be used if desired. We identify the connected components of this graph because edges that are part of disjoint components have no effect on each other when computing updated probabilities.
For each connected component c, we construct G
P
by translating the Refined IBD LOD scores to probabilities (see Section Converting LOD scores to probabilities). A connected component c may have edges that were never called by Refined IBD and thus have a probability of 0. We assign uncalled edges the probability ϵ = 0.0046 in order to ensure that P (g | G
P
) > 0 and that the edge can be sampled (discussed in detail in Section Converting LOD Scores to Probabilities). Then instead of enumerating the set of all possible valid graphs V inducible by G
P
we sample from V. We define N
g
as the current sum of probabilities of all sampled graphs so far and N
ij
as the current sum of probabilities of all sampled graphs containing edge e
ij
. At any stage in the sampling process, the estimate of the conditional probability that individuals i and j are IBD is . If all valid graphs are sampled with equal probability, this converges to the exact conditional probability shown in equation (2). The sampling procedure is given in Algorithm 1.
Algorithm 1 Graph sampling
Input:
G
P
Output: N
g
, N
ij
(∀i, j; i ≠ j)
N
ij
= 0, N
g
= 0
for all i, j do
if p
ij
≥ 0.99 then e
ij
= 1
else e
ij
= 0
Add edges to make all connected components of G
p
cliques
Compute P(g | G
P
)
N
g
+ = P (g | G
P
)
for all i, j where e
ij
= 1 do N
ij
+ = P (g | G
P
)
while has not converged ∀i, j do
Sample a random e
ij
and set e
ij
= 1 with probability p
ij
Ensure graph transitivity Compute P (g | G
P
)
N
g
+ = P (g | G
P
)
for all i, j where e
ij
= 1 do N
ij
+ = P (g | G
P
)
Edges are sampled according to a weighted distribution where weight w
ij
is based on p
ij
and is defined as:
(3)
(4)
If σ ≈ 0, then edges with p
ij
≈ 1 or 0 will almost never be sampled since the selection weights of such edges will be infinitesimally small. Similarly if σ is too large, then all edges will be assigned similar selection weights and as a result graphs will be sampled uniformly instead of in proportion to their probability. We selected σ = 0.234 because it allows for efficient convergence times (see Convergence properties and runtime).
This weighted sampling scheme assures that edges with p
ij
≈ 1 or 0 are sampled less often than edges with p
ij
≈ 0.5. Intuitively this makes sense because we sample proposed IBD graphs in proportion to their respective P (g | G
P
). Edges with p
ij
≈ 1 induce a proposed graph with a greater P (g | G
P
) when they are present. Similarly edges with p
ij
≈ 0 induce a proposed graph with a greater P (g | G
P
) when they are missing. Thus, to sample the most probable graphs more often, edges with high values of p
ij
should typically have e
ij
= 1 and edges with low values of p
ij
should typically have e
ij
= 0. Changing the state of an edge can cause the proposed graph g to be intransitive. Therefore we add or remove edges from g to ensure transitivity. At each iteration if an edge has p ≥ 0.99 then p
ij
is set to 1 so that we never sample very high probability edges.
Algorithm 2 Ensure graph transitivity
Input: G
P
, e
ij
that was just added or removed
Output: Transitive G
P
if e
ij
= 1 then
Add edges to make all connected components of G
p
cliques
else
S
i
= nodes connected to i, S
j
= nodes connected to j
for all p
mk
< 0.99 do e
mk
= 0
for each connected component X, where |X| > 1 do
∀ nodes x ∈ X
if then Add all nodes of X to S
i
else if then Add all nodes of X to S
j
else Flip a fair coin
for randomly selected k ∉ (S
i
∪ S
j
) do
if then Add k to S
i
else if then Add k to S
j
else Flip a fair coin
Make S
i
and S
j
cliques
We continue selecting edges until we reach a convergence criterion or a user-set time limit has been reached. As a convergence criterion we check if all edge probabilities change less than 1 × 10−11 for 5000 sequential iterations. With larger graphs that may never reach the convergence criteria, we allow the user to set a runtime limit. After convergence or hitting the user runtime limit, we output all edges and their respective updated probabilities. We tried a variety of sampling schemes to explore the space of graphs and selected this one due to its performance over simulated datasets (see Application to simulated data).
Converting LOD scores to probabilities
To find the relationship between LOD scores and the true positive rate of IBD segments we ran Refined IBD on simulated data (see Application to simulated data) using a LOD score cutoff of 0.1 and a length cutoff of 0.1 centimorgans. A true positive segment is defined as a predicted segment that is at least 50% true IBD. We fit a curve to the observed relationship between LOD score and true positive rate of IBD segments (see Figure 2). The equation of our curve is of the form p = (2o + af)/(o + f ) where o = posterior odds, f = (prior * (103)/.997)−(prior * (103)), a = (1 − LOD)3/7 if LOD ≤ 1, and a = −0.15 otherwise. The values for f and a were chosen to maximize the fit of the curve.
Since the Refined IBD [14] LOD score is the negative base 10 log likelihood of one shared haplotype divided by the likelihood of no shared haplotypes, we use Bayes rule for odds to convert a LOD score into a posterior odds: , where O(A | B) is the posterior odds, O(A) is the prior odds, and is the likelihood ratio.
For the prior, we use the probability of any two individuals in the sample being IBD at any point in the genome, ϵ = 0.0046. This is the average proportion of the genome shared IBD between all pairs of individuals estimated using results from Refined IBD over simulated data. For edges that have a probability of 0 (i.e. an edge with no pairwise call), we assign a probability equal to the prior because otherwise these edges would never be sampled and graphs would have a probability of 0.
Ideally, the relationship between LOD score and true positive rate is given by p = odds/(1 + odds). However, the relationship between LOD score and true positive rate in our sample of simulated individuals deviates from this theoretical relationship. Our function and p = odds/(1 + odds) are of the same form (i.e. g(x) = (cx + d)/(mx + n)). This served as our motivation in defining the conversion function. Lastly, given that the simulated data we generated is reflective of European population growth, the relationship between LOD score and true positive rate may differ in other populations (see Discussion).
Merging results across graphs and inferring new segments
IBD segments can span multiple regions and our method analyzes IBD at a single region. The probability of IBD between two individuals can therefore be output at multiple adjacent regions by our method. Furthermore, the IBD probability may be assigned a different value in each region due to the inexact nature of the sampling method. If the same IBD segment is assigned different probabilities across multiple loci we use the maximum value across all regions.
Once an IBD graph is analyzed using the sampling procedure, edges that were previously missing (i.e. those that were not called by Refined IBD) are output with a start and stop site that is equal to the intersection of all IBD segment boundaries in the graph. Since we do not look in the region for sequence identity between haplotypes we can only output the probability that IBD exists somewhere within the region. These new segments may also overlap with other called IBD segments. In order to reconcile overlapping IBD segments, we merge them provided that they pass a probability threshold set by the user and that they lie on the same haplotype. As the final probability, we use the maximum of the merged segments. For all analyses presented here, we only merged segments that had a probability of 0.99 or greater.
Creating simulated IBD data
We generated simulated genotype data as previously described by [14]. To start, we use Fastsimcoal [21] to generate phase known DNA sequence data of 2000 diploid individuals. A single individual is represented as one chromosome consisting of ten independent 30 MB regions, each with a mutation rate of 2.5 × 10−8 and a recombination rate of 10−8. The population simulated begins with an effective population size of 3000 diploid individuals with a growth rate of 1.8% at time t = 300 (where t is the number of generations ago from the present). Moving forward in time, the growth rate was changed to 5% and to 25% at times t = 50 and t = 10 respectively, resulting in a final effective populations size of 24,000,000 at t = 0. The simulation is reflective of European population sizes estimated from the linkage disequilibrium of common variants [22].
Using the DNA sequence data we create genotype data by first filtering single nucleotide polymorphisms (SNPs) that were not bi-allelic with a minor allele frequency (MAF) less than 2%. Next, we choose 10,000 variants uniformly by MAF (where 2% ≤ MAF ≤ 50%) per 30 MB region. This SNP density is in line with that of a 1,000,000 SNP genotyping array. Finally, we remove all phase information and apply a genotyping error at a rate of .05% by turning heterozygous genotypes into homozygous genotypes and vice versa. Using the simulated genotype data, we use Refined IBD [14] to phase the data and call pairwise IBD. We define true IBD segments as those segments longer than or equal to 0.1 centimorgan. A potential consequence of this approach to creating simulated data is that the resulting IBD graph may not completely obey transitivity.