Genotyping scenarios
To evaluate, first, how bead-chip genotype data respond to pooling and, second, how imputation methods perform on pooled data, we designed the following simulation experiment. We build two marker sets with genotype data from a human population at low and high density (the LD and HD data sets) by extracting from the 1000 Genomes Project (1KGP) data set only those markers that are present on one lower-density and one higher-density Illumina bead-chip in common use. We then compare the performance of two approaches for genotyping markers at high density. The first approach serves as a baseline and simulates the usual study case where part of the markers are genotyped at low density in a target population and the remaining markers are imputed based on a high-density reference panel. The second approach genotypes markers at high density from pools of individuals and then uses imputation for those individual genotypes that cannot be fully decoded from the pooling.
Data sets and data preparation
We use data from the well-studied reference resource made available by the 1KGP, more specifically phase 3 v5 [21, 29, 42,43,44], which provides genotype data for 2504 unrelated human individuals from 26 subpopulations worldwide [45].
We select markers from chromosome 20, which has been studied in several previous papers [5, 41, 46]. This chromosome spans approximately 63 million DNA base pairs [42]. Within the 1KGP phase 3 data released in 2015, 1,739,315 variants are genotyped as biallelic SNPs, out of which 1,589,038 (91.4%) have a minor allele frequency (MAF) below 5%. These are called rare or low-frequency variants [37, 47].
After selecting the biallelic SNPs, we retain the markers that are present both in the 1KGP chromosome 20 data set and on the Illumina bead-chip products Infinium OmniExpress-24 Kit and Infinium Omni2.5-8 Kit. Intersecting the markers from the Illumina arrays with the markers genotyped in the 1KGP for chromosome 20 yields two overlapping experimental maps. The map derived from the OmniExpress bead-chip consists of 17,791 biallelic markers, out of which 17,015 are shared with the map derived from the Omni2.5 bead-chip, which lists 52,697 markers in total (see Fig. 2a). With respective densities of 1 SNP per 3.5 kb and 1 SNP per 1.19 kb, we hence obtain low-density (LD) and high-density (HD) marker sets [38].
For simulating imputation, the 2504 unrelated human samples are randomly split into two populations, regardless of their subpopulation: a reference panel (PNL) with 2264 individuals and a study population (STU) with 240 individuals, giving a PNL:STU size ratio of about 10:1 as in [3]. For the classical imputation scenario, we delete in the STU population the genotype data for the markers present only in the HD data set and keep the 17,015 markers common to both maps fully genotyped at LD. In the pooling scenario, we keep all 52,697 HD markers in STU and simulate pooled genotypes as described hereafter. In PNL, we keep the genotype data for all LD and HD markers in both scenarios. Figure 1 gives an overview of the experimental steps carried out in both scenarios.
Figure 2 illustrates the composition of the different data sets before imputation. In both scenarios, after imputation, the study population is eventually fully genotyped at the HD markers.
Group testing design for simulating pooled genotyping from microarray data
The study population is further processed with the pooling simulation, which yields missing genotypes scattered throughout the data.
Based on the DNA Sudoku study [9], we define the critical parameters for optimizing the design: the number of individuals per block, the number of intersecting pools per block that hold each pair of samples, and the number of pools that hold any given sample. These parameters and the pooling algorithm can be mathematically formulated as a binary \(k \times m\) matrix M, with k rows representing pools and m columns representing samples. M is called the design matrix of the scheme.
NORB parameters and design matrix We choose \(n_B = 16\) samples for the block size, with pools of degree 4, a sample weight of 2, and a pool intersection value of 1. Hence, the number of pools per block equals 8. The reduction factor \(\rho\) is 2; equivalently, the number of individuals is twice the number of pools within a block.
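As an illustration, the sketch below builds the \(8 \times 16\) binary design matrix implied by these parameters, with the 16 samples laid out on a \(4 \times 4\) grid so that pools \(P_1\)–\(P_4\) are the rows and \(P_5\)–\(P_8\) the columns, matching the square representation introduced in the next paragraph. The function name and the grid ordering are illustrative only.

```python
import numpy as np

def norb_design_matrix(n_rows=4, n_cols=4):
    """Binary design matrix M (pools x samples) for one NORB pooling block.

    Rows 0..3 of M are the row-pools (P1..P4), rows 4..7 the column-pools
    (P5..P8); sample i * n_cols + j sits at the intersection of row-pool i
    and column-pool j.
    """
    n_samples = n_rows * n_cols
    M = np.zeros((n_rows + n_cols, n_samples), dtype=int)
    for i in range(n_rows):
        for j in range(n_cols):
            s = i * n_cols + j
            M[i, s] = 1              # sample s belongs to row-pool i
            M[n_rows + j, s] = 1     # ... and to column-pool j
    return M

M = norb_design_matrix()
assert M.sum(axis=1).tolist() == [4] * 8    # pool degree: 4 samples per pool
assert M.sum(axis=0).tolist() == [2] * 16   # sample weight: 2 pools per sample
```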
Square representation of a block We introduce a graphical representation of a pooling block with genotypes at a given SNP, according to the chosen parameters. As described by Ngo and Du in their taxonomy of nonadaptive pooling designs [25], a simple transversal design can be represented as a grid. The rows and columns \(\{P_{t}\}_{1 \le t \le T}\) are the pools, and the cells \(\{G_{i}\}_{1 \le i \le n_B}\), with \(G_i \in \{ -1, 0, 1, 2 \}\), are the individuals' genotypes, interpreted in order as 'missing genotype', 'homozygous for the reference allele', 'heterozygous', and 'homozygous for the alternate allele'.
|  | P\(_{5}\) | P\(_{6}\) | P\(_{7}\) | P\(_{8}\) |
|---|---|---|---|---|
| P\(_{1}\) | G\(_{1}\) | G\(_{2}\) | G\(_{3}\) | G\(_{4}\) |
| P\(_{2}\) | G\(_{5}\) | G\(_{6}\) | G\(_{7}\) | G\(_{8}\) |
| P\(_{3}\) | G\(_{9}\) | G\(_{10}\) | G\(_{11}\) | G\(_{12}\) |
| P\(_{4}\) | G\(_{13}\) | G\(_{14}\) | G\(_{15}\) | G\(_{16}\) |
Pooling is simulated on the genotypes in the study population (STU data set) for imputation scenario 2 (pooled HD data). STU was created to have a size that is a multiple of the chosen block size, i.e. STU has size \(B_{stu} \times n_B = 15 \times 16\), where \(B_{stu}\) is the number of pooling blocks formed from the study population. At every SNP, we implement the pooling simulation as described hereafter.
Encoding and decoding rules With the design we have selected for our experiment, simulating pooling on items involves an encoding step followed by a decoding step. Two examples of genotype pooling simulation are shown in Fig. 3a, b.
First, the encoding step simulates the genotype outcome for a pool from the combination of the individual genotypes in it. SNP chip genotyping detects which alleles are present in the sample at each SNP on the chip (0 for the reference allele, 1 for the alternate allele). In the simulation of the pooling encoding step, this means that a pool has genotype 0 (respectively 2) if and only if all samples forming the pool have the homozygous genotype 0 (resp. 2). Any other combination of samples with genotypes 0, 1, or 2 results in a heterozygous pool with genotype 1.
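The encoding rule can be written compactly; the following is a minimal sketch (the helper name is illustrative), operating on the 0/1/2 genotype coding introduced above:

```python
def encode_pool(sample_genotypes):
    """Simulated readout of one pool from the genotypes (0/1/2) of its samples.

    The pool is homozygous (0 or 2) only if every contributing sample is;
    any other mixture yields a heterozygous pool (genotype 1).
    """
    observed = set(sample_genotypes)
    if observed == {0}:
        return 0
    if observed == {2}:
        return 2
    return 1

assert encode_pool([0, 0, 0, 0]) == 0
assert encode_pool([0, 1, 0, 0]) == 1   # a single carrier makes the pool heterozygous
assert encode_pool([2, 2, 2, 2]) == 2
```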
In the second step, decoding the individual genotypes from their intersecting pools is done under the assumption that there was no genotyping error. In our design, every sample is at the intersection of two pools. If both pools have genotype 0 (or 2), the sample has genotype 0 (or 2). Also, since a pool has a homozygous genotype if and only if all contributing samples have that homozygous genotype, any individual at the intersection of a homozygous pool and a heterozygous one must be homozygous. In the case of a pooling block with exactly one carrier of the alternate allele (Fig. 3a), if exactly two pools have the heterozygous genotype 1 (pools \(P_3\) and \(P_5\) in Fig. 3a), we deduce that the individual at their intersection carries the alternate (or reference) allele, but we cannot state whether two copies of this allele are carried (genotype 2, or 0 in the symmetrical case where the reference allele is the minor one) or only one (genotype 1). In this case, ambiguity arises at decoding; in other words, the genotype is reported as missing. To fully assess the probable state of the genotypes of each sample in a pooling block, not only the pools in which a sample is included have to be considered but the full block. We propose to use the constraints imposed by the outcome of each pool to estimate the genotype distribution for any undecoded sample, including the distribution between heterozygote and homozygote for decoded carriers.
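The pairwise part of this decoding rule can be sketched as follows (helper name illustrative); the additional block-level reasoning, e.g. the single-carrier case of Fig. 3a, is deferred to the probability estimation described below:

```python
def decode_sample(r, c):
    """Decode one sample from the genotypes (0/1/2) of its two intersecting pools.

    Returns 0/1/2 when the genotype is fully determined and -1 (missing)
    when the intersection of two heterozygous pools leaves it ambiguous.
    """
    if {r, c} == {0, 2}:
        raise ValueError("inconsistent pool pair, never observed")
    if r == 0 or c == 0:
        return 0     # a REF-homozygous pool forces all its samples to genotype 0
    if r == 2 or c == 2:
        return 2     # symmetric rule for ALT-homozygous pools
    return -1        # both pools heterozygous: genotype reported as missing
```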
Figure 3c, d show some results we obtain after simulating pooling and imputation at two markers for \(4 \times 16 = 64\) samples in the study population: Fig. 3c is an example for a common variant and Fig. 3d illustrates the case of a rarer variant. In practice, genotyping pools of samples on microarrays requires computational processing of the decoding step only.
Estimation of the genotype probabilities from combinatorial information
At the block level, the pooling scheme implies possible and impossible latent genotypes for a given sample. For example, a decoded block comprising twelve REF-homozygous and four missing genotypes, as in Fig. 3b, imposes the constraint that at least two out of the four missing samples are minor allele carriers (i.e. genotype in \(\{1, 2\}\)), whereas the remaining missing samples can have any genotype in \(\{0, 1, 2\}\). Consequently, within these four unknown sample states, the probability of an actual REF-homozygous genotype is lower than in a case where the missingness pattern is independent of the actual genotype value, as is typically the case in imputation from low to higher density. By proceeding in a similar way for any observable pooling block, we propose to explicitly model the expected distribution of each incompletely decoded genotype.
Genotype representations
In this paper, beyond the G representation introduced previously, we use the genotype probabilities (GP) format, which expresses any genotype as a probability simplex over the three zygosity categories. G and GP are equivalent representations: for example, if all genotype states are equally likely to be observed, the genotype probability is \(GP = (0.33, 0.33, 0.33)\) (i.e. \(G=-1\)). A determined genotype has one of the following probabilities: \(GP = (0, 0, 1)\), (0, 1, 0), or (1, 0, 0) (i.e. \(G=2\), \(G=1\), or \(G=0\)).
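A minimal sketch of the correspondence between the two representations (names and the 0.999 threshold are illustrative only):

```python
GP_FROM_G = {
    0: (1.0, 0.0, 0.0),
    1: (0.0, 1.0, 0.0),
    2: (0.0, 0.0, 1.0),
    -1: (1 / 3, 1 / 3, 1 / 3),   # missing genotype: uniform over the three states
}

def g_to_gp(g):
    return GP_FROM_G[g]

def gp_to_g(gp, threshold=0.999):
    """Collapse a GP triplet back to G, keeping -1 for undetermined entries."""
    best = max(range(3), key=lambda k: gp[k])
    return best if gp[best] >= threshold else -1
```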
Statistical formulation of the genotype decoding problem
We introduce hereafter the notation and definitions which frame the pooling procedure as a statistical inference problem with missing data. In this framework, we later present an algorithm for estimating the most likely genotype at any missing entry, conditioned on the configuration of the pooling block. Our strategy enumerates the genotype combinations for the missing data that are consistent with the data observed from the pooling blocks, and uses this enumeration to compute an estimate of the latent genotype frequencies.
Model distribution for the genotypes Let the genotype G be a random variable with three outcomes 0, 1, and 2. The genotype probabilities \(\pi\) are expressed as
$$\begin{aligned} \pi = ( p_{0}, p_{1}, p_{2} ) \end{aligned}$$
(1)
where \(( p_0, p_1, p_2 )\) are the probabilities for the genotype 0, 1, and 2 at a given variant for a given sample. Therefore, we model the complete (not pooled) genotype data within a pooling block as an array \({{\textbf {x}}}\) of size \(16 \times 3\) (\(n_B = 16\)) where each data point \(x_i\) is a probability simplex \([ p_{0i}, p_{1i}, p_{2i} ]\). Each probability simplex is an indicator vector, since the genotype is fully known.
$$\begin{aligned} {{\textbf {x}}}&= ( x_1, x_2, \ldots , x_{16} ) \end{aligned}$$
(2)
$$\begin{aligned} \forall i \in [1, 16] \qquad x_i&= \begin{bmatrix}p_{0i} \\ p_{1i} \\ p_{2i}\end{bmatrix} \end{aligned}$$
(3)
Since the samples are randomly assigned to pooling blocks, the genotype probabilities \(x_i\) are independent from each other.
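For concreteness, the complete data \({{\textbf {x}}}\) of Eqs. (2) and (3) can be assembled as a \(16 \times 3\) array of indicator simplices; a minimal numpy sketch (function name illustrative):

```python
import numpy as np

def block_to_x(genotypes_g):
    """Complete data x for one block: an n_B x 3 array of indicator simplices
    built from fully known genotypes G in {0, 1, 2}."""
    x = np.zeros((len(genotypes_g), 3))
    x[np.arange(len(genotypes_g)), genotypes_g] = 1.0
    return x

# e.g. a block with a single heterozygous carrier among 16 samples
x = block_to_x([0] * 15 + [1])
assert x.shape == (16, 3) and np.allclose(x.sum(axis=1), 1.0)
```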
Furthermore, we denote by \({{\textbf {z}}}\) the prior probabilities for genotypes that follow pooling and pool decoding. \({{\textbf {z}}}\) is another list of probabilities, where some genotypes are fully decoded, some are fully unrecoverable, and some indicate carrier status without distinguishing between a heterozygous and a homozygous genotype, as in Fig. 3a. The pooled genotypes are represented by
$$\begin{aligned} {{\textbf {z}}}&= ( z_1, z_2, \ldots , z_{16} ), \end{aligned}$$
(4)
$$\begin{aligned} \forall i \in [1, 16] \qquad z_i&= \begin{bmatrix}{\tilde{p}}_{0i} \\ {\tilde{p}}_{1i} \\ {\tilde{p}}_{2i}\end{bmatrix} \end{aligned}$$
(5)
The data \(z_i\) for each cell of a pooling block is modelled with the simplex of genotype probabilities \(( {\tilde{p}}_{0i}, {\tilde{p}}_{1i}, {\tilde{p}}_{2i} )\).
Mapping of the data space We call the data for the full genotypes \({{\textbf {x}}}\), represented as a list of genotype probabilities for each individual in the block, a layout. We denote by t the function transforming \({{\textbf {x}}}\) into \({{\textbf {z}}}\). Since several complete layouts \({{\textbf {x}}}\) can give the same result \({{\textbf {z}}}\) after pooling, t is a many-to-one mapping
$$\begin{aligned} t :{\mathcal {X}}&\longrightarrow {\mathcal {Z}} \end{aligned}$$
(6)
$$\begin{aligned} {\mathbf {x}}&\longmapsto {\mathbf {z}} \end{aligned}$$
(7)
where \({\mathcal {X}}\) is the space of complete observations, and \({\mathcal {Z}}\) is the space of decoded pooling blocks.
Given the priors \(z_i\) for any sample, the problem to solve is to estimate a posterior probability distribution \({\hat{\pi }}_i = ( {\hat{p}}_{0i}, {\hat{p}}_{1i}, {\hat{p}}_{2i} )\) for the three genotypes \(\{ 0,1,2 \}\) in any individual, i.e. recovering a probability distribution from which the true genotype \(x_i\) can be said to be sampled, as a probabilistic inversion of t.
Inherent to the chosen NORB design, the set of observable \({{\textbf {z}}}\) is finite and constrained. Moreover, any individual genotype \(z_i\) depends not only on the genotypes of the pools intersecting it, but also on all other pools in the block. Therefore, any sample \(z_i\) in the full set of probabilities \({{\textbf {z}}}\) representing the pooling block can be parametrized by the pool configuration and the possible intersections.
Valid layouts in block patterns Let \(\psi\) be the pooling block pattern described as \(\psi = (n_{G_{rows}}, n_{G_{columns}})\), where \(n_{G_{rows}}\) (resp. \(n_{G_{columns}}\)) are the counts of row-pools (resp. column-pools) with encoded genotypes (0, 1, 2). For example, in Fig. 3a, the 8 pools can be described with the block pattern \(\psi = ( (3, 1, 0), (3, 1, 0) )\) since 3 row-pools have genotype 0, 1 has genotype 1, and none has genotype 2, and the same holds for the column-pools. In Fig. 3b, the pooling pattern is \(\psi = ( (2, 2, 0), (2, 2, 0) )\).
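Computing \(\psi\) from the encoded pool genotypes of a block is straightforward; a small sketch (function name illustrative):

```python
from collections import Counter

def block_pattern(row_pools, col_pools):
    """Pattern psi = (n_G_rows, n_G_cols): counts of pools with genotype
    0, 1 and 2 among the row-pools and the column-pools of one block."""
    count = lambda pools: tuple(Counter(pools).get(g, 0) for g in (0, 1, 2))
    return (count(row_pools), count(col_pools))

# three REF-homozygous pools and one heterozygous pool in each direction,
# as in the pattern discussed for Fig. 3a (pool ordering chosen arbitrarily)
assert block_pattern([0, 0, 0, 1], [0, 1, 0, 0]) == ((3, 1, 0), (3, 1, 0))
```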
We denote \({\mathcal {Z}}_{\psi }\) the space of decoded pooling blocks showing the pattern \(\psi\), and correspondingly \({\mathcal {X}}_{\psi }\) the space of the set of valid layouts for \(\psi\). A layout is said to be valid with respect to the pattern \(\psi\) if applying pooling simulation to \({{\textbf {x}}}\) lets us observe \(\psi\) from \({{\textbf {z}}}\). In other words, the valid layouts are
$$\begin{aligned} {\mathcal {X}}_{\psi } = \left\{ {{\textbf {x}}} \in {\mathcal {X}} : t({{\textbf {x}}}) \in {\mathcal {Z}}_{\psi } \right\} . \end{aligned}$$
(8)
Additional file 1 shows examples of valid and invalid layouts for the same observed pooling pattern.
Parametrizing the data mapping Let \((r, c) \in \{0, 1, 2\}^2\) be the genotype pair of two intersecting pools, such that any \(z_i\) is conditioned on (r, c). We note that if \((r, c) = (1, 1)\), the decoding of the intersected individual genotype \(z_i\) is indeterminate. In the other cases, the intersected genotype is fully recoverable, as with (0, 1) (resulting in \(z_i = [1,0,0]^{\top }\)). The pair \((r, c) = (0, 2)\) is not consistent with any genotype and is therefore never observed.
Based on these notations, we seek to approximate the most likely genotype probabilities \(\{ {\hat{\pi }}_{i} \}\) for the missing data, consistent with \(x_i\), by inversion sampling of the priors \(z_i\) with respect to \(t_{\psi }\). That is to say,
$$\begin{aligned} Pr(x_i | \psi ; r, c)&= t_{\psi }^{-1} \big ( Pr(z_i | \psi ; r, c) \big ). \end{aligned}$$
(9)
Computing the estimate of the posterior for the missing outcomes as \({\hat{\pi }} := \overline{ {\hat{\pi }}_{i} }\) in a pooling block with pattern \(\psi\) by inverse transform sampling is a numerical problem that can be solved as a maximum likelihood estimation (MLE) based on the enumeration of all valid layouts.
Maximum likelihood type II estimates
We propose to partition \({\mathcal {Z}}\) into \(\{ {\mathcal {Z}}_{\psi } \}_{\psi \in \Psi }\). This makes it possible to marginalize the likelihood over \(\psi , r, c\) and lets the problem be solved as a series of separate probability simplex MLE problems in each sample subspace \({\mathcal {Z}}_{\psi }\). The marginal likelihood is sometimes referred to as the type II likelihood (ML-II), and its maximization (MMLE) as an empirical Bayes method. We present as supplementary information a method for computing \({\hat{\pi }}\) by maximizing the marginal likelihood of any observed pattern \(\psi\) and deriving genotype posterior probability estimates (see Additional file 1). The MMLE example is also well suited for introducing how we conduct a systematic and comprehensive enumeration of the valid layouts for a given pattern \(\psi\), as sketched below.
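The enumeration can be illustrated with a brute-force sketch that completes the missing cells over \(\{0, 1, 2\}\) and keeps only the completions whose re-encoded pools reproduce the observed pool genotypes; function names and the example block are illustrative only (see Additional file 1 for the actual procedure):

```python
from itertools import product

def encode_pool(samples):
    """Pool readout rule (same as in the encoding sketch above)."""
    observed = set(samples)
    return 0 if observed == {0} else 2 if observed == {2} else 1

def valid_layouts(block, row_pools, col_pools):
    """Enumerate the complete layouts x consistent with one decoded block z.

    `block` is a 4x4 list of decoded genotypes (0/1/2, -1 if missing), and
    `row_pools`/`col_pools` hold the observed pool genotypes. A completion
    is valid if re-encoding every row- and column-pool reproduces the
    observed pool genotypes, and hence the block pattern psi.
    """
    missing = [(i, j) for i in range(4) for j in range(4) if block[i][j] == -1]
    layouts = []
    for fill in product((0, 1, 2), repeat=len(missing)):
        x = [row[:] for row in block]
        for (i, j), g in zip(missing, fill):
            x[i][j] = g
        rows_ok = all(encode_pool(x[i]) == row_pools[i] for i in range(4))
        cols_ok = all(encode_pool([x[i][j] for i in range(4)]) == col_pools[j]
                      for j in range(4))
        if rows_ok and cols_ok:
            layouts.append(x)
    return layouts

# a block matching the pattern psi = ((2, 2, 0), (2, 2, 0)) discussed above:
# twelve decoded REF-homozygotes and four undetermined cells
block = [[0, 0, 0, 0], [0, 0, 0, 0], [0, 0, -1, -1], [0, 0, -1, -1]]
layouts = valid_layouts(block, row_pools=[0, 0, 1, 1], col_pools=[0, 0, 1, 1])
```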
Self-consistent estimations
Motivation and general mechanism As a natural extension of MMLE in the presence of incomplete data [48], we implemented a method for estimating the unknown genotype probabilities inspired by the EM algorithm. The following procedure is applied for each set of parameters \(\psi , r, c\).
We initialize the prior estimate of any entry in the block to \(z_i = [0.25, 0.5, 0.25]^{\top }\). This choice is based on the assumption that, without information about their frequencies, both alleles at a marker are equally likely to be carried.
The algorithm iteratively updates \({\tilde{\pi }} := \overline{ {\tilde{z}}_{i} }\) by alternating between computing the likelihood of the valid layouts using the prior estimate (E step) and deriving the posterior estimate from the frequencies of the genotypes aggregated across the data completions (M step). The M step can incorporate a rescaling of the genotype proportions that we designate as heterozygote degeneracy resampling. Eventually, the E and M steps produce a self-consistent estimate \({\hat{\pi }}\) [49] (see Additional file 1 for a calculation example).
Heterozygote degeneracy arises from the internal representation we use for the genotypes under the pooling process. Indeed, the two heterozygous states carrying the phased allele pairs (0, 1) or (1, 0) are collapsed into a single heterozygous genotype \(GP = ( 0, 1, 0 )\) (or equivalently \(G=1\)). In a way analogous to, for example, the particle paths in particle filter models, we define this collapsing as heterozygote degeneracy. For instance, a layout involving 4 heterozygous genotypes should be subdivided into \(2^4\) micro layouts combining the allele pairs (0, 1) and (1, 0). More generally, the heterozygote degeneracy has order \(2^{n_1}\), where \(n_1\) is the number of items having genotype 1 in the layout. In practice, enumerating these micro layouts would substantially increase the computation time. Instead, we include the higher probability for heterozygotes internally in the model, taking the degeneracy into account when normalizing, and again when producing the final likelihoods to be used in the imputation process, where a uniform distribution is the expected structure for data without any informative prior.
Equations of the optimization problem We proceed in a way identical to MMLE for enumerating all possible completions of the \(n_m\) unknown genotypes. At each iteration m, the E step first calculates the marginal likelihood of every layout by sampling its genotypes from \({\tilde{\pi }}^{(m-1)} | \psi\). The mixing proportion \({\mathbb {E}} [ {{\textbf {x}}} | {{\textbf {z}}}, {\tilde{\pi }}, \psi ]^{(m)}\) of each layout is then computed from all aggregated likelihoods and for any \({{\textbf {z}}} \in {\mathcal {Z}}_{\psi }\). A breakdown of the formula for \({\mathbb {E}} [ {{\textbf {x}}} | {{\textbf {z}}}, {\tilde{\pi }}, \psi ]^{(m)}\) is provided in Additional file 1.
The M step recomputes the genotype frequencies \(( {\tilde{p}}_0, {\tilde{p}}_1, {\tilde{p}}_2 )\) by applying MLE to the likelihoods calculated at the E step.
$$\begin{aligned} \tilde{p_{k}}^{(m)}&= \frac{ \sum \limits _{{{\textbf {x}}} \in {\mathcal {X}}} n_k \, {\mathbb {E}} [ {{\textbf {x}}} | {{\textbf {z}}}, {\tilde{\pi }}, \psi ]^{(m)} }{ \sum \limits _{k} \sum \limits _{{{\textbf {x}}} \in {\mathcal {X}}} n_k \, {\mathbb {E}} [ {{\textbf {x}}} | {{\textbf {z}}}, {\tilde{\pi }}, \psi ]^{(m)} }, \end{aligned}$$
(10)
$$\begin{aligned}&k \in \{0, 1, 2\} \end{aligned}$$
(11)
where \(n_k\) is the count of genotype k observed in the layout \({{\textbf {x}}}\).
Since we do not compute the distribution of the genotype frequencies from the allelic dosage, we introduce a resampling step after the M step that artificially accounts for the heterozygote degeneracy. We therefore use arbitrary weights \(w = ( w_0, w_1, w_2 ) = ( 1,2,1 )\) for rescaling \(( {\tilde{p}}_0,{\tilde{p}}_1, {\tilde{p}}_2 )\). If we do not account for the heterozygote degeneracy, we set these weights to \(w = ( 1,1,1 )\).
$$\begin{aligned} {\tilde{p}}_{k}^{(m) \prime }&= \frac{ w_k \, {\tilde{p}}_{k}^{(m)} }{ {\tilde{p}}_{k}^{(m-1)} }, \ k \in \{0, 1, 2\} \end{aligned}$$
(12)
$$\begin{aligned} {\tilde{p}}_{k}^{(m) \prime \prime }&= \frac{ {\tilde{p}}_{k}^{(m) \prime } }{ \sum \limits _{k} {\tilde{p}}_{k}^{(m) \prime } } \end{aligned}$$
(13)
$$\begin{aligned} {\tilde{\pi }}^{(m)}&= ( {\tilde{p}}_{0}^{(m) \prime \prime }, {\tilde{p}}_{1}^{(m) \prime \prime }, {\tilde{p}}_{2}^{(m) \prime \prime } ). \end{aligned}$$
(14)
At the last iteration, when the algorithm has converged, the final estimate of \({\tilde{\pi }}\) is computed from a modified version of rescaling, where we compensate for the artificial upscaling used in the previous steps
$$\begin{aligned} \hat{p_k}^{(m)} | \psi&= \frac{ (1/w_k) \, {\tilde{p}}_{k}^{(m)} }{ \sum \limits _{k} (1/w_k) \, {\tilde{p}}_{k}^{(m)} }, \ k \in \{0, 1, 2\} \end{aligned}$$
(15)
$$\begin{aligned}&{\hat{\pi }} | \psi = ( \hat{p_0}^{(m)}, \hat{p_1}^{(m)}, \hat{p_2}^{(m)} ) \end{aligned}$$
(16)
Such self-consistent iterative methods provide local distribution estimates for the undecoded genotypes at the pooling block level, based on information from the pooling design. They are independent of the overall MAF in the population because of the choice we made for the prior, and do not take into account the genetic variations specific to the population and its structural traits.
Imputation for retrieving missing genotypes
For each sample in the study population, we use the aforementioned estimated genotype probabilities \({\tilde{\pi }} | \psi , r, c\) as prior beliefs \(\theta _G\) in imputation. Figure 2 summarizes the experimental settings for both this scenario and the classical one. We compare the imputation performance on pooled SNP genotype data of two population-based algorithms, representing respectively the haplotype-clustering approach and the coalescent principle.
A haplotype clustering method: Beagle
In this work, Beagle is used in its 4.0 version and with the recommended default parameters. This software version is the best-performing release that provides the features needed for this study. Beagle 5.0 is available, but it does not support the log-scaled genotype probability (GL) data type as input.
We use the HapMap GRCh37 genetic map suggested by Beagle developers and consistent with the genome assembly underlying the version of the 1KGP data used [38]. In practice though, we have not noticed clear deterioration when conducting imputation on pooled data without providing any genetic map.
For the classical imputation scenario, we first verify that equivalent results and performance are obtained whether Beagle is run on genotypes in GT format or in GL format. In the first case, unassayed HD markers are set to ./. and in the latter to \((-0.481, -0.481, -0.481)\). As advised in the documentation, we impute the entire STU population in the same batch.
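As a point of reference, the GL filler value quoted above corresponds to a flat genotype probability of 0.33 for each state, expressed on the log10 scale used by the VCF GL field:

```python
import math

# flat prior GP = (0.33, 0.33, 0.33) on the log10 scale of the GL field
print(round(math.log10(0.33), 3))   # -0.481
```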
In the pooling scenario, we use the same reference panel, but deliberately run Beagle sample-wise to avoid the very specific genetic structure of pooled data being used as template haplotypes. Preliminary testing showed a clear deterioration in results when this was not done.
A coalescence-based method for haplotype phasing and imputation: prophaser
The original version of MACH did not support GL as input study data, in contrast to IMPUTE2. The main motivation for writing the Prophaser [36] code was to implement this feature with full control of e.g. cutoff thresholds for close-to-zero probabilities. The reference panel is read from GT data.
Prophaser phases and imputes unassayed markers sample-wise and independently of the rest of STU. Whereas MACH and IMPUTE2 include strategies for selecting a subset of reference samples for computational efficiency, we decided to consistently use the full reference panel as templates in a single-iteration estimation. Hence, Prophaser uses all reference haplotypes as templates.
Evaluation of the experimental design
We quantified the performance of the two genotyping scenarios with the concordance rate and cross-entropy. In both cases, the original data from 1KGP in the study population were used as the ground truth, and the predicted data were the imputed genotypes in the same study population.
Concordance The most widely used imputation quality metric is the genotype concordance, which counts the number of matches between the true and the best-guess imputed genotypes. A homozygous genotype imputed as a heterozygote (or conversely) is counted as a half mismatch, and a homozygote imputed as the opposite homozygote as a full mismatch. Concordance sometimes appears in its complementary formulation as the discordance rate [3]. Several publications refer to the concordance rate directly as the genotype accuracy rate [39] or as imputation accuracy [32], whilst the discordance rate is designated as the imputation error rate [33, 38].
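One way to score a single genotype in line with this definition is sketched below (function name illustrative; the best-guess genotype is taken as the most probable GP category):

```python
def discordance(true_g, imputed_gp):
    """Per-genotype discordance between the true genotype (0/1/2) and the
    best-guess imputed genotype: 0 for a match, 0.5 for a heterozygote/
    homozygote mismatch, 1 for opposite homozygotes."""
    best_guess = max(range(3), key=lambda k: imputed_gp[k])
    return abs(best_guess - true_g) / 2.0

assert discordance(1, (0.70, 0.28, 0.02)) == 0.5   # half mismatch
assert discordance(2, (0.90, 0.08, 0.02)) == 1.0   # full mismatch
assert discordance(0, (0.99, 0.01, 0.00)) == 0.0   # concordant
```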
Cross-entropy As used in the studies presenting the successive Beagle software versions, accuracy in the sense of concordance does not quantify how similar the imputed genotype distributions are to the true genotypes. This has already been pointed out by e.g. Nothnagel et al. [27]. As an example, consider the two following cases: (a) a true genotype \(G = 1\) imputed with \(GP = ( 0.56, 0.42, 0.02 )\), and (b) a true genotype \(G = 1\) imputed with \(GP = ( 0.7, 0.28, 0.02 )\). Using the best-guess genotype definition, both genotypes are imputed as \(G = 0\) and hence counted as discordant, but prediction (a) is “weaker” since it has a lower best-guess likelihood (\(0.56 < 0.7\)). In that sense, the error made by prediction (a) should be considered less severe than that of prediction (b), even though both are wrong. Therefore, we introduce the cross-entropy metric \(\chi\) as a divergence measure for the predicted genotype distribution. The cross-entropy we propose is defined as in Eq. (17) at the j-th marker for N imputed individuals.
$$\begin{aligned} \chi _j = \frac{ \sum \limits _{i=1}^{N} \left( - \sum \limits _{g=0}^{2} Pr(G_{ij}=g) \ log({\mathcal {L}}_{ijg}) \right) }{ N } \end{aligned}$$
(17)
where \({\mathcal {L}}_{ijg}\) is the genotype likelihood (or posterior imputed genotype probability) for the genotype state g at the j-th marker for the i-th individual. For low-probability genotypes, we apply a cut-off and use \(log(10^{-5})\) whenever the genotype probability is below \(10^{-5}\).
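A minimal sketch of Eq. (17) for one marker, with the true genotypes given as indicator GP vectors and the \(10^{-5}\) cut-off applied before taking the logarithm (function name illustrative):

```python
import math

def cross_entropy(true_gps, imputed_gps, floor=1e-5):
    """Cross-entropy chi_j of Eq. (17) for one marker, averaged over the
    N individuals; probabilities below `floor` are clipped before the log."""
    chi = 0.0
    for p_true, p_imp in zip(true_gps, imputed_gps):
        chi -= sum(p_true[g] * math.log(max(p_imp[g], floor)) for g in range(3))
    return chi / len(true_gps)

# the example of cases (a) and (b) above, true genotype G = 1:
print(cross_entropy([(0, 1, 0)], [(0.56, 0.42, 0.02)]))  # (a): ~0.87
print(cross_entropy([(0, 1, 0)], [(0.70, 0.28, 0.02)]))  # (b): ~1.27, penalized more
```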
Computational tools
Due to their computational cost, the imputation algorithms were run on compute servers. The computing resources were provided by SNIC through the Uppsala Multidisciplinary Center for Advanced Computational Science. This infrastructure provides nodes (compute servers) with two 10-core Xeon E5-2630 V4 or two 8-core Xeon E5-2660 processors running at 2.2 GHz, and 128 to 512 GB of memory.