Iam hiQ—a novel pair of accuracy indices for imputed genotypes

Background Imputation of untyped markers is a standard tool in genome-wide association studies to close the gap between directly genotyped and other known DNA variants. However, high accuracy with which genotypes are imputed is fundamental. Several accuracy measures have been proposed and some are implemented in imputation software, unfortunately diversely across platforms. In the present paper, we introduce Iam hiQ, an independent pair of accuracy measures that can be applied to dosage files, the output of all imputation software. Iam (imputation accuracy measure) quantifies the average amount of individual-specific versus population-specific genotype information in a linear manner. hiQ (heterogeneity in quantities of dosages) addresses the inter-individual heterogeneity between dosages of a marker across the sample at hand. Results Applying both measures to a large case–control sample of the International Lung Cancer Consortium (ILCCO), comprising 27,065 individuals, we found meaningful thresholds for Iam and hiQ suitable to classify markers of poor accuracy. We demonstrate how Manhattan-like plots and moving averages of Iam and hiQ can be useful to identify regions enriched with less accurate imputed markers, whereas these regions would by missed when applying the accuracy measure info (implemented in IMPUTE2). Conclusion We recommend using Iam hiQ additional to other accuracy scores for variant filtering before stepping into the analysis of imputed GWAS data. Supplementary Information The online version contains supplementary material available at 10.1186/s12859-022-04568-3.

Illumina Inc., San Diego, CA. The underlying chemistry differs but both array types can be used to ascertain genotypes in a similar fashion [2]. In contrast to the more expensive and error prone new generation sequencing technologies, the number of genotyped variants ranges from 300,000 to 4 million. Array-based markers are supposed to tag the genomic region in their vicinity, but represent only a small proportion of all known DNA variants. Furthermore, these variants are not a random selection but have been chosen according to criteria such as minor allele frequency (MAF), location in exons or blocks of linkage disequilibrium or putative associations with certain disease.
Imputation methods and strategies have been developed and are now a standard tool in GWAS to close the gap between genotyped and existing DNA variants [3][4][5]. These methods transfer information of DNA structure from one or several reference panels with high marker density (e.g. 1000 Genomes Project phase 3 [6] or Haplotype Reference Consortium (HRC) [7]) to the genotyped study samples [4]. Most imputation methods estimate a-posteriori genotype probabilities (referred to as dosages, ranging from 0 to 1) for each untyped variant and each individual in the sample of interest. The resulting increase of variant density in the study sample improves the genomic coverage and can increase the power to identify genomic variants associated with a trait [8]. Imputation further has the potential that an identified associated marker is located closer to a true risk locus; it facilitates fine mapping of causal variants and is essential for meta-analyses of GWAS, particularly when different genotyping arrays have been used for multiple studies [9]. However, imputation requires advanced statistical methods for data analysis and may introduce extra uncertainty in interpreting findings. Further, only DNA variants that have previously been genotyped in the used reference panel can be imputed [4,10].

Known accuracy measures
It is important to evaluate the quality of imputation, e.g. to exclude poorly imputed variants from statistical analysis. Several quality indices have been developed and are routinely applied [4,5,18]. These comprise inter alia the squared correlation r 2 between the true and imputed dose of an allele across all imputed samples (MaCH r 2 , Minimac or Beagle r 2 ) or IMPUTE2's info.
All r 2 measures can be derived from a-posteriori allele probabilities without knowledge of the true allele dose, but only if the allele probabilities are well calibrated and MAF is not too low. The power of an allelic test with N samples and imputed alleles is approximately equal to the power of the same test with r 2 N samples and known alleles, in case of a binary trait. Differences among the known r 2 measures are discussed elsewhere [4]. The commonly used info is defined as the proportion of statistical information on the population allele frequency in the imputed genotypes, relative to "known" genotypes [5]. If the Hardy-Weinberg disequilibrium (HWE) holds, info equalizes to Minimacs r 2 . Hence, r 2 -based measures and info are directly related to the power of statistical test of a marker x trait association.
In general, both metrics have preferable characteristics if the a-posteriori genotype probabilities (dosages) are accurately calculated [18]. However, multiple factors can affect imputation accuracy, e.g. sample size, sequencing coverage and haplotype accuracy of the references panel(s), density of the genotyping array, allele frequency and poor LD between genotyped and imputed variants [4]. One can calculate these accuracy measures from dosage files. However, the standard outputs of common imputation programs (e.g. Beagle or IMPUTE2) contain different metrics. Hence, choosing an imputation program binds the user to the metrics provided, although the SNPTEST program offers the option of calculating a measure similar to that of info [19].
We propose a new pair of metrics to depict additional aspects of imputation accuracy also calculable from dosage files. First, we aim to quantify the amount of individual-specific versus population-specific genotype information in the imputed genotypes. Second, we aim to assess the heterogeneity between dosages of a marker across the sample at hand. Both measures can be used to identify markers or regions in which populationspecific genetic information conceal individual-specific information and are therefore less informative for e.g. association testing. These new metrics are not intended as a competitor to established scores, but are intended to support the making of wellfounded decisions in SNP filtering of imputed markers prior to an analysis or in interpretation of results after an analysis.
We calculated this pair of accuracy measures on a series of 27,065 cases and controls gathered by the International Lung Cancer Consortium (ILCCO) to find meaningful thresholds for marker exclusion and compared it with info, because all of the ILCCO samples had previously been imputed with IMPUTE2 applied to a standard 1000 Genomes referent panel. Further, we contrasted the usability of the new measures to info in some simulated data.

Comparison of Iam and hiQ
When applying the novel indices Iam hiQ (as defined in the section Novel accuracy measures) to 517,482 SNPs types with the OncoArray, only a small portion (n = 40,678, 4‰) can be considered as imputed without doubt (Iam = 1 and hiQ = 1). For the majority of SNPs a value between 0.95 and < 1 was assigned for hiQ (9,760,392, ~ 94%), while only 30% (n = 3,243,272 markers) achieved such a large value with respect to Iam. It is worth to mention, that we assigned a reduced value for Iam (from 0.4 to 0.75) to about as many SNPs (n = 3,491,596, 33%). More details are given in Additional file 1: Table S1.
Both components of Iam hiQ, are contrasted in a bubble plot (Fig. 1). The oversized grey bubble in the top right corner represents the vast majority of almost fully-informative markers with Iam ≥ 0.99 and hiQ ≥ 0.99. It can easily be seen that the remaining small minority of not fully accurately imputed markers take advantage of the whole theoretical range for Iam (even negative values). In contrast, hiQ always exceeds 0.4 in the sample at hand, but seems to be sensitive in markers with low values for Iam, whereas lower values of hiQ are only assigned to common markers.

Defining thresholds for marker filtering
In order to exclude less accurately imputed variants from further analysis, one needs to define a meaningful and applicable threshold for any accuracy index. For the measure info threshold values like 0.8 or 0.3 have been proposed, but without sound justification [5,20,21]. It was even proposed to lower the threshold for info in very large samples, as those of the UK Biobank, and still maintain a good ability to detect associations [22].
We applied robust regression (PROG ROBUSTREG of SAS 9.4; cut-off-α = 10 -6 for leverage points, cut-off-multiplier = 5 for outliers [23]) to estimate the expected value of Iam and hiQ and their variance-covariance matrix, assuming a hidden two-dimensional normal distribution (ignoring the upper bounds of the indices). Based on this we derived the 99.9999999% (1-10 −9 ) random region (dashed line in Fig. 1) to define, very conservative and data driven, lower bounds for the two indices, limiting the probability of a false-exclusion to ~ 1/(100•10,439,017) (one hundredth under Bonferroni correction assuming independent markers). The robust mean for Iam was 0.7409 that for hiQ was 0.9885. Restricted to common markers (MAF ≥ 0.1) we achieved similar mean values (Iam: 0.8101, hiQ: 0.9894). The lower bounds of the random region of hiQ were 0.9627 for all markers and 0.9673 for common markers, which is almost identical. In contrast, the lower bounds of the random region of Iam were 0.2553 for all markers and 0.4657 for common markers, demonstrating the influence of MAF, via HWE, on Iam. We decided to use the study specific thresholds of 0.47 for Iam and 0.97 for hiQ to further classify markers of poor accuracy. Because Iam ranges linearly from population informative dosages to fully individual informative dosages, the threshold of ~ 0.5 indicates markers with less than ~ 50% individual-specific genotype minor allele frequency minor allele frequency <1% 10%-30% 30%-50% 5%-10% 1%-5% information (in average across all samples). Such an intuitive interpretation cannot be given for hiQ.
For the majority of 9,094,772 (87.2%) variants, sufficient imputation accuracy was achieved, according to our defined thresholds (see Table 1). Limiting the markers to those with 0.5 < info ≤ 0.8 and info ≥ 0.8, this fraction increases to 95.1% or 99.9%, respectively. In very rare genetic variants (MAF < 1%) the fraction drops to 76.5%. In contrast, only 0.6% of variants meet neither the Iam nor the hiQ criteria. Interestingly, 1,214,620 variants (11.7%) missed only the Iam criteria. The fraction was larger in very rare variants (23.2%) and in variants with info < 0.5 (58.5%), while it was moderate in variants with 0.5 < info ≤ 0.8 (2.5%). Figure 2 presents the accuracy of imputed markers according to Iam hiQ in a Manhattan-like plot, with Iam given in the lower part (blue) and hiQ given in the upper part (red). This plot contains all 10,427,599 SNPs. Regions with massively less accurate imputation can easily be identified, especially by hiQ (red needles). This is for instant the case close to the centromere of chromosomes 1, 2 and 9 (accuracy by chromosome 1 is presented in Additional file 1: Figure S1). However, variants with Iam or hiQ below the defined thresholds can be found in many regions across the whole genome. Massively less accurate imputation can be found upstream the centromere, less distinct downstream the centromere and close to the telomeres, as well as around position 50K (blue icicle). Nevertheless, it is still hard to visually find regions that are enriched with less accurately imputed markers.

Identifying markers and regions of low accuracy
To identify more genomic regions prone to host inaccurate markers we calculated the exponentially weighted moving averages (ewma) of Iam and hiQ (PROC EXPAND of SAS 9.4; smoothing factor 0.1) [23]. We consider variants with an ewma < threshold (0.47 for Iam and 0.97 for hiQ) as belonging to a "hot region" and variants with an ewma < threshold/2 (0.23 for Iam and 0.48 for hiQ) as belonging to a "very hot region". Across the whole genome, we were able to identify 4,603 "hot regions" and 171 "very hot regions" according to Iam HWE , as well as 2,899 "hot regions" according to hiQ. These regions partially overlap or are interconnected. "Hot" and "very hot" Iam-regions contain in total 85,790 variants, only about 8‰ of all variants. "Hot" hiQ-regions contain in total 53,590 variants, only about 5‰ of all variants. However, about 1 out of 3 "hot" or "very hot" regions is very small and contains only one variant. In contrast, 10% of the "hot" Iam-regions and about 20% of either the "very hot" Iam-or the "hot" hiQregions contain more than 20 variants (see Additional file 1: Tables S2-S4). Some of these regions on chromosome 1 are indicated by flames in Fig. 3.

Comparing Iam hiQ with info and certainty
The missing rate for info was about 18% and 0.6% for certainty, in the data set at hand. Iam and hiQ could be determined for all markers (see Additional file 1: Table S5). The correlation between Iam and info was largest (r 2 = 0.944), indicating that both represent comparable information on accuracy. hiQ and certainty correlate only moderate among themselves as do the other measures (r 2 < 0.5) (see Table 2). However, only every second fully-informative SNP (Iam = 1 and hiQ = 1) was assigned a value ≥ 0.8 for info (see Additional file 1: Table S1, red shaded points in Fig. 1), whereas this was the case for less than 2‰ of variants with reduced Iam (0.4 to 0.75). This means that Iam and info nevertheless carry different information on imputation accuracy. Figures for a visual comparison of Iam and info are included in the Additional file 1: Figures S2 and S3. These clearly show that info is less suitable for mapping regions enriched with less accurately imputed genotypes, genome-wide and chromosome-wide.

Usability
We also investigated the usability of the proposed indices in contrast to info by simulation. Usability was considered in terms of discrimination between sufficient and insufficient imputation, rather than in terms of validity of imputation because validity is a characteristic of the imputation routine (e.g. IMPUTE2).
Eight scenarios consisting of two common genotyped tagSNPs flanking one intermediate marker for imputation were defined, differing from each other by the underlying haplotype structure, MAF and LD-patterns were defined. Two scenarios each form a pair (a scene), consisting of a scenario in which the missing marker can be imputed sufficiently/better and one scenario in which the missing marker can be imputed insufficiently/worse. Imputation was performed on 100 randomly drawn samples for each scenario, and accuracy measures were calculated. The ability of an index to discriminate a sufficient from an insufficient scenario (usability) was visually inspected plotting comparative receiver operation curves (one ROC per index) for each scene, and quantified as area under der curve (AUCs) of ROCs. Details on the simulation and the results are given in the Additional file 1. Info and Iam appear to be comparable usable in terms of discrimination between sufficient and insufficient imputation for common SNPs. However, hiQ seems to be superior when MAF of the imputed marker is low.

Suggestion on how to use Iam hiQ
For general use, a threshold of 0.5 for Iam and 0.9 for hiQ seems reasonable to identify markers with low accuracy. However, we do not claim that this recommendation to generally optimal. All variants with values for Iam and e.g. info below the threshold value, as well as all variants with MAF < 1% and a hiQ below the threshold value should be excluded from a data analysis in order to ensure that all aspects of the imputation quality are met. This pre-analysis marker filtering can be extended to all variants in "hot" or "very hot" regions. If association results cannot be replicated across several studies, a low value of Iam indicates a reduced individual-specific information content, even if, for example, info and hiQ imply sufficient power and genotype heterogeneity.

Discussion
Imputation is a cost-effective tool for GWAS to fill gaps of non-genotyped variants instead of whole-genome sequencing for all recruited individuals, since global coverage in genomic information of available arrays with less than 1 million SNPs not exceeds 25% [10,24]. However, imputation accuracy matters. Several accuracy measures have been proposed and implemented in imputation software, unfortunately diverse across platforms. Das et al. [4] favour r 2 , the squared correlation between true and imputed allele dose, because it is tightly related to the power of an allelic test. However, they also emphasized the importance of adequate imputed samples for the r 2 accuracy.
We introduce Iam hiQ, an independent and complementary pair of accuracy measures. Other than e.g. r 2 , Iam quantifies the amount of individual-specific versus population-specific genotype information in a linear manner for each individual before averaging, while hiQ addresses the inter-individual heterogeneity of dosages for a marker across the sample at hand. These new measures are not intended to compete with established scores, but should complement them. We derived meaningful, but study specific thresholds for variant filtering applying Iam hiQ to a large case-control sample at hand. We showed how regions enriched with less accurately imputed genotypes can be identified (computationally and visually), and finally compared Iam hiQ to info, as provided by IMPUTE2. Iam hiQ is simple to interpret: Iam chance of 0 indicates a complete loss of genomic information for a variant. Iam HWE of 0 indicates a reduction However, it has been discussed that any imputation accuracy measures assuming HWE to calculate "expected" genotype counts can be confounded. This was demonstrated for MaCH r 2 [25]. For the proposed method, HWE is solely chosen as anchor point to define pure population informative dosages. One should keep in mind that Iam hiQ is just a tool for quality assurance and not a data analysis module. Thus, slight violations from HWE do not compromise their use, but in case of family data caution is advised. In such cases, one can either apply Iam HWE to founders only, or use Iam chance .
Finally, accuracy measures with non-justified thresholds, as e.g. info, should be applied with caution. This in mind, we derived thresholds for Iam hiQ, in contrast to other measures, from observation on a large sample and follow a traceable logic. Because its direct and linear relationship to the average amount of individual-specific genomic information contained in the dosages of a marker, Iam is easy to interpret. By this, it differs from r 2 , which is approximately equal to the power of the same test with r 2 N samples.
For the presented quality assurance, we calculated Iam hiQ for autosomes only. Extending this to the X and Y chromosome is possible, but the sex of genotyped individual and the position of the variant on the chromosome must be taken into account when calculating a correct HWE distribution. Even an ex post application of Iam hiQ can be useful, particular to explain whether missed replication of an observed markerphenotype association is due to inaccurate imputation. Since the imputation accuracy of particularly rare markers tend to be low, an improved imputation of the ILCCO samples is planned on newer panels that contain more SNPS with low MAF.

Conclusion
In summary, Iam hiQ is a newly proposed pair of accuracy measures for imputed genotypes. In contrast to others, it addresses directly the contents of individual-specific genotype information and the heterogeneity between dosages. It is independent of the imputation platform and can be computed for all imputed variants. We recommend using Iam hiQ additional to other accuracy scores for variation filtering before stepping into the analysis of imputed GWAS data.

Availability of data and materials
A macro for SAS ® 9.4 to calculate the measures Iam HWE , Iam chance and hiQ for autosomal markers based on the dosage-file as output of IMPUTE2 is provided with the Additional file 1.

Novel accuracy measure
In the following we will consider m = 1 to M markers with two alleles ( a and A ) and a MAF f A in the source population of the study sample consisting of N individuals. The three possible genotypes aa, aA, and AA are indicated by allele doses 0, 1 and 2 (equal to the number of minor alleles A of a genotype). Imputation will result in triplets of a-posteriori genotype probabilities p 0 p 1 p 2 , referred to as dosages, with 2 g=0 p g = 1 . We assume the whole uncertainty related to genotype imputation is contained in these triplets. The allele dose of an individual i for an imputed marker will then be d i,m = 2 i=0 i · p i and can take any value between 0 and 2. Multi-allelic markers are assumed to be split into pseudo-two-allele variants.

Index of individual-specific versus population-specific genotype information: Iam
To quantify the amount of individual-specific versus population-specific genotype information in the dosages of the imputed single marker m for a single person i, we first consider the following three marginal situations: (1) The triplet of dosages takes on the values 1 0 0 , or in a different order, if imputation is fully sufficient, when the missing genotype is unambiguously derived from the reference panel. Thus, the dosages contain fully individual-specific genotype information. To construct an index to distinguish dosages 1 0 0 from 1/3 1/3 1/3 , or respectively, we were guided by the well-established Herfindahl-Hirschman Index (HHI) [26]. HHI is a concentration measure for distributions of discrete random variables with k possible realisations, defined as = k i=1 p 2 k . HHI ranges from 1 (if p j = 1 and p k =j = 0 ; alike (i)) to 1/k (if all p k = 1/k ; alike [ii]). Because we are interested in anti-concentration, the opposite of HHI, we first define the quantity.   Iam chance = 0 indicates that the 3 genotypes are equally likely, averaged over all individuals. Therefore, imputation did not contribute any information at all.
Iam HWE = 0 indicates that the genotypes are just as likely as under the HWE, averaged over all individuals. Therefore, imputation contributes only information of MAF in the population (respectively the reference sample), but not for further individual-specific information.
The computation of both Iam indices requires only the dosages provided by the imputation program used. For case-control or cross-sectional studies MAF can by estimated by averaging the allele doses across all individuals, using the same data: f A will be calculated fair enough for the outlined purpose even for markers that are associated to a trait and therefore have different MAFs between affected and unaffected individuals. The same applies in the presence of low grade hidden relationships. However, if the study sample consists of relatives, it is advisable to consider only unrelated founders for the estimation of f A . Q HWE,i,m depends on MAF. For rare markers Q HWE,i,m is much closer to 0 than for common markers (see Table 3). In case all dosages correspond to HWE (as in situation [iii]) Iam chance for common markers is close to 0 (indicating inaccurate imputation), whereas for rare markers it is close to 1 (misleadingly indicating accurate imputation). This shows that it is fairly hard to determine the content of individual-specific information in the triplet of dosages of rare markers.
It is possible that Iam HWE,i,m takes negative values, if the majority of triplets of dosages can be located between the non-informative and population-informative case. This might be caused by genotyping errors as well as by small deviations between sample and population MAF or locally increased inbreeding coefficients in the source population. Some values of Q i,m will then be between 2/3 and Q HWE,i,m . Due to the upper mentioned shift of Q HWE,i,m by MAF, this is more likely for rare than for common markers. However, small negative values should be regarded as occurred by pure chance.

Index of heterogeneity in quantities: hiQ
Inter-individual heterogeneity of dosages for a marker m is a second concern with respect to the usability of imputed genotypes. Consider the following example: Table 4 gives two markers with average dosages 0.6 0.3 0.1 across 10 individuals. Marker 1 is not suitable for any data analysis, because all dosages are identical. The best guess for all individuals is genotype aa. In contrast, marker 2 consists of three different dosages, leading to different best guess genotypes for the individuals. This heterogeneity serves power for statistical testing.
To construct an index of heterogeneity in quantities of dosages (hiQ) we compare "average dosages"(ad) across all individuals with "average of best guess dosages" (ab) applying the Hellinger H-distance. The H-distance quantifies the distance between two (trinomial) probability distributions, taking value H = 0 in case of coincident probability distributions and H = 1 if the probability vectors are perpendicular [27,28]. Therefore, we defined In Table 4 the "average of best guess dosages" for marker 1 f bg g = 1 0 0 compared to the average dosages ( f ad g = 0.6 0.3 0.1 ) yields an hiQ = 1 − 1 − √ 0.6 = 0.53 . This indicates a loss of heterogeneity between dosages and reduced power of a statistical test. For marker 2, where f ad g = f bg g , hiQ takes on the value 1, indicating fully achievable inter-individual heterogeneity and no reduced power of a statistical test.
A SAS ® macro far calculating Iam HWE , Iam chance and hiQ based on the dosage-file as output of IMPUTE2 is included in the Additional file 1.

Application to a sample of lung cancer patients and controls
We applied the novel indices to a dataset of the Integrative Analysis of Lung Cancer Etiology and Risk program of the International Lung Cancer Consortium (INTEGRAL-ILCCO) to examine the behaviour of Iam hiQ, to find appropriate thresholds for marker filtering and for comparison with an established accuracy measure. The sample comprises 14,803 lung cancer cases and 12,262 controls of European descent. They were genotyped on the OncoArray, which queried 517,482 SNPs. The array is designed to cover the whole genome (with GWAS backbone) and for fine mapping of susceptibility to common cancers as well as for de novo discovery, and hence is enriched with low frequent and rare variants [29]. About 50% of markers are considered as GWAS backbone. Details of the sample, the genotyping and the quality control are given elsewhere [30]. The OncoArray whole-genome data were imputed in a two-stage procedure, using SHAPEIT to derive phased genotypes and IMPUTEv2 to infer additional genotypes for genetic variants included in the 1000 Genomes Project (phase 3 panel) [6,15]. We restricted calculations and comparisons to markers of the autosomes. A total of n = 10,427,599 SNPs f ad g f bg g . Table 4 Inter-individual heterogeneity of dosages: example were finally included in this quality assessment. Presumably difficult to impute, due to their MAF, are 5,226,623 of these SNPs (50%) with a MAF lower than 1% and 1,500,835 SNPs with a MAF between 1 and 5% (Additional file 2).