KRLMM: an adaptive genotype calling method for common and low frequency variants
 Ruijie Liu^{1},
 Zhiyin Dai^{1},
 Meredith Yeager^{2},
 Rafael A Irizarry^{3}Email author and
 Matthew E Ritchie^{1, 4, 5}Email author
DOI: 10.1186/1471210515158
© Liu et al.; licensee BioMed Central Ltd. 2014
Received: 24 February 2014
Accepted: 19 May 2014
Published: 23 May 2014
Abstract
Background
SNP genotyping microarrays have revolutionized the study of complex disease. The current range of commercially available genotyping products contain extensive catalogues of low frequency and rare variants. Existing SNP calling algorithms have difficulty dealing with these low frequency variants, as the underlying models rely on each genotype having a reasonable number of observations to ensure accurate clustering.
Results
Here we develop KRLMM, a new method for converting raw intensities into genotype calls that aims to overcome this issue. Our method is unique in that it applies careful between sample normalization and allows a variable number of clusters k (1, 2 or 3) for each SNP, where k is predicted using the available data. We compare our method to four genotyping algorithms (GenCall, GenoSNP, Illuminus and OptiCall) on several Illumina data sets that include samples from the HapMap project where the true genotypes are known in advance. All methods were found to have high overall accuracy (> 98%), with KRLMM consistently amongst the best. At low minor allele frequency, the KRLMM, OptiCall and GenoSNP algorithms were observed to be consistently more accurate than GenCall and Illuminus on our test data.
Conclusions
Methods that tailor their approach to calling low frequency variants by either varying the number of clusters (KRLMM) or using information from other SNPs (OptiCall and GenoSNP) offer improved accuracy over methods that do not (GenCall and Illuminus). The KRLMM algorithm is implemented in the opensource crlmm package distributed via the Bioconductor project (http://www.bioconductor.org).
Keywords
Genotyping Clustering Microarray data analysisBackground
Microarray technology has revolutionized the study of the genetics of complex disease, with largescale casecontrol studies completed in the past 7 years [1, 2] uncovering many thousands of new susceptibility loci for a wide range of autoimmune, mental and cardiovascular disorders and cancers [3]. These discoveries were made possible by the pioneering work of the HapMap project [4] in cataloguing genetic variation in multiple human populations. This collection has recently been expanded upon by the 1000 genomes project [5] which has enhanced the catalogue of low frequency and rare variation (defined as polymorphism with a minor allele frequency (MAF) of 0.5%–5% and < 0.5% respectively). Singlenucleotide polymorphism (SNP) microarrays with increased coverage of low frequency and rare variants are available from both major manufacturer’s Affymetrix and Illumina. Previous comparisons of genotyping methods have shown that many popular algorithms have reduced call accuracy for these SNPs [6–8]. Optimal analysis of this new content will therefore depend on calling algorithms that are capable of making sensible calls even when there are few observations, as is the case for genotypes involving the minor allele.
Summary of the Illumina data sets analyzed
Platform  # SNPs per  Data set for  MAF  # HapMap 

sample  SNP selection  samples  
Omni1Quad  ∼ 1.1 million  HapMap  > 5%  267 (88:44:45:90:0) 
Omni2.5Quad  ∼ 2.5 million  1000 genomes  > 2.5%  171 (0:43:45:0:83) 
Omni5Quad  ∼ 4.3 million  1000 genomes  > 1%  341 (172:44:40:85:0) 
In this paper, we introduce KRLMM, a new genotype calling method for Illumina BeadArray data that takes a novel approach to that of other methods by allowing a variable number of clusters (k=1, 2 or 3) to be fitted to the between sample normalized intensity data. We analyze datasets from a number of platforms and highlight the benefit of careful signal adjustment between samples to optimize calling accuracy. We compare this approach to four existing algorithms (GenCall, GenoSNP, Illuminus, and OptiCall) by analyzing data sets with increasing coverage of low frequency/rare variants, and compare the performance of these methods in terms of accuracy at varying minor allele frequency. KRLMM is shown to perform favourably in most comparisons, particularly at low MAF.
Methods
Data sets
Signal characteristics
KRLMM algorithm
Preprocessing
Regression to choose k
The KRLMM algorithm is a onedimensional clustering method that uses kmeans clustering (as implemented in the kmeans R function) for a variable number of clusters, where the choice of k is determined by the SNPspecific signal. As shown in Figure 1, an ideal clustering might be relatively tight (i.e. low residual sum of squares) rather than diffuse and have cluster centers with low bias when compared to the consensus positions of the AA, AB and BB clusters obtained by looking across all SNPs (as measured by the Mahalanobis distance defined below). It should also assign an appropriate number of calls to each genotype in order to obey HardyWeinberg equilibrium. A clustering that performs well according to these criteria is likely to be more accurate than one that scores poorly in one or more of these areas. To exploit these underlying signal characteristics and genetic principles, we applied logistic regression using these variables to predict k independently for each SNP.
To obtain an initial estimate of the cluster centers, kmeans clustering with k = 3 is first applied to normalized logratios from each SNP separately. A vector containing the median values for the AA, AB and BB clusters (μ_{ k }) is used as a consensus value, and the variancecovariance matrix (${\widehat{\mathbf{V}}}_{k}$) estimated. The SNPspecific (i) predictors calculated across all n_{ i }samples (j is the sample index) used in the regression model include the residual sum of squares (${R}_{\mathit{\text{ik}}}=\sum _{k}\sum _{j}{({M}_{\mathit{\text{ij}}}{\widehat{\mu}}_{\mathit{\text{ik}}})}^{2}$ for k clusters, where M_{ ij }is the normalized logratio), Mahalanobis distance (${D}_{\mathit{\text{ik}}}=({\mathbf{x}}_{\mathit{\text{ik}}}{\widehat{\mathit{\mu}}}_{k})\widehat{{\mathbf{V}}_{k}}{({\mathbf{x}}_{\mathit{\text{ik}}}{\widehat{\mathit{\mu}}}_{k})}^{T}$, where x_{ ik }is a vector of cluster centers from a given kmeans clustering) and agreement with HardyWeinberg equilibrium (${H}_{\mathit{\text{ik}}}=\sum _{l=1}^{k}\frac{{({N}_{\mathit{\text{il}}}{n}_{i}{r}_{\mathit{\text{il}}})}^{2}}{\left({n}_{i}{r}_{\mathit{\text{il}}}\right)}$, ${r}_{i1}={p}_{i}^{2}$, r_{i2 }= 2 p_{ i }(1p_{ i }), r_{i3 }= (1p_{ i })^{2}, N_{i1 }= number of AA calls made, N_{i2 }= number of AB calls made and ${p}_{i}=\frac{2{N}_{i1}\phantom{\rule{0.3em}{0ex}}+\phantom{\rule{0.3em}{0ex}}{N}_{i2})}{2{n}_{i}}$ is the empirical major allele frequency based on a given number of clusters, k). Each variable is calculated for each SNP using the cluster assignments obtained via kmeans clustering with k = 1,2,3 (agreement with HardyWeinberg equilibrium was not calculated for k = 1 as this quantity is not informative) leaving 8 variables in the regression. Coefficients for each parameter were estimated from fitting this model to a training set of 10,000 randomly chosen SNPs from the HapMap data sets. The independent genotype calls provide us with the true k for these SNPs. Both ordered logistic regression (assumes that the groups k = 1,2,3 are ordered, consistent with increasing dose (0, 1 or 2 copies) of the alternate allele), or regular logistic regression (no ordering assumed) as available in the polr (MASS R package) and vglm functions (VGAM R package) respectively were used. Once regression coefficients are available, and given a complete set of covariates (R_{ ik }, D_{ ik }, H_{ ik }for all k), the best k is determined for each SNP by obtaining fitted values from the model and choosing the k with maximum probability. Next kmeans clustering is applied using the predicted value of k to obtain genotype calls. This approach is applied to all SNPs from autosomes and pseudoautosomal (XY) regions of the genome.
Call confidence measures
The silhouette width (${\mathit{\text{SW}}}_{\mathit{\text{ij}}}=\frac{{\widehat{b}}_{\mathit{\text{ij}}}{\u0175}_{\mathit{\text{ij}}}}{\mathit{\text{max}}({\u0175}_{\mathit{\text{ij}}},{\widehat{b}}_{\mathit{\text{ij}}})}$, where ${\widehat{b}}_{\mathit{\text{ij}}}$ is the smallest average between cluster distance for the ijth observation and all other observations in a different cluster and ${\u0175}_{\mathit{\text{ij}}}$ is the average within cluster distance for the ijth observation and all other observations from the same cluster) as calculated in PAM clustering [21] is used as a call confidence measure. This value will be near 1 for calls made with high confidence (${\u0175}_{\mathit{\text{ij}}}$ will be small relative to ${\widehat{b}}_{\mathit{\text{ij}}}$) and 1 for low certainty calls (where ${\u0175}_{\mathit{\text{ij}}}$ is large relative to ${\widehat{b}}_{\mathit{\text{ij}}}$ and dominates the calculation).
Choosing an optimal Preprocessing/Regression combination
Calling SNPs from sex chromosomes
SNPs outside of pseudoautosomal regions on the X and Y chromosomes represent special cases that are called separately for males and females to allow for the appropriate number of clusters (Y chromosome: No call for females, k = 1 or 2 clusters are permitted in males since they have 1 copy of this chromosome. X chromosome: k = 1,2,3 clusters are allowed for females, since they have normal copy number, and k = 1 or 2 is allowed for males since they are hemizygous). When gender (male/female) of the samples is not specified, the average intensities (S, xaxis in Figure 1) from the Y chromosome SNPs are used to impute this information. Applying kmeans clustering with k = 2 clusters separately to all chromosome Y markers should assign females to the low signal group (i.e. background hybridization only from these probes since females don’t have a Y chromosome) and the males to a second high intensity group that corresponds to signal from one copy of the Y chromosome. Samples are then assigned as female or male by looking at which cluster (low intensity = female, high intensity = male) is closer on average by comparing the median cluster position determined by kmeans and the median Svalue for these SNPs within each sample. This simple approach was found to be 100% accurate on each of the 3 platforms upon comparison of the imputed gender with the information provided by the HapMap database.
Implementation
Other algorithms
Table 2 summarizes the major features of the 5 algorithms compared in terms of normalization method and underlying model. Normalization occurs either within (GenCall, Illuminus, GenoSNP, OptiCall) or between sample (KRLMM). Likewise, the modelbased clustering can occur within sample (GenoSNP), between sample (GenCall, Illuminus, KRLMM) or in a hybrid manner (OptiCall). For a review of the first 3 methods, see Ritchie et al. (2011) [6].
The OptiCall algorithm is a hybrid between GenoSNP and Illuminus. It first makes use of data from 50,000 randomly chosen intensity values to do an initial clustering using a mixture model with 4 states (AA, AB, BB and NoCall’). This step is akin to GenoSNP’s between SNP approach for genotyping. OptiCall then clusters SNPbySNP using a model similar to the one used in Illuminus, with added hierarchical structure including priors derived from the initial clustering. In clusters with few observations, the prior ensures the clusters are sensibly placed, thus overcoming one of the shortcomings in Illuminus. We chose OptiCall as a representative method from the newer calling methods tailored for low frequency/rare variants. M ^{3}[7] was also considered, however its implementation in MATLAB (which requires a licence) precluded us from using this software. Another option is the zCall algorithm [19], which unlike the other methods compared, begins with the output of GenCall rather than the raw intensity data. Such a postcalling correction method could presumably be adapted to improve the calls from any method once calls and confidence values are available from the raw data.
Results and discussion
Conclusion
Overall, the KRLMM algorithm marginally outperforms the vendor’s own alternative (GenCall) in terms of accuracy, even at low minor allele frequencies. For low frequency variants, our results show the merit of approaches that use the signal from other SNPs either directly in their models (OptiCall, GenoSNP) or indirectly to help choose the appropriate number of clusters (KRLMM) to give more accurate calls. Such approaches offer modest improvements over GenCall and Illuminus that do not have a tailored approach for dealing with SNPs with low MAF.
The KRLMM method uses kmeans clustering of the logratios obtained after between sample quantile normalization of the raw intensities and logistic regression to determine the number of clusters to fit to each SNP. It is the only algorithm to date that allows flexibility in the number of clusters. Although the data we have used to benchmark our approach (generated inhouse at Illumina) is known to be of very high quality, we are still able to detect small performance differences between our method and the vendor provided alternative. These performance differences are expected to be more pronounced in disease association studies, where samples are collected over an extended period and will be subject to additional sources of variation. The benefit of KRLMM’s between sample normalization is anticipated to deliver additional performance gains in such settings.
The KRLMM method requires a small subset of training data (i.e. 10,000 SNPs with known k). Although this information was obtained from HapMap data in our case, the regression coefficients could just as easily be estimated using calls from another method if such data were more readily available, as may be the case for platforms from other species. Support for newer Illumina platforms can be easily added and future extensions of the method to handle data from the twocolor Axiom platform from Affymetrix are also conceivable. Another adaptation would be to allow k > 3 so that KRLMM could be used on cancer samples where gains and losses in the genome introduce copy number variation.
Future comparisons of the performance of different genotyping algorithms on data from a genomewide association study (GWAS) or suitable non HapMap control data set, such as the large collection of reference samples in Wang et al. (2011) [25] would also be instructive. Previous work has demonstrated that the choice of method can influence GWAS results obtained from the Affymetrix platform [26]. This important extension will determine whether the same holds true for studies that use Illumina SNP arrays. Publications on this platform to date predominantly rely on calls from the GenCall or Illuminus algorithms. This will be of particular interest at the rare or low frequency end of the minor allele spectrum, especially when it comes to any significant associations detected or missed by particular genotyping methods. Access to a suitable reference GWAS data set where raw data for both cases and controls is available will be the key to such a comparison.
Abbreviations
 GWAS:

Genomewide association study
 KRLMM:

The new genotype calling algorithm described in this article
 MAF:

Minor allele frequency
 SNP:

Singlenucleotide polymorphism.
Declarations
Acknowledgements
We thank Gordon Smyth for advice on regression analysis. This research was supported by NHMRC Project grant 1050661 (RL, MR), Victorian State Government Operational Infrastructure Support, Australian Government NHMRC IRIISS and NIH grant R01GM083084 (RI).
Authors’ Affiliations
References
 Wellcome Trust Case Control Consortium: Genomewide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature. 2007, 447 (7145): 661678. 10.1038/nature05911.View ArticleGoogle Scholar
 The Australia and New Zealand Multiple Sclerosis Genetics Consortium (ANZgene): Genomewide association study identifies new multiple sclerosis susceptibility loci on chromosomes 12 and 20. Nat Genet. 2009, 41 (7): 824828. 10.1038/ng.396.View ArticleGoogle Scholar
 Yu W, Gwinn M, Clyne M, Yesupriya A, Khoury MJ: A navigator for human genome epidemiology. Nat Genet. 2008, 40 (2): 124125. 10.1038/ng0208124.View ArticlePubMedGoogle Scholar
 International HapMap Consortium: A second generation human haplotype map of over 3.1 million SNPs. Nature. 2007, 449 (7164): 851861. 10.1038/nature06258.View ArticleGoogle Scholar
 The 1000 Genomes Project Consortium: A map of human genome variation from populationscale sequencing. Nature. 2010, 467 (7319): 10611073. 10.1038/nature09534.View ArticlePubMed CentralGoogle Scholar
 Ritchie ME, Liu R, Carvalho BS, Irizarry RA, Australia and New Zealand Multiple Sclerosis Genetics Consortium (ANZgene): Comparing genotyping algorithms for illumina’s infinium wholegenome SNP beadchips. BMC Bioinformatics. 2011, 12: 6810.1186/147121051268.View ArticlePubMed CentralPubMedGoogle Scholar
 Li G, Gelernter J, Kranzler HR, Zhao H: M^{3}: an improved SNP calling algorithm for Illumina BeadArray data. Bioinformatics. 2012, 28 (3): 358365. 10.1093/bioinformatics/btr673.View ArticlePubMed CentralPubMedGoogle Scholar
 Shah TS, Liu JZ, Floyd JA, Morris JA, Wirth N, Barrett JC, Anderson CA: OptiCall: a robust genotypecalling algorithm for rare, lowfrequency and common variants. Bioinformatics. 2012, 28 (12): 15981603. 10.1093/bioinformatics/bts180.View ArticlePubMed CentralPubMedGoogle Scholar
 Steemers FJ, Chang W, Lee G, Barker DL, Shen R, Gunderson KL: Wholegenome genotyping with the singlebase extension assay. Nat Methods. 2006, 3 (1): 3133. 10.1038/nmeth842.View ArticlePubMedGoogle Scholar
 Peiffer DA, Le JM, Steemers FJ, Chang W, Jenniges T, Garcia F, Haden K, Li J, Shaw CA, Belmont J, Cheung SW, Shen RM, Barker DL, Gunderson KL: Highresolution genomic profiling of chromosomal aberrations using infinium wholegenome genotyping. Genome Res. 2006, 16 (9): 11361148. 10.1101/gr.5402306.View ArticlePubMed CentralPubMedGoogle Scholar
 Kermani BG: Artificial intelligence and global normalization methods for genotyping. 2005, [http://patentimages.storage.googleapis.com/pdfs/US7035740.pdf],Google Scholar
 Giannoulatou E, Yau C, Colella S, Ragoussis J, Holmes CC: Genosnp: a variational Bayes withinsample SNP genotyping algorithm that does not require a reference population. Bioinformatics. 2008, 24 (19): 22092214. 10.1093/bioinformatics/btn386.View ArticlePubMedGoogle Scholar
 Teo YY, Inouye M, Small KS, Gwilliam R, Deloukas P, Kwiatkowski DP, Clark TG: A genotype calling algorithm for the Illumina BeadArray platform. Bioinformatics. 2007, 23 (20): 27412746. 10.1093/bioinformatics/btm443.View ArticlePubMed CentralPubMedGoogle Scholar
 Carvalho B, Bengtsson H, Speed TP, Irizarry RA: Exploration, normalization, and genotype calls of highdensity oligonucleotide SNP array data. Biostatistics. 2007, 8 (2): 485499. 10.1093/biostatistics/kxl042.View ArticlePubMedGoogle Scholar
 Ritchie ME, Carvalho BS, Hetrick KN, Irizarry RA, Tavaré S: R/Bioconductor software for Illumina’s Infinium wholegenome genotyping BeadChips. Bioinformatics. 2009, 25 (19): 26212623. 10.1093/bioinformatics/btp470.View ArticlePubMed CentralPubMedGoogle Scholar
 Scharpf RB, Irizarry RA, Ritchie ME, Carvalho B, Ruczinski I: Using the R package crlmm for genotyping and copy number estimation. J Stat Softw. 2011, 40: 132.View ArticlePubMed CentralPubMedGoogle Scholar
 Korn JM, Kuruvilla FG, McCarroll SA, Wysoker A, Nemesh J, Cawley S, Hubbell E, Veitch J, Collins PJ, Darvishi K, Lee C, Nizzari MM, Gabriel SB, Purcell S, Daly MJ, Altshuler D: Integrated genotype calling and association analysis of SNPs, common copy number polymorphisms and rare CNVs. Nat Genet. 2008, 40 (10): 12531260. 10.1038/ng.237.View ArticlePubMed CentralPubMedGoogle Scholar
 Browning BL, Yu Z: Simultaneous genotype calling and haplotype phasing improves genotype accuracy and reduces falsepositive associations for genomewide association studies. Am J Hum Genet. 2009, 85 (6): 847861. 10.1016/j.ajhg.2009.11.004.View ArticlePubMed CentralPubMedGoogle Scholar
 Goldstein JI, Crenshaw A, Carey J, Grant GB, Maguire J, Fromer M, O’Dushlaine C, Moran JL, Chambert K, Stevens C, Sklar P, Hultman CM, Purcell S, McCarroll SA, Sullivan PF, Daly MJ, Neale BM, Swedish Schizophrenia, Consortium: zCall: a rare variant caller for arraybased genotyping: genetics and population analysis. Bioinformatics. 2012, 28 (19): 25432545. 10.1093/bioinformatics/bts479.View ArticlePubMed CentralPubMedGoogle Scholar
 Kampstra P: Beanplot: a boxplot alternative for visual comparison of distributions. J Stat Softw. 2008, 28: 19.View ArticleGoogle Scholar
 Kaufman L, Rousseeuw PJ: Finding Groups in Data: An Introduction to Cluster Analysis. 1990, New York: WileyView ArticleGoogle Scholar
 R Development Core Team: R: A Language and Environment for Statistical Computing. 2013, Vienna: R Foundation for Statistical Computing, [http://www.Rproject.org]Google Scholar
 Gentleman RC, Carey VJ, Bates DM, Bolstad B, Dettling M, Dudoit S, Ellis B, Gautier L, Ge Y, Gentry J, Hornik K, Hothorn T, Huber W, Iacus S, Irizarry R, Leisch F, Li C, Maechler M, Rossini AJ, Sawitzki G, Smith C, Smyth G, Tierney L, Yang JY, Zhang J: Bioconductor: open software development for computational biology and bioinformatics. Genome Biol. 2004, 5 (10): 8010.1186/gb2004510r80.View ArticleGoogle Scholar
 Smith ML, Baggerly KA, Bengtsson H, Ritchie ME, Hansen KD: illuminaio: An open source IDAT parsing tool for Illumina microarrays. F1000Research. 2013, 2: 264PubMed CentralPubMedGoogle Scholar
 Wang Z, Jacobs KB, Yeager M, Hutchinson A, Sampson J, Chatterjee N, Albanes D, Berndt SI, Chung CC, Diver WR, Gapstur SM, Teras LR, Haiman CA, Henderson BE, Stram D, Deng X, Hsing AW, Virtamo J, Eberle MA, Stone JL, Purdue MP, Taylor P, Tucker M, Chanock SJ: Improved imputation of common and uncommon SNPs with a new reference set. Nat Genet. 2011, 44 (1): 67. 10.1038/ng.1044.View ArticlePubMed CentralPubMedGoogle Scholar
 Miclaus K, Chierici M, Lambert C, Zhang L, Vega S, Hong H, Yin S, Furlanello C, Wolfinger R, FG: Variability in GWAS analysis: the impact of genotype calling algorithm inconsistencies. Pharmacogenomics J. 2010, 10 (4): 324335. 10.1038/tpj.2010.46.View ArticlePubMedGoogle Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.