KRLMM: an adaptive genotype calling method for common and low frequency variants
© Liu et al.; licensee BioMed Central Ltd. 2014
Received: 24 February 2014
Accepted: 19 May 2014
Published: 23 May 2014
SNP genotyping microarrays have revolutionized the study of complex disease. The current range of commercially available genotyping products contain extensive catalogues of low frequency and rare variants. Existing SNP calling algorithms have difficulty dealing with these low frequency variants, as the underlying models rely on each genotype having a reasonable number of observations to ensure accurate clustering.
Here we develop KRLMM, a new method for converting raw intensities into genotype calls that aims to overcome this issue. Our method is unique in that it applies careful between sample normalization and allows a variable number of clusters k (1, 2 or 3) for each SNP, where k is predicted using the available data. We compare our method to four genotyping algorithms (GenCall, GenoSNP, Illuminus and OptiCall) on several Illumina data sets that include samples from the HapMap project where the true genotypes are known in advance. All methods were found to have high overall accuracy (> 98%), with KRLMM consistently amongst the best. At low minor allele frequency, the KRLMM, OptiCall and GenoSNP algorithms were observed to be consistently more accurate than GenCall and Illuminus on our test data.
Methods that tailor their approach to calling low frequency variants by either varying the number of clusters (KRLMM) or using information from other SNPs (OptiCall and GenoSNP) offer improved accuracy over methods that do not (GenCall and Illuminus). The KRLMM algorithm is implemented in the open-source crlmm package distributed via the Bioconductor project (http://www.bioconductor.org).
KeywordsGenotyping Clustering Microarray data analysis
Microarray technology has revolutionized the study of the genetics of complex disease, with large-scale case-control studies completed in the past 7 years [1, 2] uncovering many thousands of new susceptibility loci for a wide range of autoimmune, mental and cardiovascular disorders and cancers . These discoveries were made possible by the pioneering work of the HapMap project  in cataloguing genetic variation in multiple human populations. This collection has recently been expanded upon by the 1000 genomes project  which has enhanced the catalogue of low frequency and rare variation (defined as polymorphism with a minor allele frequency (MAF) of 0.5%–5% and < 0.5% respectively). Single-nucleotide polymorphism (SNP) microarrays with increased coverage of low frequency and rare variants are available from both major manufacturer’s Affymetrix and Illumina. Previous comparisons of genotyping methods have shown that many popular algorithms have reduced call accuracy for these SNPs [6–8]. Optimal analysis of this new content will therefore depend on calling algorithms that are capable of making sensible calls even when there are few observations, as is the case for genotypes involving the minor allele.
Summary of the Illumina data sets analyzed
# SNPs per
Data set for
∼ 1.1 million
∼ 2.5 million
∼ 4.3 million
In this paper, we introduce KRLMM, a new genotype calling method for Illumina BeadArray data that takes a novel approach to that of other methods by allowing a variable number of clusters (k=1, 2 or 3) to be fitted to the between sample normalized intensity data. We analyze datasets from a number of platforms and highlight the benefit of careful signal adjustment between samples to optimize calling accuracy. We compare this approach to four existing algorithms (GenCall, GenoSNP, Illuminus, and OptiCall) by analyzing data sets with increasing coverage of low frequency/rare variants, and compare the performance of these methods in terms of accuracy at varying minor allele frequency. KRLMM is shown to perform favourably in most comparisons, particularly at low MAF.
Regression to choose k
The KRLMM algorithm is a one-dimensional clustering method that uses k-means clustering (as implemented in the kmeans R function) for a variable number of clusters, where the choice of k is determined by the SNP-specific signal. As shown in Figure 1, an ideal clustering might be relatively tight (i.e. low residual sum of squares) rather than diffuse and have cluster centers with low bias when compared to the consensus positions of the AA, AB and BB clusters obtained by looking across all SNPs (as measured by the Mahalanobis distance defined below). It should also assign an appropriate number of calls to each genotype in order to obey Hardy-Weinberg equilibrium. A clustering that performs well according to these criteria is likely to be more accurate than one that scores poorly in one or more of these areas. To exploit these underlying signal characteristics and genetic principles, we applied logistic regression using these variables to predict k independently for each SNP.
To obtain an initial estimate of the cluster centers, k-means clustering with k = 3 is first applied to normalized log-ratios from each SNP separately. A vector containing the median values for the AA, AB and BB clusters (μ k ) is used as a consensus value, and the variance-covariance matrix () estimated. The SNP-specific (i) predictors calculated across all n i samples (j is the sample index) used in the regression model include the residual sum of squares ( for k clusters, where M ij is the normalized log-ratio), Mahalanobis distance (, where x ik is a vector of cluster centers from a given k-means clustering) and agreement with Hardy-Weinberg equilibrium (, , ri2 = 2 p i (1-p i ), ri3 = (1-p i )2, Ni1 = number of AA calls made, Ni2 = number of AB calls made and is the empirical major allele frequency based on a given number of clusters, k). Each variable is calculated for each SNP using the cluster assignments obtained via k-means clustering with k = 1,2,3 (agreement with Hardy-Weinberg equilibrium was not calculated for k = 1 as this quantity is not informative) leaving 8 variables in the regression. Coefficients for each parameter were estimated from fitting this model to a training set of 10,000 randomly chosen SNPs from the HapMap data sets. The independent genotype calls provide us with the true k for these SNPs. Both ordered logistic regression (assumes that the groups k = 1,2,3 are ordered, consistent with increasing dose (0, 1 or 2 copies) of the alternate allele), or regular logistic regression (no ordering assumed) as available in the polr (MASS R package) and vglm functions (VGAM R package) respectively were used. Once regression coefficients are available, and given a complete set of covariates (R ik , D ik , H ik for all k), the best k is determined for each SNP by obtaining fitted values from the model and choosing the k with maximum probability. Next k-means clustering is applied using the predicted value of k to obtain genotype calls. This approach is applied to all SNPs from autosomes and pseudo-autosomal (XY) regions of the genome.
Call confidence measures
The silhouette width (, where is the smallest average between cluster distance for the ijth observation and all other observations in a different cluster and is the average within cluster distance for the ijth observation and all other observations from the same cluster) as calculated in PAM clustering  is used as a call confidence measure. This value will be near 1 for calls made with high confidence ( will be small relative to ) and -1 for low certainty calls (where is large relative to and dominates the calculation).
Choosing an optimal Preprocessing/Regression combination
Calling SNPs from sex chromosomes
SNPs outside of pseudo-autosomal regions on the X and Y chromosomes represent special cases that are called separately for males and females to allow for the appropriate number of clusters (Y chromosome: No call for females, k = 1 or 2 clusters are permitted in males since they have 1 copy of this chromosome. X chromosome: k = 1,2,3 clusters are allowed for females, since they have normal copy number, and k = 1 or 2 is allowed for males since they are hemizygous). When gender (male/female) of the samples is not specified, the average intensities (S, x-axis in Figure 1) from the Y chromosome SNPs are used to impute this information. Applying k-means clustering with k = 2 clusters separately to all chromosome Y markers should assign females to the low signal group (i.e. background hybridization only from these probes since females don’t have a Y chromosome) and the males to a second high intensity group that corresponds to signal from one copy of the Y chromosome. Samples are then assigned as female or male by looking at which cluster (low intensity = female, high intensity = male) is closer on average by comparing the median cluster position determined by k-means and the median S-value for these SNPs within each sample. This simple approach was found to be 100% accurate on each of the 3 platforms upon comparison of the imputed gender with the information provided by the HapMap database.
Table 2 summarizes the major features of the 5 algorithms compared in terms of normalization method and underlying model. Normalization occurs either within (GenCall, Illuminus, GenoSNP, OptiCall) or between sample (KRLMM). Likewise, the model-based clustering can occur within sample (GenoSNP), between sample (GenCall, Illuminus, KRLMM) or in a hybrid manner (OptiCall). For a review of the first 3 methods, see Ritchie et al. (2011) .
The OptiCall algorithm is a hybrid between GenoSNP and Illuminus. It first makes use of data from 50,000 randomly chosen intensity values to do an initial clustering using a mixture model with 4 states (AA, AB, BB and NoCall’). This step is akin to GenoSNP’s between SNP approach for genotyping. OptiCall then clusters SNP-by-SNP using a model similar to the one used in Illuminus, with added hierarchical structure including priors derived from the initial clustering. In clusters with few observations, the prior ensures the clusters are sensibly placed, thus overcoming one of the shortcomings in Illuminus. We chose OptiCall as a representative method from the newer calling methods tailored for low frequency/rare variants. M 3 was also considered, however its implementation in MATLAB (which requires a licence) precluded us from using this software. Another option is the zCall algorithm , which unlike the other methods compared, begins with the output of GenCall rather than the raw intensity data. Such a post-calling correction method could presumably be adapted to improve the calls from any method once calls and confidence values are available from the raw data.
Results and discussion
Overall, the KRLMM algorithm marginally outperforms the vendor’s own alternative (GenCall) in terms of accuracy, even at low minor allele frequencies. For low frequency variants, our results show the merit of approaches that use the signal from other SNPs either directly in their models (OptiCall, GenoSNP) or indirectly to help choose the appropriate number of clusters (KRLMM) to give more accurate calls. Such approaches offer modest improvements over GenCall and Illuminus that do not have a tailored approach for dealing with SNPs with low MAF.
The KRLMM method uses k-means clustering of the log-ratios obtained after between sample quantile normalization of the raw intensities and logistic regression to determine the number of clusters to fit to each SNP. It is the only algorithm to date that allows flexibility in the number of clusters. Although the data we have used to benchmark our approach (generated in-house at Illumina) is known to be of very high quality, we are still able to detect small performance differences between our method and the vendor provided alternative. These performance differences are expected to be more pronounced in disease association studies, where samples are collected over an extended period and will be subject to additional sources of variation. The benefit of KRLMM’s between sample normalization is anticipated to deliver additional performance gains in such settings.
The KRLMM method requires a small subset of training data (i.e. 10,000 SNPs with known k). Although this information was obtained from HapMap data in our case, the regression coefficients could just as easily be estimated using calls from another method if such data were more readily available, as may be the case for platforms from other species. Support for newer Illumina platforms can be easily added and future extensions of the method to handle data from the two-color Axiom platform from Affymetrix are also conceivable. Another adaptation would be to allow k > 3 so that KRLMM could be used on cancer samples where gains and losses in the genome introduce copy number variation.
Future comparisons of the performance of different genotyping algorithms on data from a genome-wide association study (GWAS) or suitable non HapMap control data set, such as the large collection of reference samples in Wang et al. (2011)  would also be instructive. Previous work has demonstrated that the choice of method can influence GWAS results obtained from the Affymetrix platform . This important extension will determine whether the same holds true for studies that use Illumina SNP arrays. Publications on this platform to date predominantly rely on calls from the GenCall or Illuminus algorithms. This will be of particular interest at the rare or low frequency end of the minor allele spectrum, especially when it comes to any significant associations detected or missed by particular genotyping methods. Access to a suitable reference GWAS data set where raw data for both cases and controls is available will be the key to such a comparison.
Genome-wide association study
The new genotype calling algorithm described in this article
Minor allele frequency
We thank Gordon Smyth for advice on regression analysis. This research was supported by NHMRC Project grant 1050661 (RL, MR), Victorian State Government Operational Infrastructure Support, Australian Government NHMRC IRIISS and NIH grant R01GM083084 (RI).
- Wellcome Trust Case Control Consortium: Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature. 2007, 447 (7145): 661-678. 10.1038/nature05911.View ArticleGoogle Scholar
- The Australia and New Zealand Multiple Sclerosis Genetics Consortium (ANZgene): Genome-wide association study identifies new multiple sclerosis susceptibility loci on chromosomes 12 and 20. Nat Genet. 2009, 41 (7): 824-828. 10.1038/ng.396.View ArticleGoogle Scholar
- Yu W, Gwinn M, Clyne M, Yesupriya A, Khoury MJ: A navigator for human genome epidemiology. Nat Genet. 2008, 40 (2): 124-125. 10.1038/ng0208-124.View ArticlePubMedGoogle Scholar
- International HapMap Consortium: A second generation human haplotype map of over 3.1 million SNPs. Nature. 2007, 449 (7164): 851-861. 10.1038/nature06258.View ArticleGoogle Scholar
- The 1000 Genomes Project Consortium: A map of human genome variation from population-scale sequencing. Nature. 2010, 467 (7319): 1061-1073. 10.1038/nature09534.View ArticlePubMed CentralGoogle Scholar
- Ritchie ME, Liu R, Carvalho BS, Irizarry RA, Australia and New Zealand Multiple Sclerosis Genetics Consortium (ANZgene): Comparing genotyping algorithms for illumina’s infinium whole-genome SNP beadchips. BMC Bioinformatics. 2011, 12: 68-10.1186/1471-2105-12-68.View ArticlePubMed CentralPubMedGoogle Scholar
- Li G, Gelernter J, Kranzler HR, Zhao H: M3: an improved SNP calling algorithm for Illumina BeadArray data. Bioinformatics. 2012, 28 (3): 358-365. 10.1093/bioinformatics/btr673.View ArticlePubMed CentralPubMedGoogle Scholar
- Shah TS, Liu JZ, Floyd JA, Morris JA, Wirth N, Barrett JC, Anderson CA: OptiCall: a robust genotype-calling algorithm for rare, low-frequency and common variants. Bioinformatics. 2012, 28 (12): 1598-1603. 10.1093/bioinformatics/bts180.View ArticlePubMed CentralPubMedGoogle Scholar
- Steemers FJ, Chang W, Lee G, Barker DL, Shen R, Gunderson KL: Whole-genome genotyping with the single-base extension assay. Nat Methods. 2006, 3 (1): 31-33. 10.1038/nmeth842.View ArticlePubMedGoogle Scholar
- Peiffer DA, Le JM, Steemers FJ, Chang W, Jenniges T, Garcia F, Haden K, Li J, Shaw CA, Belmont J, Cheung SW, Shen RM, Barker DL, Gunderson KL: High-resolution genomic profiling of chromosomal aberrations using infinium whole-genome genotyping. Genome Res. 2006, 16 (9): 1136-1148. 10.1101/gr.5402306.View ArticlePubMed CentralPubMedGoogle Scholar
- Kermani BG: Artificial intelligence and global normalization methods for genotyping. 2005, [http://patentimages.storage.googleapis.com/pdfs/US7035740.pdf],Google Scholar
- Giannoulatou E, Yau C, Colella S, Ragoussis J, Holmes CC: Genosnp: a variational Bayes within-sample SNP genotyping algorithm that does not require a reference population. Bioinformatics. 2008, 24 (19): 2209-2214. 10.1093/bioinformatics/btn386.View ArticlePubMedGoogle Scholar
- Teo YY, Inouye M, Small KS, Gwilliam R, Deloukas P, Kwiatkowski DP, Clark TG: A genotype calling algorithm for the Illumina BeadArray platform. Bioinformatics. 2007, 23 (20): 2741-2746. 10.1093/bioinformatics/btm443.View ArticlePubMed CentralPubMedGoogle Scholar
- Carvalho B, Bengtsson H, Speed TP, Irizarry RA: Exploration, normalization, and genotype calls of high-density oligonucleotide SNP array data. Biostatistics. 2007, 8 (2): 485-499. 10.1093/biostatistics/kxl042.View ArticlePubMedGoogle Scholar
- Ritchie ME, Carvalho BS, Hetrick KN, Irizarry RA, Tavaré S: R/Bioconductor software for Illumina’s Infinium whole-genome genotyping BeadChips. Bioinformatics. 2009, 25 (19): 2621-2623. 10.1093/bioinformatics/btp470.View ArticlePubMed CentralPubMedGoogle Scholar
- Scharpf RB, Irizarry RA, Ritchie ME, Carvalho B, Ruczinski I: Using the R package crlmm for genotyping and copy number estimation. J Stat Softw. 2011, 40: 1-32.View ArticlePubMed CentralPubMedGoogle Scholar
- Korn JM, Kuruvilla FG, McCarroll SA, Wysoker A, Nemesh J, Cawley S, Hubbell E, Veitch J, Collins PJ, Darvishi K, Lee C, Nizzari MM, Gabriel SB, Purcell S, Daly MJ, Altshuler D: Integrated genotype calling and association analysis of SNPs, common copy number polymorphisms and rare CNVs. Nat Genet. 2008, 40 (10): 1253-1260. 10.1038/ng.237.View ArticlePubMed CentralPubMedGoogle Scholar
- Browning BL, Yu Z: Simultaneous genotype calling and haplotype phasing improves genotype accuracy and reduces false-positive associations for genome-wide association studies. Am J Hum Genet. 2009, 85 (6): 847-861. 10.1016/j.ajhg.2009.11.004.View ArticlePubMed CentralPubMedGoogle Scholar
- Goldstein JI, Crenshaw A, Carey J, Grant GB, Maguire J, Fromer M, O’Dushlaine C, Moran JL, Chambert K, Stevens C, Sklar P, Hultman CM, Purcell S, McCarroll SA, Sullivan PF, Daly MJ, Neale BM, Swedish Schizophrenia, Consortium: zCall: a rare variant caller for array-based genotyping: genetics and population analysis. Bioinformatics. 2012, 28 (19): 2543-2545. 10.1093/bioinformatics/bts479.View ArticlePubMed CentralPubMedGoogle Scholar
- Kampstra P: Beanplot: a boxplot alternative for visual comparison of distributions. J Stat Softw. 2008, 28: 1-9.View ArticleGoogle Scholar
- Kaufman L, Rousseeuw PJ: Finding Groups in Data: An Introduction to Cluster Analysis. 1990, New York: WileyView ArticleGoogle Scholar
- R Development Core Team: R: A Language and Environment for Statistical Computing. 2013, Vienna: R Foundation for Statistical Computing, [http://www.R-project.org]Google Scholar
- Gentleman RC, Carey VJ, Bates DM, Bolstad B, Dettling M, Dudoit S, Ellis B, Gautier L, Ge Y, Gentry J, Hornik K, Hothorn T, Huber W, Iacus S, Irizarry R, Leisch F, Li C, Maechler M, Rossini AJ, Sawitzki G, Smith C, Smyth G, Tierney L, Yang JY, Zhang J: Bioconductor: open software development for computational biology and bioinformatics. Genome Biol. 2004, 5 (10): 80-10.1186/gb-2004-5-10-r80.View ArticleGoogle Scholar
- Smith ML, Baggerly KA, Bengtsson H, Ritchie ME, Hansen KD: illuminaio: An open source IDAT parsing tool for Illumina microarrays. F1000Research. 2013, 2: 264-PubMed CentralPubMedGoogle Scholar
- Wang Z, Jacobs KB, Yeager M, Hutchinson A, Sampson J, Chatterjee N, Albanes D, Berndt SI, Chung CC, Diver WR, Gapstur SM, Teras LR, Haiman CA, Henderson BE, Stram D, Deng X, Hsing AW, Virtamo J, Eberle MA, Stone JL, Purdue MP, Taylor P, Tucker M, Chanock SJ: Improved imputation of common and uncommon SNPs with a new reference set. Nat Genet. 2011, 44 (1): 6-7. 10.1038/ng.1044.View ArticlePubMed CentralPubMedGoogle Scholar
- Miclaus K, Chierici M, Lambert C, Zhang L, Vega S, Hong H, Yin S, Furlanello C, Wolfinger R, FG: Variability in GWAS analysis: the impact of genotype calling algorithm inconsistencies. Pharmacogenomics J. 2010, 10 (4): 324-335. 10.1038/tpj.2010.46.View ArticlePubMedGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.