M^{3}S: a genotype calling method incorporating information from samples with known genotypes
 Gengxin Li^{1}Email author and
 Hongyu Zhao^{2}
Received: 13 April 2015
Accepted: 2 November 2015
Published: 3 December 2015
Abstract
Background
A key challenge in analyzing high throughput Single Nucleotide Polymorphism (SNP) arrays is the accurate inference of genotypes for SNPs with low minor allele frequencies. A number of calling algorithms have been developed to infer genotypes for common SNPs, but they are limited in their performance in calling rare SNPs. The existing algorithms can be broadly classified into three categories, including: populationbased methods, SNPbased methods, and a hybrid of the two approaches. Despite the relatively better performance of the hybrid approach, it is still challenging to analyze rare SNPs.
Results
We propose to utilize information from samples with known genotypes to develop a two stage genotyping procedure, namely M^{3}S, for rare SNP calling. This new approach can improve genotyping accuracy through clearly defining the boundaries of genotype clusters from samples with known genotypes, and enlarge the call rate by combining the simulated data based on the inferred genotype clusters information with the study population.
Conclusions
Applications to real data demonstrates that this new approach M^{3}S outperforms existing methods in calling rare SNPs.
Keywords
Background
Genomewide association studies (GWAS) have successfully identified tens of thousands of genetic variants contributing to hundreds of human diseases in the past decade [1, 2]. Their success is largely due to the availability of affordable and reliable SNP arrays, such as those from Affymetrix and Illumina [3, 4]. Computationally efficient and accurate genotyping algorithms are needed to infer genotypes of SNPs from the observed data produced by two platforms. Many calling algorithms have been developed for these two platforms, such as: Iluminus [5], GenoSNP [6], GenCall [7], CRLMM [8, 9], BEAGLECALL [10] and zCall [11] for Illumina arrays; and RLMM [12], BRLMM [13] and CHIAMO [14, 15] for Affymetrix GeneChips.
In this article, we focus on the analysis of Illumina arrays, which use two color single base extension (SBE) biochemistry technology [16] to infer the genotype of a SNP with two alleles A and B. The goal of genotype calling is to infer the genotype (AA, AB, or BB) carried by an individual across the SNPs in the genome. All the genotype calling algorithms share the common feature of first defining genotype clusters and then assigning individuals to these clusters. They differ in how the clusters are defined and how the samples are allocated to these clusters. One class of genotyping algorithms is populationbased where all individuals are genotyped SNPbySNP. The genotype clusters are defined through the joint analysis of all samples at a given SNP separately. Although commonly used, its performance highly depends on the size of the study population. It may not perform well for SNPs with low minor allele frequencies (MAF) where there may be very few individuals for certain clusters. For example, GenCall [7], a representative method in this class, needs a large reference population (e.g. 10,000) to accurately define genotype clusters for SNPs with MAF < 0.01 [6]. In contrast, another approach, called the SNPbased method, defines genotype clusters using all SNPs of an individual at a time, as represented by GenoSNP [6]. The good performance of this approach depends on two assumptions: (1) The probes of all the SNPs behave similarly; and (2) the variations within a genotype cluster are much smaller than that between clusters. Compared to the populationbased method, GenoSNP does not need a large number of samples to ensure calling accuracy for SNPs with low MAF. However, empirical applications of this approach lead to many more SNPs failing the HardyWeinberg (HW) principle, indicating that the two implicit assumptions are likely violated in reality.
To address the limitation of these two approaches, we developed M^{3} that combines the populationbased strategy with the SNPbased approach to improve calling accuracy for rare SNPs [17]. Compared to GenCall, M^{3} borrows genotype cluster information from reference SNPs to improve the calling performance for rare SNPs. It also improves upon GenoSNP by utilizing the populationbased calling scheme. However, the effectiveness of M^{3} depends on the quality of the reference SNP. If the reference SNP behaves very different from the target rare SNP, the inferred genotype will likely be incorrect. In this article, we consider using additional information to further improve the quality of the reference SNP. In particular, we consider the use of samples with known genotypes, e.g. HapMap samples, that are often included for quality control purposes. Because the HapMap samples have been extensively studied and their true genotypes can be considered known with almost certainty, the genotype calling results from these samples can provide a good metric for the performance of calling algorithms. Hence, we incorporate known genotype information from these quality control samples into the reference SNP selection procedure under the general framework for M^{3}, and this new method is named M^{3}S where S denotes samples with known genotypes for calling. More specifically, M^{3}S utilizes known genotype information to construct three genotype clusters in the first stage, and the entire samples together with simulated data based on the inferred cluster information are then genotyped in the second stage.
The key component in our improved method is to explicitly define the boundaries of the three genotype clusters for each SNP through samples with known genotypes. Although the idea of leveraging subjects with known genotype information is intuitive, there is a practical challenge in implementations, e.g., there is often only two or even one cluster for known genotype samples, making the inference of boundaries difficult. This can be solved by taking advantage of the reference SNP selection method [17] developed for M^{3} to the samples with known genotypes. It can directly define boundaries of genotype clusters before genotyping other study subjects. In addition, our proposed method is computationally efficient and applicable to the largescale intensity data (see Additional file 1B).
The rest of the paper is organized as follows. We first describe the two stages of our new method (M^{3}S), and then explain how this new method helps calling for rare SNPs. Finally we compare the performance of our method with existing methods to demonstrate its better performance.
Methods
Illumina chip data description
The Illumina microarrays probe millions of SNPs per sample, with newer arrays including more recently discovered rare SNPs. In their probe design, the Illumina arrays are made up of a number of beadpools containing millions of beadtypes. Every SNP is represented by one beadtype with 20 pairs of allelespecific intensities for one individual [16], thus each beadtype is able to assay both SNP alleles. In this study, we consider the pair of raw intensity values at each beadtype for every sample to infer geneotypes.
Statistical methods
Stage I: Estimation of three genotype clusters from samples with known genotypes

(1) Step I: randomly assign samples with known genotypes into two groups. One will be called the training set with l _{ a } samples, and the other is the testing set. Generally, the training samples are used to infer the parameters of genotype clusters, and the testing individuals are used to evaluate the performance of the method. In this study, we define different allocation ratios between the training samples and testing samples (ratio 1:1 or 2:1) to evaluate the impact of the allocation ratio on genotyping. For the data set that motivated this research, there are 141 samples with known genotypes. Under the allocation ratio 1:1, we consider 71 training samples, i.e. l _{ a }=71, and 70 testing samples. For the allocation ratio 2:1, 94 subjects are assigned to the training group, i.e. l _{ a }=94, and 47 individuals are in the testing set.

(2) Step II: evaluate the quality of each featured SNP via analyzing the training samples. When the training samples are randomly selected, the number of distinct genotypes and the sample size of each genotype can be inferred from known genotypes, then all the featured SNPs with training subjects (\(\boldsymbol {a}_{s}^{\ast }\): the raw intensity vector of training samples) can be classified into two groups, where G_{1} collects those featured SNPs having three distinct genotypes, whereas G_{2} collects those featured SNPs with fewer than three distinct genotypes.$${} \left\{ \begin{aligned} \boldsymbol{a}_{s}^{\ast} \in {G}_{2} &\quad\text{if} {c}_{as}<3 \text{or} {l}_{a_{0}s}<3 \text{or} {l}_{a_{1}s}<3 \text{or} {l}_{a_{2}s}<3\\ \boldsymbol{a}_{s}^{\ast} \in G_{1} &\quad\text{otherwise } \end{aligned}\right. $$(1)
where c _{ as } denotes the number of genotype clusters for training samples at the s ^{ t h } featured SNP; \(l_{a_{0}s}\), \(l_{a_{1}s}\) and \(l_{a_{2}s}\) denote the number of training subjects in the three genotype groups (AA, AB or BB) for the s ^{ t h } featured SNP.

(3) Step III: select reference featured SNPs for target featured SNP in G _{2}. The featured SNPs with training samples having three clear clusters are selected as candidate reference featured SNPs denoted by Ref featured SNPs. These Ref featured SNPs are searched from the featured SNPs near around the target featured SNP, and the size of search area denoted by R represents the total number of candidate Ref featured SNPs.

(4) Step IV: calculate the Mahalanobis distance between the target featured SNP with training samples and each Ref featured SNP containing the same training samples. Here, the Mahalanobis distance measures the similarity between the target featured SNP and each Ref featured SNP, where the optimal Ref featured SNP is selected based on minimizing this distance. To simplify the calculation, when the s ^{ t h } featured SNP is the target featured SNP, the two dimensional raw intensity vector \(a_{\textit {hs}}^{\ast }=(r_{\textit {hs}}, g_{\textit {hs}})\) is projected to a univariate variable b _{ hs } [5], and the s ^{ t h } featured SNP and all Ref featured SNPs are classified into three genotype clusters in terms of this univariate variable.$$\begin{aligned} & b_{hs}=\frac{r_{hs}g_{hs}}{r_{hs}+g_{hs}} \hspace{1.2em} \mathrm{s}=1, \ldots, \mathrm{S}\\ & b_{hr}=\frac{r_{hr}g_{hr}}{r_{hr}+g_{hr}} \hspace{1.2em} \mathrm{r}=1, \ldots, \mathit{R}\\ \end{aligned} $$The minimum Mahalanobis distance (d _{ s }) is obtained by the following equation,Note that S _{ R } is the set of Ref featured SNPs selected for the s ^{ t h } featured SNP; b _{ hs } and b _{ hr } are the projected intensities of the h ^{ t h } training sample for the s ^{ t h } featured SNP and the r ^{ t h } Ref featured SNP separately; and s _{ h } is the standard deviation of b _{ hs } and b _{ hr }. So, the best Ref featured SNP selected through d _{ s } may be most informative for the clustering of the s ^{ t h } featured SNP.$$d_{s}=\underset{r; r\in S_{R}}{min}\left\{\sqrt{\sum_{h=1}^{l_{a}}\frac{(b_{hs}b_{hr})^{2}}{{s_{h}^{2}}}} \right\}. $$

(5) Step V: estimate parameters of three genotype clusters from the training individuals. When \(\boldsymbol {a}_{s}^{\ast }\in G_{1}\), the parameters of 3 clusters are directly inferred from training samples of the s ^{ t h } featured SNP. If \(\boldsymbol {a}_{s}^{\ast } \in G_{2}\), the three genotype clusters are estimated by training samples from the s ^{ t h } featured SNP and the best Ref featured SNP (such as: the r ^{ t h } Ref featured SNP). In the second case, the training samples in the s ^{ t h } featured SNP are not adequate to construct the three genotype clusters, and an appropriate Ref featured SNP could help the s ^{ t h } featured SNP estimate three cluster information. A new combined matrix (\(\mathbf {aa}_{s}^{\ast }\)) is defined to collect the training samples for the s ^{ t h } featured SNP and the r ^{ t h } Ref featured SNP where \(\mathbf {a}_{s}^{\ast }\) and \(\mathbf {a}_{r}^{\ast }\) are the raw intensity matrices of the training samples at the s ^{ t h } and r ^{ t h } SNPs, respectively.Then the combined vector of raw intensities is partitioned into three clusters according to the training samples’ known genotypes. If k denotes the cluster label with values 1, 2, or 3, \(\mathbf {aa}_{\textit {sk}}^{\ast }\) is the combined raw intensity matrix in the k ^{ t h } genotype group. The mean and variance of the three genotype clusters can be estimated by the following equations,$$\mathbf{aa}_{s}^{\ast}=\left(\begin{aligned} \mathbf{a}_{s}^{\ast}\\ \mathbf{a}_{r}^{\ast} \end{aligned}\right) $$$$ \mu_{ask}=\left\{ \begin{aligned} & (l_{ask}+l_{ark})^{1}1_{G_{2}}^{T} \mathbf{aa}_{sk}^{\ast} \hspace{1.2em} \text{if} \boldsymbol{a}_{s}^{\ast} \in G_{2}\\ &{l_{ask}}^{1}1_{G_{1}}^{T} \mathbf{a}_{sk}^{\ast} \hspace{5.4em} \text{if} \boldsymbol{a}_{s}^{\ast} \in G_{1}\\ \end{aligned}\right. $$(2)where l _{ ask } and l _{ ark } denote the number of training samples at the k ^{ t h } cluster for the s ^{ t h } featured SNP and r ^{ t h } Ref featured SNP, respectively; \(1_{G_{2}}\) is an (l _{ ask }+l _{ ark }) x 1 column vector with all elements equal to 1; \(1_{G_{1}}\) is a l _{ ask } x 1 column vector with all elements equal to 1; and \(\mathbf {a}_{\textit {sk}}^{\ast }\) is the raw intensity matrix of training subjects in the k ^{ t h } genotype group for the s ^{ t h } featured SNP.$${} \mathbf{\Sigma}_{ask}=\left\{ \begin{aligned} & \frac{(\mathbf{aa}_{sk}^{\ast}1_{G_{2}}\mu_{ask})^{T}(\mathbf{aa}_{sk}^{\ast}1_{G_{2}}\mu_{ask})}{l_{ask}+l_{ark}1} \hspace{1.2em} \text{if} \mathbf{a}_{s}^{\ast} \in G_{2}\\ & \frac{(\mathbf{a}_{sk}^{\ast}1_{G_{1}}\mu_{ask})^{T}(\mathbf{a}_{sk}^{\ast}1_{G_{1}}\mu_{ask})}{l_{ask}1} \hspace{2.0em} \text{if} \mathbf{a}_{s}^{\ast} \in G_{1}\\ \end{aligned}\right. $$(3)
Note that μ _{ ask } is an 1 × 2 vector measuring the average intensity of training samples in the k ^{ t h } cluster for the s ^{ t h } featured SNP; Σ _{ ask } is a 2 × 2 covariance matrix of the k ^{ t h } cluster at the s ^{ t h } featured SNP.
In summary, the first stage focuses on selecting reference featured SNPs to better estimate the parameters of the three genotype clusters.
Stage II: Gaussian mixture model for augmented intensity data

(1) Step I: simulate intensity data according to the inferred parameters of the three genotype clusters from the training samples with known genotypes.$$ \begin{aligned} &\mathbf{y}_{js} \sim {Gaussian}_{k}(\mu_{ask}, \mathbf{\Sigma}_{ask}) \hspace{.5em}\text{with probability} \frac{1}{3} \\ &{j} = 1,\ldots,m, {s} = 1,\ldots,{S}, {k} = 1, 2, \text{or} 3 \end{aligned} $$(4)
In this study, we simulate m additional individuals at each SNP from the above Gaussian mixture distribution. Parameters μ _{ ask } and Σ _{ ask } (k = 1, 2, or 3) could adequately provide the center and variability of the three genotype clusters for each SNP, and each cluster contains equal number of simulated subjects (\(\frac {m}{3}\)). Here, we vary the value of m to be 600, 1500, and 3000 to evaluate the impact of this simulated data on genotyping. Specifically, we simulate equal numbers of subjects in every genotype group to improve the representation of rare genotypes in the samples for better genotype calling. More importantly, adding this simulated data with equal numbers in each genotype cluster to the observed data will not influence the configure of major homozygote and minor homozygote for the observed data.

(2) Step II: genotype calling using both observed and simulated data.
The pair of original raw intensities are x _{ is }=(r _{ is },g _{ is }), and the pair of simulated raw intensities are y _{ js }=(r _{ js },g _{ js }), then the combined raw intensity values at the s ^{ t h } SNP \({\mathbf {t}}_{s} = {{\mathbf {x}_{s}}\choose {\mathbf {y}}_{s}} \) consist of the augmented data. Within one SNP, pairs of raw intensities primarily consist of three genotype clusters which correspond to three genotypes (AA, AB, and BB). We apply the Gaussian Mixture Model (GMM) with fixed components [18], to t _{ s }, where the number of components is fixed at three. Besides, we introduce a null component for those individuals whose genotypes are difficult to be assigned to one of the three clusters. In principle, this model assigns the w ^{ t h } pair of the combined raw intensities t _{ ws } to one component with probability π _{ sk } where k measures three components corresponding to the three genotypes. The latent genotype class is denoted by the indicator variable z _{ ws } generated from a multinomial distribution where z _{ ws } takes the value of 1, 2, or 3. Then the threecomponent GMM can be expressed as:$$ \begin{aligned} &\mathbf{z}_{ws} \sim {Mult}_{3}(1,\mathbf{\pi}_{s1},\mathbf{\pi}_{s2},\mathbf{\pi}_{s3})\\ &\ell(\mathbf{t}_{s}\boldsymbol{\Theta}_{s},\mathbf{z}_{s})=\prod_{w=1}^{n^{\ast}}\prod_{k=1}^{3}\Phi({t}_{ws}\mathbf{\mu}_{sk},\mathbf{\Sigma}_{sk})^{I(\mathbf{z}_{ws}=k)}\\ &{\textit{w} = 1,\ldots,n^{\ast}, \textit{s} = 1,\ldots,\textit{S}, \textit{k} = 1, 2 \text{or} 3} \end{aligned} $$(5)where n ^{∗} is the total number of individuals collected at the s ^{ t h } SNP and simulated data where n ^{∗}=n+m, and S is the total number of SNPs. The normal density Φ has mean μ _{ sk } and variancecovariance matrix Σ _{ sk } in the k ^{ t h } cluster for the s ^{ t h } augmented SNP data; all pairs of raw intensity at the s ^{ t h } augmented data are measured by t _{ s }= (t _{1s }, \(t_{2s},\ldots, t_{n^{\ast }s})^{T}\phantom {\dot {i}\!}\); the unknown parameters of the GMM is denoted by Θ _{ s }= (π _{ s },μ _{ s },Σ _{ s }) where π _{ s }= (π _{ s1}, π _{ s2}, π _{ s3}), μ _{ s }=(μ _{ s1},μ _{ s2},μ _{ s3}), and Σ _{ s }=(Σ _{ s1},Σ _{ s2},Σ _{ s3}).
Through solving the score equation, the maximum likelihood estimates (MLEs) of the above parameters can be easily estimated [18]. The (u+1)^{ t h } iteration of the indicator variable z _{ ws }=k (k=1,2,or 3) is inferred by$$ f_{k}({t}_{ws};\boldsymbol{\Theta}_{s}^{u})=\frac{\pi_{sk}^{u}\Phi({t}_{ws};\mathbf{\mu}_{sk}^{u},\mathbf{\Sigma}_{sk}^{u})}{\sum_{o=1}^{3}\pi_{so}^{u}\Phi({t}_{ws};\mathbf{\mu}_{so}^{u},\mathbf{\Sigma}_{so}^{u})}. $$(6)The relevant iterative estimates for the mean μ _{ sk } and variancecovariance matrix Σ _{ sk } are$$ \mathbf{\mu}_{sk}^{u+1}=\frac{\sum_{w=1}^{n^{\ast}} f_{k}({t}_{ws};\boldsymbol{\Theta}_{s}^{u}){t}_{ws}}{\sum_{w=1}^{n^{\ast}}f_{k}({t}_{ws};\boldsymbol{\Theta}_{s}^{u})}. $$(7)$$ \mathbf{\Sigma}_{sk}^{u+1}=\frac{\sum_{w=1}^{n^{\ast}} f_{k}({t}_{ws};\boldsymbol{\Theta}_{s}^{u})(t_{ws}\mathbf{\mu}_{sk}^{u+1})({t}_{ws}\mathbf{\mu}_{sk}^{u+1})^{T}}{\sum_{w=1}^{n^{\ast}}f_{k}({t}_{ws};\boldsymbol{\Theta}_{s}^{u})}. $$(8)Two values, Posterior Rate (PR: \(q_{\textit {ws}}^{k}\)) and Average Posterior Rate (APR: q _{ s }), for the s ^{ t h } augmented data are calculated to quantify the quality of SNP calling [17], where$$ q_{ws}^{k}=\frac{P({t}_{ws}k)\pi_{sk}}{\sum_{o=1}^{3}P({t}_{ws}o)\pi_{so}}, \hspace{1em} q_{s}=\frac{\sum_{k=1}^{3}\sum_{w=1}^{n_{k}^{\ast}}q_{ws}^{k}}{\sum_{k=1}^{3}n_{k}^{\ast}} $$(9)It can be seen that PR, measures the strength of each observation’s cluster signal, and APR is the average value of all individuals’ PRs at the s ^{ t h } augmented data [17]. Note that \(n_{k}^{\ast }\) is the sample size in the k ^{ t h } cluster for the s ^{ t h } augmented SNP data. Genotypes can be inferred from the augmented data, including both observed and simulated subjects.
Results and discussion
Data set description and cutoffs setting
We analyzed Illumina Omni 1M array data collected from 3258 samples to compare the performances of M^{3}S with representative calling algorithms, including GenCall (a populationbased method), GenoSNP (a SNPbased approach), and M ^{3} (a hybrid of the previous two approaches). In this data set, 38 HapMap samples were measured multiple times, using a total of 141 arrays. We focused on SNPs from chromosome 22 with a total of 15,020 SNPs. The performance of each genotyping method was evaluated by the comparison results between SNP calls inferred by each calling method and those from the International HapMap Project for these HapMap samples [3]. We chose the following cutoffs for the four calling algorithms: GC score ≥ 0.15 in GenCall used to filter good quality SNPs, samples with posterior probability > 85 % for GenoSNP, and average posterior probability > 0.85 for both M ^{3} and M ^{3}S. The effects of different thresholds on genotyping are summarized in Additional file 1E.
Data analysis results
Comparisons of call rates and concordance on HapMap samples for two allocation designs
Design  Item  GenCall  GenoSNP  M^{3}  M^{3}S 

1:1  Call Rate  96.60  99.13  99.76  99.64 
Accuracy  96.41  98.47  99.23  99.33  
2:1  Call Rate  96.57  99.15  99.77  99.65 
Accuracy  96.39  98.49  99.24  99.38 
Comparisons of call rates and concordance on HapMap samples for rare variants under 600, 1500, and 3000 simulated observations and the allocation design 2:1
SNPs  # SNP  Item  M^{3}S  M^{3}S  M^{3}S  

600  1500  3000  
MAF < 0.1  4364  Call Rate  99.69  99.65  99.59  
Accuracy  99.33  99.24  99.22  
MAF < 0.05  2329  Call Rate  99.72  99.71  99.68  
Accuracy  99.34  99.26  99.28  
MAF < 0.01  597  Call Rate  99.59  99.86  99.81  
Accuracy  99.04  99.06  99.23 
Comparisons of call rates and concordance on HapMap samples for rare variants among GenCall, GenoSNP, M^{3} and M^{3}S
SNPs  # SNP  Item  GenCall  GenoSNP  M^{3}  M^{3}S 

MAF < 0.1  4364  Call Rate  95.89  99.02  99.70  99.59 
Accuracy  95.65  98.28  99.19  99.22  
MAF < 0.05  2329  Call Rate  96.44  98.89  99.64  99.68 
Accuracy  96.15  98.02  99.08  99.28  
MAF < 0.01  597  Call Rate  94.37  98.90  99.53  99.81 
Accuracy  93.89  97.28  98.60  99.23 
Comparisons of call rate and concordance of whole SNPs among GenCall, GenoSNP, M^{3} and M^{3}S
Algorithm 1  Algorithm 2  Call rate (%)  Concordance (%)  

Algorithm 1  Algorithm 2  
GenCall  GenoSNP  96.71  99.12  99.71 
GenCall  M^{3}  96.71  99.71  99.85 
GenCall  M^{3}S  96.71  99.56  99.85 
GenoSNP  M^{3}  99.12  99.71  99.41 
GenoSNP  M^{3}S  99.12  99.56  99.45 
M^{3}  M^{3}S  99.71  99.56  99.57 
Comparisons of HardyWeinberg Equilibrium Test among GenCall, GenoSNP, M^{3} and M^{3}S
Population  NumSample  GenCall  GenoSNP  M^{3}  M^{3}S  M^{3}S  M^{3}S 

600  1500  3000  
AA I  2005  224  907  432  481  646  822 
AA II  83  20  255  64  59  61  65 
EA I  867  486  1024  636  639  690  770 
EA II  158  40  348  109  98  106  129 
Discussion
M^{3}S was motivated from a data set with HapMap samples genotyped as part of the study, but many studies do not have these valuable samples collected. In this case, we may use pairs of raw intensity of HapMap subjects provided by the International HapMap Project [3, 4], but raw data in the Affymetrix data format cannot be directly applied to our program. Although some researchers have successively transformed the Affymetrix raw data into the Illumina raw data with high consistency [19–21], we are not sure the impact of transformation on the final genotyping result.
M^{3}S applies the reference SNP selection step to samples with known genotypes for improving the missing rate and call accuracy for rare SNPs, especially for SNPs with very low MAF. The successful application of M^{3}S is to select the appropriate Ref featured SNP from the whole genome. In practice, it is not practical to search the Ref featured SNPs from the entire genome due to plenty of tedious calculation involved, then the instrumental featured SNPs near the target featured SNP are picked out. The assumption about identical probe response of all SNPs allows each SNP to borrow information from other good quality SNPs. When some probes break this assumption, searching for the most optimal Ref featured SNP is still an open question. Besides, the reference SNP selection is based on maximizing the mathematical similarity between the target featured SNP and the Ref featured SNP. Because the probe intensity is highly correlated with 50 base probe sequence, incorporating the probe sequence information in the reference SNP selection procedure may greatly improve the quality of SNP calls.
The prerequisite for running M^{3}S is based on the collection of samples with known genotypes. However, some SNParray data may not contain individuals with known genotypes, which results in the restricted use in M^{3}S. Fortunately, M^{3}S is a supplement to our previous method M^{3} [17]. We strongly suggest scientists to use M^{3} method if their SNP array data do not contain any HapMap samples. If the data have samples with known genotypes, they could apply M^{3}S. Scientists could try these methods according to their requirements. Recently, a large amount of rare variants are widely captured in many SNP array data, some new powerful calling algorithms have been proposed for accurately calling rare SNPs, such as: optiCall [22] and zCall [11]. To better understand the effectiveness of various calling algorithms, we consider to summarize and compare the performances of multiple popular calling methods in our future study for providing an application guide.
Conclusion
Most genotyping approaches for microarray data are populationbased methods, e.g. GenCall with GenTrain, that infer genotype clusters from a large number of samples to achieve a certain call accuracy. Although it may work well for common SNPs, it may perform poorly for rare SNPs where genotype clusters cannot be reliably established unless a very large number of samples are available. A SNPbased method, GenoSNP, was designed to address this problem, but many more SNPs inferred from GenoSNP tend to fail the HWE test which indicates the violation of the strong underlying assumption for GenoSNP to succeed. Recently, we proposed M^{3} to combine the benefits of both populationbased and SNPbased strategies. Although M^{3} outperformed other methods, it is not able to use samples with known genotype information when they are available in a study, often as quality control samples. In this study, we propose M^{3}S to take advantage of genotype cluster information from the samples with known genotypes to further improve M^{3} on genotyping at rare SNPs. M ^{3}S is a twostage procedure where the reference SNP selection method is applied to samples with known genotypes at rare SNPs to estimate the parameters of the three genotype clusters, followed by simulating additional samples from the Gaussian mixture distribution, and fitting GMM to the augmented data for genotype calling. The superiority of this method rests on two aspects. First, samples with known genotypes help define the boundaries of three genotype clusters before finally genotyping. Second, adding simulated data in the original population greatly enlarges the sample size while genotyping. These two aspects help us improve genotyping technique for rare SNPs.
Declarations
Acknowledgements
This work was supported in part by the National Institutes of Health (R01 GM59507 and U01 HG005718) and the VA Cooperative Studies Program of the Department of Veterans Affairs, Office of Research and Development. We sincerely thank Dr. Joel Gelernter for providing this raw intensity data and his valuable advice.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
Authors’ Affiliations
References
 Sladek R, Rocheleau G, Rung J, Dina C, Shen L, Serre D, et al.A genomewide association study identifies novel risk loci for type 2 diabetes. Nature. 2007; 445:881–5.View ArticlePubMedGoogle Scholar
 The Wellcome Trust Case Control Consortium. Genomewide association study of 14 000 cases of seven common diseases and 3000 shared controls. Nature. 2007; 447:661–78.View ArticlePubMed CentralGoogle Scholar
 The International HapMap Consortium. A second generation human haplotype map of over 3.1 million SNPs. Nature. 2007; 449:851–61.View ArticlePubMed CentralGoogle Scholar
 Reich DE, Gabriel SB, Altshuler D. Quality and completeness of SNP databases. Nat Genet. 2003; 33:457–8.View ArticlePubMedGoogle Scholar
 Teo YY, Inouye M, Small KS, Gwilliam R, Deloukas P, Kwiatkowski DP, et al.A genotype calling algorithm for the Illumina BeadArray platform. Bioinforma. 2007; 23:2741–6.View ArticleGoogle Scholar
 Giannoulatou E, Yau C, Colella S, Ragoussis J, Holmes CC. GenoSNP: a variational Bayes withinsample SNP genotyping algorithm that does not require a reference population. Bioinforma. 2008; 24(19):2209–14.View ArticleGoogle Scholar
 Illumina Inc. Illumina GenCall Data Analysis Software. TECHNOLOGY SPOTLIGHT. 2005. http://www.illumina.com/Documents/products/technotes/technote_gencall_data_analysis_software.pdf.
 Carvalho B, Bengtsson H, Speed TP, Irizarry RA. Exploration, normalization, and genotype calls of highdensity oligonucleotide SNP array data. Biostatistics. 2007; 8:485–99.View ArticlePubMedGoogle Scholar
 Ritchie ME, Carvalho BS, Hetrick KN, Tavare S, Irizarry RA. R/Bioconductor software for Illumina’s Infinium wholegenome genotyping BeadChips. Bioinformatics. 2009; 25(19):2621–3.View ArticlePubMedPubMed CentralGoogle Scholar
 Browning BL, Yu Z. Simultaneous genotype calling and haplotype phasing improves genotype accuracy and reduces falsepositive associations for genomewide association studies. Am J Hum Genet. 2009; 85(6):847–61.View ArticlePubMedPubMed CentralGoogle Scholar
 Sklar P, Hultman CM, Purcell S, Goldstein JI, Crenshaw A, Carey J, et al.Swedish Schizophrenia Consortium, ARRA Autism Sequencing Consortium. zCall: a rare variant caller for arraybased genotyping: Genetics and population analysis. Bioinformatics. 2012; 28(19):2543–5.View ArticlePubMedPubMed CentralGoogle Scholar
 Rabbee N, Speed TP. A genotype calling algorithm for Affymetrix SNP arrays. Bioinforma. 2005; 22:7–12.View ArticleGoogle Scholar
 AFFYMETRIX. BRLMM: an improved genotype calling method for the GeneChip Human Mapping 500K Array Set. Technical Report, White Paper. Santa Clara, CA: Affymetrix Inc; 2006. http://www.affymetrix.com/support/technical/whitepapers/brlmm_whitepaper.pdf.Google Scholar
 Chierici M, Miclaus K, Vega S, Furlanello C. An interactive effect of batch size and composition contributes to discordant results inGWAS with the CHIAMO genotyping algorithm. Pharmacogenomics J. 2010; 10:355–63.View ArticlePubMedGoogle Scholar
 Marchini J, Howie B, Myers S, McVean G, Donnelly P. A new multipoint method for genomewide association studies by imputation of genotypes. Nat Genet. 2007; 39:906–13.View ArticlePubMedGoogle Scholar
 Steemers FJ, Chang W, Lee G, Barker DL, Shen R, Gunderson KL. Wholegenome genotyping with the singlebase extension assay. Nat Methods. 2006; 3(1):31–3.View ArticlePubMedGoogle Scholar
 Li GX, Gelernter J, Kranzler HR, Zhao HY. M^{3}: an improved SNP calling algorithm for Illumina BeadArray data. Bioinformatics. 2012; 28(3):358–65.View ArticlePubMedGoogle Scholar
 McLachlan GJ, Peel D. Finite Mixture Models. New York: Wiley; 2000. http://www.wiley.com/WileyCDA/WileyTitle/productCd0471006262.html.View ArticleGoogle Scholar
 Jiang L, Willner D, Danoy P, Xu HJ, Brown MA. Comparison of the Performance of Two Commercial GenomeWide Association Study Genotyping Platforms in Han Chinese Samples. G3:GenesGenomesGenetics. 2013; 3(1):23–9.View ArticlePubMedPubMed CentralGoogle Scholar
 Laurie CC, Doheny KF, Mirel DB, Pugh EW, Bierut LJ, Bhangale T, et al.Quality control and quality assurance in genotypic data for genomewide association studies. Genet Epidemiol. 2010; 34:591–602.View ArticlePubMedPubMed CentralGoogle Scholar
 Hong H, Xu L, Liu J, Jones WD, Su Z, Ning B, et al.Technical Reproducibility of Genotyping SNP Arrays Used in GenomeWide Association Studies. PLoS ONE. 2012; 7(9):e44483. doi:10.1371/journal.pone.0044483.View ArticlePubMedPubMed CentralGoogle Scholar
 Shah TS, Liu JZ, Floyd JA, Morris JA, Wirth N, Barrett JC, et al.optiCall: A robust genotypecalling algorithm for rare, low frequency and common variants. Bioinformatics. 2012; 12:68.Google Scholar