Assessing batch effects of genotype calling algorithm BRLMM for the Affymetrix GeneChip Human Mapping 500 K array set using 270 HapMap samples

Background Genome-wide association studies (GWAS) aim to identify genetic variants (usually single nucleotide polymorphisms [SNPs]) across the entire human genome that are associated with phenotypic traits such as disease status and drug response. Highly accurate and reproducible genotype calling are paramount since errors introduced by calling algorithms can lead to inflation of false associations between genotype and phenotype. Most genotype calling algorithms currently used for GWAS are based on multiple arrays. Because hundreds of gigabytes (GB) of raw data are generated from a GWAS, the samples are typically partitioned into batches containing subsets of the entire dataset for genotype calling. High call rates and accuracies have been achieved. However, the effects of batch size (i.e., number of chips analyzed together) and of batch composition (i.e., the choice of chips in a batch) on call rate and accuracy as well as the propagation of the effects into significantly associated SNPs identified have not been investigated. In this paper, we analyzed both the batch size and batch composition for effects on the genotype calling algorithm BRLMM using raw data of 270 HapMap samples analyzed with the Affymetrix Human Mapping 500 K array set. Results Using data from 270 HapMap samples interrogated with the Affymetrix Human Mapping 500 K array set, three different batch sizes and three different batch compositions were used for genotyping using the BRLMM algorithm. Comparative analysis of the calling results and the corresponding lists of significant SNPs identified through association analysis revealed that both batch size and composition affected genotype calling results and significantly associated SNPs. Batch size and batch composition effects were more severe on samples and SNPs with lower call rates than ones with higher call rates, and on heterozygous genotype calls compared to homozygous genotype calls. Conclusion Batch size and composition affect the genotype calling results in GWAS using BRLMM. The larger the differences in batch sizes, the larger the effect. The more homogenous the samples in the batches, the more consistent the genotype calls. The inconsistency propagates to the lists of significantly associated SNPs identified in downstream association analysis. Thus, uniform and large batch sizes should be used to make genotype calls for GWAS. In addition, samples of high homogeneity should be placed into the same batch.

size and batch composition effects were more severe on samples and SNPs with lower call rates than ones with higher call rates, and on heterozygous genotype calls compared to homozygous genotype calls.
Conclusion: Batch size and composition affect the genotype calling results in GWAS using BRLMM. The larger the differences in batch sizes, the larger the effect. The more homogenous the samples in the batches, the more consistent the genotype calls. The inconsistency propagates to the lists of significantly associated SNPs identified in downstream association analysis. Thus, uniform and large batch sizes should be used to make genotype calls for GWAS. In addition, samples of high homogeneity should be placed into the same batch.

Background
Genome-wide association studies (GWAS) aim to identify genetic variants of single nucleotide polymorphisms (SNPs) across the entire human genome that are associated with phenotypic traits, such as disease status and drug response. The International HapMap project determined genotypes of over 3.1 million common SNPs in human populations and computationally assembled them into a genome-wide map of SNP-tagged haplotypes [1,2]. Concurrently, high-throughput SNP genotyping technology advanced to enable simultaneous genotyping of hundreds of thousands of SNPs. These advances combine to make GWAS a feasible and a promising research field for associating genotypes with various disease susceptibilities and health outcomes. Recently, GWAS was successfully applied to identify common genetic variants associated with a variety of phenotypes . Many of these studies used the Affymetrix GeneChip Human Mapping 500 K array set [5,6,11]. The genomic DNA for one of the arrays is cleaved with the Nsp I restriction enzyme and ~262,000 SNPs are interrogated. The second chip uses Sty I -cleaved genomic DNA and ~238,000 SNPs are analyzed. Genotypes from Affymetrix GeneChip Human Mapping 500 K array set data are usually determined by the calling algorithm BRLMM [32] embedded in Affymetrix software packages. Algorithms developed by other laboratories such as PLASQ [33], GEL [34], CRLMM [35], SNiPer-HD [36], MAMS [37], and CHIAMO [11] are also utilized.
The MPAM algorithm was developed for analysis of raw data (i.e., the CEL files) from the first generation of Affymetrix Mapping 10 K array and is based on clustering of chips for each SNP by modified partitioning around medoids [38]. MPAM was error prone for SNPs with missing genotype groups or low minor allele frequency, a problem more pronounced on the second generation of Affymetrix Mapping 100 K array. This prompted Affymetrix to develop a new dynamic model based calling algorithm called DM for Mapping 100 K array data [39]. DM is a single-chip calling algorithm and usually calls genotypes with high overall call rate and accuracy. However, the algorithm exhibited a higher misclassification rate for heterozygous genotypes than for homozygous genotypes. To improve data analyses for genotyping arrays, the multichip genotype calling algorithm RLMM was developed. RLMM is based on a robustly fitted, linear model that employs Mahalanobis distance for classification [40]. RLMM achieved a higher call rate than DM. With the release of the Mapping 500 K SNP array set, Affymetrix extended the RLMM model to BRLMM by adding a Bayesian step that provided improved estimates of cluster centers and variances. The DM and GEL algorithms operate on a single chip, while all others use multiple chips to call genotypes.
High call rate and accuracy of genotype calling are important and essential issues for success of GWAS, since errors introduced in the genotypes by calling algorithms can inflate false associations and may lose true associations between genotype and phenotype. Each of the algorithms was reported to have a high successful call rate and accuracy, or more precisely, high concordance with genotypes determined by the International HapMap Consortium on the HapMap samples. With the exception of DM and GEL, the algorithms require data from multiple chips (i.e., a batch) to make genotype calls. A GWAS usually involves analyses of thousands of samples that generate thousands of raw data files (i.e., CEL files). The raw data file for one sample (two CEL files for Affymetrix Mapping 500 K array set: one from Nsp-digested genomic DNA and one from Sty-digested DNA) is about 130 MB in size. Computer memory (RAM) limits make it unfeasible to analyze all CEL files in a GWAS in one single batch on a single computer. The samples are, therefore, divided into many batches for genotype calling. Affymetrix suggests 40 to 96 CEL files for a batch for the BRLMM method. To date, the effects on genotype calls caused (potentially) by changing the number and specific combinations of CEL files in batches and propagation of the effects to the downstream association analysis have not been investigated.
Since BRLMM is recommended by Affymetrix, we analyzed the effect of batch size and composition on the ability of the BRLMM algorithm to consistently call the 270 samples from the International HapMap project.

Batch size effect
Batch size effect was assessed by comparing the genotypes called from BS1, BS2, and BS3 (see Methods) for call rate and concordance. The overall call rates, defined as the proportion of successful calls to the total number of calls (successful calls plus missing calls) for BS1, BS2, and BS3 were 99.48%, 99.50%, and 99.49%, respectively. However, overall call rates are not informative enough to assess the distribution of missed calls on the chip. Batch size effect on genotype calling rates are best compared using one-against-one comparisons of distributions of call rates on individual samples and SNPs. These distributions were calculated from data of samples and SNPs generated from the calling results of the experiments with three batch sizes (BS1, BS2, and BS3).
The comparison of call rates of samples using MA-like plots is shown in Figure 1. The average call rate of two genotype calling results (x-axis) from experiments with two different batch sizes were plotted against the difference of call rates between the two experiments (large batch sizesmall batch size; y-axis). The horizontal dotted lines at y = 0 represent the expected locations of samples if the missing calls on each sample were exactly the same in the two experiments. Data points above this line are the samples having fewer missing calls (i.e., higher call rate) in the experiment with the larger batch size than in the experiment with the smaller batch size. Data points beneath this line indicate samples having fewer missing calls in the experiment with smaller batch size than in the experiment with the larger batch size. The perpendicular distance from a data point to this line is the difference in call rate of a sample between the two experiments. Figure 1A compares the results of BS1 with BS2; 1B compares the results of BS1 with BS3; and 1C compares the results of BS2 with BS3. Data points at lower average call rates are more distant from the calculated equivalent call rate (dotted line) than the data points at higher average call rates. Thus, batch size affected lower call rates more severely than higher call rates. Furthermore, data points in Figure 1B (BS1 versus BS3) are farther away from the dotted line when compared with the data points in Figure 1A (BS1 versus BS2), which, in turn, were farther away from the dotted line when compared with Figure 1C (BS2 versus BS3). The values of (see Methods) were 0.0304, 0.0416, and 0.0257 for comparisons shown in Figure 1A, B, and 1C, respectively, that are related to the corresponding differences of batch sizes of the compared experiments, 45 (90 -45), 60 (90 -30), and 15 (45 -30). The p-values for comparisons in Figure 1A, B, and 1C are 1.736 × 10 -6 , 0.0296, and 0.0116, respectively, indicating that call rates on samples between calling batch sizes are statistically different.
The comparisons of the call rates for individual SNPs are depicted by MA-like plots in Figure 2. Figure 2A compares the results of BS1 with BS2; 2B compares the results of BS1 with BS3; and 2C compares the results of BS2 with BS3. The trend is similar to that observed in Figure 1 that batch size affected lower call rates more severely than higher call rates for individual SNPs. The values were calculated to be 0.1563, 0.1982, and 0.1467 for the comparisons shown in Figure 2A, B, and 2C, respectively. They were positively correlated with the differences of batch sizes of the compared experiments, 45, 60, and 15, respectively. The p-values for comparisons in Figure 2A, B, and 2C are 2.2 × 10 -16 , indicating that the difference of call rates on SNPs between calling batch sizes are statistically significant.
Comparing call rates in experiments with different batch sizes can only assess the batch size effect on missing calls. Since three genotypes (homozygote, heterozygote, and variant homozygote) are possible for a genotype call, we determined the effect of batch size on the ability to consistently call the genotype. To evaluate the batch size effect on successful calls, concordance of successful genotype calls between experiments with different batch sizes was analyzed (Table 1). Batch size affected successful genotype calls since the concordances were not 100% and heterozygous genotype concordances were more affected than homozygous genotype concordances. The largest difference in batch size (60, BS1 versus BS3) led to the lowest concordances (99.986% overall concordance). However, the concordances for BS2 versus BS3 were slightly lower than for BS1 versus BS2, even though the difference of batch sizes for BS2 versus BS3 (45 -30 = 15) is smaller than that for BS1 versus BS2 (90 -45 = 45). This result is likely due to the relatively large difference in the number of arrays in the batch (BS1 = 90 arrays and BS3 = 30 arrays). High concordance of genotype calls depends on the difference between batch sizes as well as the actual batch sizes themselves.

Batch composition effect
The overall call rate based on all CEL files of the 270 Hap-Map samples for BC1, BC2, and BC3 (see Methods) were 99.48%, 99.43%, and 99.41%, respectively. The genetic homogeneity of the batches in BC1 (samples from 1 population group) is higher than that of BC2 (samples from 2 population groups) which, in turn, is higher than that of BC3 (samples from 3 population groups). The batch sizes were the same for all of the three experiments. Thus, The empty circles depict 500,568 SNPs. The x-axes represent average call rates of individual SNPs in two experiments with different batch sizes. The horizontal dotted lines indicate the expected locations of SNPs where the call rates in the two compared experiments were exactly same. A: Comparison between BS1 and BS2. The y-axis represents call rate in BS1 -call rate in BS2. B: Comparison between BS1 and BS3. The y-axis represents call rate in BS1 -call rate in BS3. C: Comparison between BS2 and BS3. The y-axis represents call rate in BS2 -call rate in BS3.
higher call rates were obtained when genotype calling was conducted with samples of higher genetic homogeneity. The effect of batch homogeneity was relatively minor by this measure. Because the distribution of missing calls on samples and SNPs was more informative for assessing batch effect in our first experiments (BS studies), we examined the distribution of call rates in the BC experiments.
The comparisons of call rates on samples are depicted by MA-like plots ( Figure 3). Figure 3A compares the results of BC1 with BC2; 3B compares the results of BC1 with BC3; and 3C compares the results of BC2 with BC3. It can be seen that most of the data points are above the dotted lines, indicating fewer missing genotypes (i.e., higher call rate) when samples in batches are of higher genetic homogeneity. Batch composition had a larger effect when the call rate was lower. Moreover, the level of batch composition effects was related to differences in the genetic homogeneity of samples in the compared batch compositions. The comparisons of call rates on SNPs for BC1 versus BC2, BC1 versus BC3, and BC2 versus BC3 are shown in Figure  4A, B, and 4C, respectively. Data points at lower average call rate were farther away from the dotted line than the data points at higher average call rate; that is, batch composition affected SNPs with lower call rates more severely than SNPs with higher call rates. Furthermore, more SNPs are above rather than below the calculated equivalent call rates (dotted line) indicating fewer missing genotypes per SNP (i.e., higher call rate) when samples in calling batches are of higher genetic homogeneity. Moreover, it was further confirmed that the level of batch composition effects was related to differences in genetic homogeneity of samples in the compared batch compositions. The values are 0.2046, 0.2384, and 0.1749 for comparisons shown in Figure 4A, B, and 4C, respectively, that are related to the corresponding GH differences between the compared experiments: 0.5, 0.67, and 0.17. The p-values for all comparisons are 2.2 × 10 -16 , confirming that the call rates on SNPs between calling batch compositions are statistically different.
To evaluate batch composition effect on successful genotype calls, concordance of successful genotype calls between experiments with different batch compositions was analyzed (Table 2). Batch composition not only affected the genotype calls but was more pronounced at heterozygous genotypes compared with homozygous genotypes, since the concordance for heterozygous genotype calls were lower than the corresponding concordance for homozygous genotype calls. Moreover, the concordance of successful genotype calls between the compared batch compositions were negatively related to genetic homogeneity differences between the batch compositions. For example, overall concordances were 99.986%, 99.980%, and 99.991% for BC1 versus BC2, BC1 versus BC3, and BC2 versus BC3, respectively. These are in opposite order of the GH differences of the compared experiments, that is, 0.5, 0.67, and 0.17 for BC1 versus BC2, BC1 versus BC3, and BC2 versus BC3, respectively.

Propagation of batch effect to significantly associated SNPs
The objective of a GWAS is to identify the genetic markers associated with a specific phenotypic trait. It is critical to assess whether and how the batch effect propagates to the significant SNPs identified in the downstream association analysis. Three case-control based association analyses were conducted for each of the calling results with different batch sizes and compositions to assess the propagation of batch effect in genotype calling to the significantly associated SNPs (see Methods). Histograms of QC confidence scores of Affymetrix Human Mapping 500 K Array Set CEL files of 270 HapMap samples After removal of low quality SNPs by quality control assessment, each of the three population groups (European, Asian, and African) was set as "case" while the other two groups were set as "control". Association analyses were conducted to identify SNPs that can differentiate the "case" group from the "control" group. Different lists of SNPs significantly associated with a same population group, identified using the genotype calling results with different batch sizes and compositions, were compared using Venn diagram.
The comparisons of the significantly associated SNPs obtained from calling results with different batch sizes are given in Figure 6. The significantly associated SNPs from BS1 are in black circles, from BS2 in blue circles, and from BS3 in red circles. Number of significantly associated SNPs common in all three batch sizes is in brown, shared only by two batch sizes in green. The association analyses results for European versus others are depicted in Figure  6A, for African versus others in 6B, and for Asian versus others in 6C.
It is clear that the batch size effect on genotype calling propagated into the downstream association analyses. Moreover, it was observed that the larger the differences between two batch sizes, the fewer the significantly associated SNPs shared by the two batch sizes. For example, there were 471, 370, and 217 significantly associated SNPs shared only by BS2 and BS3, by BS1 and BS2, and by BS1 and BS3 for the association analyses with European as "case", respectively, that are negatively related to the corresponding differences of batch sizes: 15, 45, and 60. Same trends were observed for the association analyses with African as "case" and with Asian as "case". Figure 7 compares the lists of significantly associated SNPs obtained using the genotypes called by the three batch compositions. The significantly associated SNPs from BC1 are in black circles, from BC2 in blue circles, and from BC3 in red circles. Number of significantly associated SNPs common in all three compositions is in brown, shared only by two compositions in green. Association analyses results for European versus others are depicted in Figure 7A, for African versus others in 7B, and for Asian versus others in 7C.
The Venn diagrams demonstrated that for a same "casecontrol" setting different lists of significantly associated SNPs were identified by the same statistical test (Chi 2 test) using the genotype calling results from different batch compositions. Therefore, the batch composition effect on genotype calling propagated to the significantly associated SNPs. Moreover, it was observed that the larger the difference of genetic homogeneity between two batch compositions, the fewer the significantly associated SNPs shared by the two batch compositions. For example, there were 555, 512, and 229 significantly associated SNPs shared only by BC2 and BC3, by BC1 and BC2, and by BC1 and BC3, respectively, for the association analyses with European as "case". The numbers are negatively Venn diagrams for comparisons of the significantly associated SNPs identified using the genotype calling results with different calling batch sizes Same trends were observed for the association analyses with African as "case" and with Asian as "case".

Discussion
GWAS is increasingly used to identify loci containing genetic variants associated with common diseases and drug responses. The number of SNPs interrogated in a GWAS has grown from thousands to millions; for example, the newest Affymetrix SNPs array 6.0 contains ~2 million probe sets. At the same time, the allele frequency difference of disease-associated or drug-associated SNPs is usually very small. Therefore, a very small error introduced in genotypes by genotype calling algorithms may result in inflated false associations between genotype and phenotype in the downstream association analysis. Reproducibility and robustness are as important to genotype calling as is the accuracy and call rate that are usually used to evaluate performance of genotype calling algorithms. As most genotype calling algorithms are based on multiple chips, and genotype calling for a GWAS is usually conducted in many batches, reproducibility and robustness of multi-chip calling algorithms under different batch sizes and compositions are important variables. Statistical tests of these parameters would increase the confidence for associated SNPs identified in downstream association analysis.
A heterozygous genotype carries a rare allele. Therefore, the robustness of calling heterozygous reduces false positive associations and the chance of missing true associations. Our studies revealed that both batch size and composition affected genotype calling results, especially for heterozygous genotype calling. It was also demonstrated that batch effect propagates to the downstream association analysis. Genotype calling algorithms that eliminate or reduce batch effects but maintain high call rates and accuracy are preferred for GWAS.
BRLMM first derives an initial guess for each SNP's genotype using the DM algorithm and then analyzes across SNPs to identify cases of non-monomorphism. This subset of non-monomorphism SNPs is then used to estimate a prior distribution on cluster centers and variance-covariance matrices. This subset of SNP genotypes is revisited and the clusters and variances of the initial genotype guesses are combined with the prior information of the SNP in an ad-hoc Bayesian procedure to derive a posterior estimate of cluster centers and variances. All SNPs in a chip are called according to their Mahalanobis distances from the three cluster centers and confidence scores are assigned to the calls. With default settings, BRLMM randomly picks 10,000 SNPs to estimate cluster centers and variances. But the number of non-monomorphism SNPs used to estimate the prior distribution on cluster centers and variance-covariance matrices varies with changing Venn diagrams for comparisons of the significantly associated SNPs identified using the genotype calling results with different calling batch compositions

Conclusion
As demonstrated above, both batch size and batch composition affect genotype calling results of GWAS using the BRLMM algorithm. The larger the difference of batch sizes, the larger the effect. When the samples in the calling batches are more homogenous, more concordant genotypes are called. Batch effect propagates to the downstream association analysis and makes the significantly associated SNPs identified inconsistent. Therefore, we suggest from our studies that the same or larger batch sizes should be used to make genotype calls for GWAS and homogenous samples should be put into the same batches.

Raw data
The raw data (CEL files) from the Affymetrix GeneChip

Quality of the raw data
The quality of the raw data from the Affymetrix Human Mapping 500 K array set was assessed using DM [39] before genotype calling by BRLMM. DM is a single array based algorithm; it processes one CEL file at a time in a multiple CEL file batch and statistically assesses experimental qualities with a numerical score between 0 and 100. A high QC (quality control) number means high quality of the experiment (CEL file).

Genotype calling by BRLMM
All experiments of genotype calling by BRLMM reported in this paper were conducted using apt-probeset-genotype of Affymetrix Power Tools 1.8.5. Affymetrix Power Tools (APT) contains a set of cross-platform command line programs that implement algorithms for analyzing and working with Affymetrix GeneChip ® arrays. These programs are available on the Affymetrix website http://www.affyme trix.com/support/developer/powertools/index.affx. APT programs are intended for "power users" who prefer programs that can be utilized in scripting environments and are sophisticated enough to handle the complexity of extra features and functionality. The function of aptprobeset-genotype in APT is an application for making genotype calls using SNP Arrays (100 K, 500 K, Genome-Wide SNP Arrays 5.0 and 6.0). BRLMM is one of the genotype calling algorithms implemented in this function, and enables many parameters to be changed by a user. For the studies reported here, all the parameters, except as noted in the narrative were set to the default values recommended by Affymetrix. The chip description files (cdf) for both Nsp and Sty chips of the Mapping 500 K array set, as well as files for defining SNPs on chromosome X, were also used before genotype calling. They were downloaded from Affymetrix website. Nsp and Sty CEL files were genotype-called separately.

Comparing genotype calling results
In each of the experiments reported here, the genotype calling results by BRLMM from different calling batches were first merged using a set of in-house programs written in C++. When merging the calling results, genotypes of SNPs in Nsp and Sty chips of the same samples were merged followed by assembling together all genotypes of all of the 270 HapMap samples. Thereafter, overall call rates for each of the experiments, call rates of individual samples and SNPs in each of the experiments, and concordant calls between experiments were calculated and exported as tab-delimited text files using the in-house programs written in C++. Comparison of calling results was done using the R package.
Paired two samples t-test in R package (t.test) was used to statistically test the alternative hypothesis that call rates on samples or SNPs between two calling experiments are different.
To quantify batch effect, average absolute differences in call rates were calculated for the comparisons using formula (1).
where and are call rates of experiments 1 and 2 of sample i or SNP i, respectively; N is the total number of samples (in this case, 270) or SNPs (in this case, 500,668 which includes 50 QC probe sets in both Nsp and Sty chips).

Association analysis
In order to study the propagation of batch effect to the significantly associated SNPs, all genotype calling results of the raw data of 270 HapMap samples using BRLMM with different batch sizes and compositions were analyzed using Chi 2 statistics test for associations between the SNPs and the case-control settings.
Prior to association analysis, quality control (QC) of the calling results was conducted to remove markers and samples with low quality. For each of the calling results, call rate of 90% was used to remove SNPs and samples. Minor allele frequency was used to filter SNPs and its cut-off was set to 0.01. Departure from Hardy-Weinberg equilibrium (HWE) was check for all SNPs. The p-value of Chi 2 test for Hardy-Weinberg equilibrium was calculated for all SNPs at first and then the p-values were adjusted for multiple tests using Benjamini and Hochberg false discovery rate (FDR) [41]. FDR of 0.01 was set as the cut-off for HWE test. There were no samples removed because of low quality. 54942 (10.97%) to 55496 (11.084%) SNPs were removed in the QC, mainly because of departure from HWE.
To mimic "case-control" in GWAS, for each of the genotype calling results, each of the three population groups (European, African, and Asian) was assigned as "case" while the other two as "control" to form a data set for association analysis for identifying the SNPs significantly associated with the "case" population group.
In the association analyses, a 2 × 3 contingency table was generated for each SNP and a case-control setting. Then Chi 2 statistics test was applied on the contingency table to calculate a p-value for measuring the statistical significance of the association between the testing SNP and the corresponding case-control setting. After raw p-values for CR i 1 CR 1 2