Genotyping and inflated type I error rate in genome-wide association case/control studies

Background One common goal of a case/control genome wide association study (GWAS) is to find SNPs associated with a disease. Traditionally, the first step in such studies is to assign a genotype to each SNP in each subject, based on a statistic summarizing fluorescence measurements. When the distributions of the summary statistics are not well separated by genotype, the act of genotype assignment can lead to more potential problems than acknowledged by the literature. Results Specifically, we show that the proportions of each called genotype need not equal the true proportions in the population, even as the number of subjects grows infinitely large. The called genotypes for two subjects need not be independent, even when their true genotypes are independent. Consequently, p-values from tests of association can be anti-conservative, even when the distributions of the summary statistic for the cases and controls are identical. To address these problems, we propose two new tests designed to reduce the inflation in the type I error rate caused by these problems. The first algorithm, logiCALL, measures call quality by fully exploring the likelihood profile of intensity measurements, and the second algorithm avoids genotyping by using a likelihood ratio statistic. Conclusion Genotyping can introduce avoidable false positives in GWAS.


Background
One common goal of a genome wide association (GWAS) study is to search the entire genome for single nucleotide polymorphisms (SNPs) and copy number variations (CNVs) associated with a disease or some other phenotype. In this article, we focus our analysis on SNPs. The two possible alleles at a SNP are arbitrarily labeled A and B, and association is often tested by measuring and comparing the frequencies of the genotypes AA, AB, and BB, in case and control groups. As technology currently allows close to one million SNPs to be examined simultaneously, there is a need for fast, automated methods to test for association. As only a small minority of SNPs are expected to be associated with the disease, even a modest false positive rate could bury true associations beneath those occurring by chance.
Using Affymetrix 500 k GeneChips as an example, each of the 500,000+ SNPs is represented by a series of probes on a pair of arrays. Each probe is an oligonucleotide designed to bind to either the A or B allele. A subject's fluorescently labeled DNA is allowed to hybridize with these probes, and then a spectrometer measures the relative fluorescent levels between the A and B probes. Each genotyping algo-rithm (see Methods) summarizes the fluorescence information, or the likelihood a subject has allele A at that SNP, by its own statistic. In any population, these statistics usually cluster into three groups, corresponding to the three genotypes. Studies have noted that 1) The mean and variance of these clusters, or the shape of their distribution in general, varies by SNP and 2) For a single SNP, because of differences in processing or duration of storage, the shape of the statistic's distribution can differ between the case and control groups [1][2][3].
The current test of association requires calling, or assigning a genotype, to each SNP for each study subject and then comparing the called genotypes between the case and control groups. The majority of SNPs are easy to call and any of the available methods will call them correctly. Unfortunately, there is a difficult minority that cannot be easily clustered into three distinct groups. Because there can be as many as 500,000 SNPs, this minority can greatly inflate the type I error rate and cause the large, characteristic, deviations from the x = y line in the qq-plots of test statistics. Most studies assume these consequences are the unavoidable results of population substructure and poor data. In this paper, we dispel the myth that these are the sole issues. In fact, the inflated type I error rate and general misbehavior of the test statistic may also result from the act of genotype assignment and a poor choice of statistical methodology.
The goal of this manuscript is two-fold. First, our primary goal is to show that genotyping overlapping clusters can lead to potential problems that we have yet to see fully acknowledged in the literature. The proportions of each called genotype need not equal their true proportions in the population, even as the number of subjects grows infinitely large. As we compare genotype calls, p-values from tests of association will be anti-conservative when the distribution of the summary statistic differs between cases and controls. Moreover, the called genotypes for two subjects need not be independent, even when their true genotypes are independent. Therefore, p-values from tests of association can be anti-conservative, even when the distributions of the summary statistic for the cases and controls are identical, a fact we believe has yet to be fully demonstrated. Although previous studies have examined the effects of genotyping error on tests of association [4][5][6], studies has neither fully explored the effects caused by case/control differences in distributions nor dependence of error. Second, we discuss two new tests that can circumvent these potential problems. One test compares calls made from a genotyping algorithm designed to minimize the type I error. The second test compares the fluorescence distributions, instead of the called genotypes. We start the Methods section by discussing currently available methods for genotyping SNPs and testing for association. Concurrently, we introduce logiCALL, our new genotyping algorithm, and the likeli-hood ratio-based test of association. In the Results and Discussion section, we start by showing that the proportion of genotypes called AA, AB, and BB need not converge to the true population proportions. Then we discuss how the called genotypes can be dependent. We conclude the section by comparing the proposed tests of association with the current standards through both simulation studies and real data analysis. Then a short Conclusions section summarizes the key points.

Calling Genotypes
There is currently a wide variety of programs available for genotyping SNPs. The most popular supporting Affymetrix are RLMM [7], BRLMM [8], CRLMM [9], CHIAMO [10], SNiPer-HD [11], and MAMS [12]. The most popular program for Illumina is their own proprietary software, BeadStudio, but other methods have been recently suggested by Moorhead et. al. [3], Teo et. al. [13], and Dunning et. al. [14] (Table 1). To introduce these methods, we start by defining notation. Let there be n subjects. Let G ij ∈ {AA, AB, BB} be the true genotype of SNP j in subject i, 1 ≤ i ≤ n. For Affymetrix chips, assume there are n p probe quartets representing each SNP on an array, and let be the normalized probe intensities for subject i, SNP j, and probe k. Here, the subscripts PM A and MM A signify the perfect match and mismatch probes for allele A. PM B and MM B are similarly defined for allele B. The log transformed intensities will be . To maintain notational consistency, for Illumina chips, we denote the BeadStudio intensity values of the two SNP alleles by . As we will only discuss a single SNP for the majority of the paper, we will omit the superscript "j" from all future notation.
With the exception of dynamic modeling (DM) [15], all calling algorithms share the same general form, and we exploit this form to summarize their key features. The process of transforming raw signal into genotypes can be divided into four steps: 1) Normalize the intensity values; 2) Describe the normalized intensity values by a single, possibly multivariate, summary statistic; 3) Estimate the mean and variance of the summary statistic for the three possible genotypes, AA, AB, and BB; and 4) Compare the value of the statistic from a subject to the group parameters found in the third step to make the call. The universal first step, normalizing the intensity values, is tangential to our discussion here. The second step is to choose a statistic, S i that Summarizes the intensities. For example, RLMM models the probe intensities as and . Then, . In the updated version, CRLMM models sense(+) and antisense(-) probes separately, resulting in a 4D statistic, Each method assumes that the distribution, φ M (S i |θ q ), of their statistic in a given population q is a Mixture of multivariate normal distributions where θ, to be defined below, are the parameters characterizing the distribution in population q. When it is clear we are discussing only a single population, we will omit the superscript q. Although problems can arise when the distribution is not a true mixture of normals, those complications are beyond the scope of this paper [16]. For completeness, we point out that a minority of programs, including CRLMM, allow these parameters to vary within a group (e.g. to be subject/array specific). Ignoring that some methods allow for an additional null distribution, the general form is where ψ(·) is the multivariate normal density. Often, for a given value of S i , the assigned genotype maximizes a similarity function: = argmax g D(g, S i |θ). The similarity function is usually a modified version of one of The similarity function is also modified to ensure monotonicity of assignment. When we let S i =  Table 1 describes the details of the four steps for popular methods. For many purposes or to understand the details of the method, especially in handling rare alleles, this table will seem an oversimplification. For our purposes here, it highlights the features of interest.

Tests of Association
The current tests of association start by calling genotypes for a given SNP j in a group of subjects with the disease and in a group of controls.
and n q is the number of subjects with disease status q. In this manuscript, any 'p-value' from a genotype-based association test will be calculated using ANOVA on the logistic regression model with Q i and genotype (unordered grouping) as the dependent and independent variables.
Standard tests tend to err anti-conservatively as we will discuss below. We will propose four alterations that can reduce type I error rate, with only a minimal decrease in power. These are the four differences that separate logi-CALL from standard methods. The first is based on the observation, which is discussed later, that the likelihood profile of φ M (S i |θ) will have multiple local maxima near the overall maximum. When estimating θ, the EM algorithm converges to multiple solutions. For many of those solutions, the resulting parameter set, , satisfies . For each parameter set satisfying this inequality, we will make a new group of genotype assignments, . If more than 10% of such assignments disagree with (τ = 0.06), we label that subject's call as questionable. We also continue the prac- The second alteration is that we do not discard questionable calls, an act which can create false positives. Instead, we assign questionable S i so the proportions of genotypes in the cases and controls are as similar as possible, which is defined as minimizing , with the restriction that the final call for subject i must be either or g 2i . Here, we let g 2i be the genotype which is either the runner-up in terms of distance or the most common genotype among the dissenting calls, depending on why the genotype was labeled as questionable. The third alteration, which is already incorporated into other programs is to perform the EM algorithm under the null hypothesis that the genotype proportions in the two populations are identical [3]. The fourth is the use of a weighted Mahalonobis distance, which is defined later. Given these changes, logiCALL then compares the estimated genotype proportions in cases and controls using logistic regression. Note that none of the changes affect calls for the vast majority of SNPs.
We also introduce a completely new method for testing association based on a likelihood ratio statistic. For our method, steps 1 and 2, normalization and choice of summary statistic, can mimic any of the previously described methods. As our real data to be analyzed was collected on Illumina chips, we choose the statistic from Moorhead, et al. [3], for exposition. We then assume that S i follows the mixture model described by equation (1), but allow the parameters to differ by disease status: is a subset of the unrestricted parameter space, Ω. In an ideal scenario, the distribution of 2log(LR) would converge to a chi-squared distribution, , with 2 degrees of freedom. Therefore, the 'p-value' from a likelihood ratio-based test will be calculated as 1 -(2log(LR)).

Data Source
To demonstrate the problems of genotyping and compare the genotype-and likelihood ratio-based tests of association, we use three types of data. First, for discussion, we may assume a hypothetical study measuring a one dimensional summary statistic, S i , for a SNP j with only two possible genotypes, G i ∈ {0, 1}. Furthermore, to show problems can exist even under the best conditions, where model and truth coincide, we assume that S i follows a normal distribution given Q i and G ij , and that the full distri- . We generated 10 simulated datasets, containing 1000 subjects (500 cases, 500 controls) and 303,100 SNPs for each of 18 scenarios. For each gene j and each subject i, we generated a 2D summary statistic (M ji , Combing the degree of shift for poor quality SNPs and the effect size of truly associated SNPs, we have a total of 18 scenarios used in our simulation. The next set of data is from a recent GWAS of Inflammatory Bowel Disease (IBD) that compared 983 subjects with IBD to 1004 subjects without the disease. Using Illumina microarrays, 308,330 SNPs on the autosomal chromosomes were tested. Jewish and non-Jewish cohorts, approximately equal in size, were analyzed separately, a practice continued here. Details have been previously published [17,18]. Because the overwhelming majority of the SNPs are easy to genotype, as any of the summary statistics neatly divide into three clusters, we chose a 3137 difficult SNP subset where at least two clusters overlap

Two Genotype Example: Parameters
We choose to use the hypothetical, two-genotype, study, to highlight that the estimated parameters can be inconsistent even in the simplest scenario, where the summary statistic is distributed normally and our fitted model is correct. When dealing with only a single population, we define the parameter p 0 (p 1 ) to be the probability that a subject's true genotype is 0 (1).
We define the parameter to be the probability that a subject's called genotype is 0 (1).
Probabilities of called genotypes implicitly depend on the number of subjects in the sample. We then define the parameter c* to be the point which is equidistant to the two genotype groups when θ is known.
Therefore, μ 0 ≤ c* ≤ μ 1 is defined to be a solution to equation 6 For the remainder of the paper, we shall assume such a c* exists. This assumption is safe in practice as genes with extremely rare minor alleles are discarded. When D is the Mahalanobis distance, c* is a solution to We define the final parameter, , to be the probability that a subject's S i value is less than c* given their true genotype is 0 (1). Therefore, by convergence of the MLE, we know that

Our first goal in this
which will useful for the next section.

Two Genotype Example:
We start by assigning genotypes according to their Mahalanobis distance, as done in BRLMM. Recall, we assign wise. Therefore, the probability, , that a subject with genotype 1 is misclassified as genotype 0 will be = P(S i <C|n, θ, G i = 1), and in the limit, we know . Now, it's easy to show that must also be the limiting probability that a subject with genotype 0 is assigned as genotype 1.
Therefore, given the genotype, the limiting conditional probabilities, P( = 1 Ω ≠ G i |G i = 0) and P( = 0 Ω ≠ G i |G i = 1) are equal. If p 0 > p 1 , then the unconditional probabilities cannot be equal, specifically Obviously the opposite inequality holds if p 0 <p 1 . Therefore, with the Mahalanobis distance, the bias will be Clearly, when p 0 = p 1 = 0.5, for all values of . However, when p 0 ≠ p 1 , the bias is a non-zero function , and therefore depends on the parameter group, ( Figure 1). As shown by Figure 1, the bias can be quite large when either and/or |p 0 -0.5 | is large.
Next, assume that genotype assignments are based on a likelihood, or weighted likelihood measure. For a value ω ∈ c(0, 1), the probability that φ g (S i ) exceeds ω, or , changes with , where returns a value greater than μ g . Therefore, φ 0 (s) = φ 1 (s) does not imply anything about the relationship between Φ 0 (s) = Φ 1 (s). We illustrate the potential for bias by a simple example where μ 0 = 0, = 1, and μ 1 = 1. Let p 0 = p 1 = 0.5. Then, for any value of . Equivalently, when two normal densities intersect at the threshold value, the probability of misclassifying a genotype 0 subject will not equal the probability of misclassifying a genotype 1 subject, and therefore . For a given threshold, c*, we can define the bias by . Unlike the previous example, with the Mahalanobis distance, Figure 2 shows that describing the bias as a function of and p 0 can be difficult. In current tests of associa-   Having just discussed cases where equation (9) holds, our standard estimates of p 0 and p 1 are not consistent. Specifically, for these scenarios.

Return of Consistency: Modifying the Mahalanobis Distance
The Mahalanobis Distance, D(g, s i |θ), measures the conditional probability of getting a value as extreme as s i given genotype g. Therefore, we could achieve the same results using where g ∈ {0, 1} and S i is again presumed to be normally distributed. As we saw, the current estimators suffer because they don't account for the genotype probabilities. Borrowing Bayesian terminology, we simply need to weight our distance measure by the prior probability of a subject having each genotype. Therefore, returning to step 4 of our genotyping process, we now define = g max We next examine the behavior of , and .
First, as is common, the dependence increases the variance of these estimates. For any population,   Next, we offer an example to demonstrate the effect of dependence among called genotypes. Specifically, using the IBD data, we show that a is a poor approximation for the distribution of . Because we will use real data, we have purposely chosen to discuss instead of T .
There are many reasons that T may not follow a distribution, including being a poor estimate of p g , the distributions of S i being far from normal and population substructure. However, for large n, the only reason that will not be approximately normal is if are not independent. Also, we chose to focus on the AB genotype as this is certain to be one of the genotypes with an overlapping cluster.
For each of our 3137 SNPs in the IBD data, we bootstrap 40 samples of 500 subjects, and calculate 40 values of , where is estimated by . The qq-plot in Figure 3 compares these 40 × 3137 values with a N(0,1). The distribution is far from normal, which implies that are dependent. Some SNPs were more likely to contribute skewed values than others, but the top and bottom 100 values (200 total) are from 64 different SNPs, indicating that no one SNP, or small number of SNPs, is responsible for the deviation in the qq plot. In contrast, the qq-plot from well-behaved SNPs, where the contributions to the density of S from the three genotypes were separated, was the expected straight line, showing that it was not the normal approximation skewing the results (Figure 3). Because the magnitude of the observed values were larger than predicted by theory, the practical implication is that tests based on the statistic, T, estimated by (T), will be anti-conservative(i.e. too many significant p-values after adjusting for multiple testing) under the null hypothesis.
Here we also note that if S i were truly normal, the impact of the dependency would be much less. For more details on the origin of dependency, please see Table 1. Table 2 shows the results from the simulations, and lists the percentage of the influential SNPs that were ranked among the top 100 most significant SNPs, where the rank-

Comparing Tests of Association
ings were determined by either LogiCALL, a likelihood ratio test, or a standard test, similar to Bead-Studio. As there were 100 influential SNPs in each simulation, an ideal scenario would have 100% of the influential SNPs in the top 100 SNPs. As expected, the percentages increase as the relative risk, comparing the two homogeneous genotypes, increases from 1.5 to 2.5. So long as the densities of φ (M ji |G ji = AA) and φ (M ji |G ji = AB) were distinct, which was the case when the distance between μ AA and exceeded 0.5 (and was less than 1.239), all three tests performed equally well. As the shifts were decreased in simulations, many of these shifted SNPs started to appear in the top 100 most significant SNPs, when ranked by a standard test. Furthermore, the loss in power was exaggerated when the amount of overlap differed between cases and controls, φ(M ji |G ji = AB, In GWAS, each marker is tested for association with the disease. Here, we compare four methods for testing the 3137 chosen SNPs in the IBD study. In the first method, we let Bead-Studio call the genotypes and perform its (Dependency of Calls) The density of is compared to a N(0,1) density for 500 subject samples in a quantile-quantile plot Figure 3 (Dependency of Calls) The density of is compared to a N(0,1) density for 500 subject samples in a quantile-quantile plot. The deviation from the Y = X line indicates that is not distributed as a binomial(n, p AB ) variable.   (Table 3).
When genotyping all SNPs and all subjects, the proportion of Bead-Studio 'p-values' below the three thresholds far exceeded 0.005, 0.001, and 0.0005. Even after removing low quality SNPs and allowing missing calls, the proportion of 'p-values' below the three thresholds were 0.015, 0.005, and 0.004. In contrast, LogiCALL eliminated nearly all false positives. The proportion below the three thresholds were 0.004, 0.002, and 0.002. If the majority of these SNPs are presumed to be null associations, logiCALL appears to be the superior method, so long as the power loss is minimal. Assigning conservative 'p-values' to problematic SNPs is nearly equivalent to removing them. However, because of the tendency for there to be multiple, nearly equivalent, maximum likelihood estimates, the relative distances to the AA, AB, and BB genotypes using the single set of maximum likelihood estimates may not be adequate in identifying questionable calls. Therefore, logiCALL gains an advantage by combining two methods for identifying suspect calls. Additionally, it avoids false positives caused by differential bias, where the proportion of missing calls differs between cases and controls. This new method simplifies testing by requiring no preprocessing and testing all SNPs.
The power loss from logiCALL depends on the quality of the data. When the statistic for a SNP cleanly separates into an AA, AB, and BB group, there is no power loss. In our IBD example, the 'p-values' reported for rs2066843 and rs2076756, the two SNPs that are believed to be truly associated with IBD were similar for the two methods, Bead-Studio (2.9 × 10 -9 and 5.1 × 10 -10 ) and logiCALL (1.5 × 10 -8 and 1.6 × 10 -9 ). Among those subjects meeting the 96% Bead-Studio call rate, logiCALL found no questionable calls.
The likelihood ratio method had mixed results. Clearly, when the distribution of the summary statistic is a mixture of normals, the estimated genotype proportions are asymptotically unbiased. Unfortunately, this method still resulted in an increased number of false positives. However, if we removed those SNPs where at least one call changed when switching from to , the false positive rate decreased to the expected level. In theory, all calls should be identical and only the resulting likelihoods should differ. When assigning genotypes, the cost, in likelihood, incurred from forcing the vectors of genotype proportions to be equal should be far less than the cost of switching calls. When the reverse is true, and calls switch, the statistic for the three groups cannot be well separated, and the p-value is suspect.

Conclusion
In genome-wide association tests, under the null hypothesis, the test statistic rarely follows the expected chisquared distribution. This deviation tends to result in an excess of false positives. Unfortunately, the investigation into the origin of this deviation has yet to be completed. The problems associated with poor signal quality and population substructure have been thoroughly explored. However, the overlap of fluorescent signals has only been identified as a serious problem, and has yet to be fully explained. In this paper, we have provided two reasons, parameter inconsistency and called genotype dependency, that help explain how overlap causes this deviant behavior. Furthermore, we propose two methods, logi-CALL and a method based on the likelihood ratio statistic that better handle the problems of inconsistency and dependency. These methods will perform similarly to the common, genotype-based, test statistics for the wellbehaved SNPs and appear to create fewer false positives for the difficult-to-call SNPs. We have also identified a new characteristic of some false positives, that the call differs when using vs. , that will help distinguisĥ θ jθ R jθ jθ R jθθ R θ R jθ j We have demonstrated that increasing sample size alone will not eliminate type I error, as genotyping, in its current form, leads to inconsistent estimates of population parameters. To alleviate this inconsistency, the distance measure used for assignment would need to be switched to , defined in the Results and Discussion Section.
Moreover, we have proven that the called genotypes can be dependent under certain conditions, and that tests based on called genotypes need to account for the increased variance caused by dependence. Finally, we illustrated that the likelihood profile of the data can be relatively flat near the MLEs. Therefore, judging the quality of calls from distances based only on the MLE may not provide an adequate means to identify questionable calls. Hence logiCALL, which looks at all locally-maximal likelihood estimates of the parameters can reduce the type I error rate. Testing association by the likelihood ratio statistic is another promising method for addressing the problems associated with overlapping signals.

Availability and requirements
Computer programs are available on author's website, http://bioinformatics.med.yale.edu/group/josh/ index.html.