Genotyping and inflated type I error rate in genome-wide association case/control studies
- Joshua N Sampson^{1}Email author and
- Hongyu Zhao^{1}
https://doi.org/10.1186/1471-2105-10-68
© Sampson and Zhao; licensee BioMed Central Ltd. 2009
Received: 12 December 2008
Accepted: 23 February 2009
Published: 23 February 2009
Abstract
Background
One common goal of a case/control genome wide association study (GWAS) is to find SNPs associated with a disease. Traditionally, the first step in such studies is to assign a genotype to each SNP in each subject, based on a statistic summarizing fluorescence measurements. When the distributions of the summary statistics are not well separated by genotype, the act of genotype assignment can lead to more potential problems than acknowledged by the literature.
Results
Specifically, we show that the proportions of each called genotype need not equal the true proportions in the population, even as the number of subjects grows infinitely large. The called genotypes for two subjects need not be independent, even when their true genotypes are independent. Consequently, p-values from tests of association can be anti-conservative, even when the distributions of the summary statistic for the cases and controls are identical. To address these problems, we propose two new tests designed to reduce the inflation in the type I error rate caused by these problems. The first algorithm, logiCALL, measures call quality by fully exploring the likelihood profile of intensity measurements, and the second algorithm avoids genotyping by using a likelihood ratio statistic.
Conclusion
Genotyping can introduce avoidable false positives in GWAS.
Background
One common goal of a genome wide association (GWAS) study is to search the entire genome for single nucleotide polymorphisms (SNPs) and copy number variations (CNVs) associated with a disease or some other phenotype. In this article, we focus our analysis on SNPs. The two possible alleles at a SNP are arbitrarily labeled A and B, and association is often tested by measuring and comparing the frequencies of the genotypes AA, AB, and BB, in case and control groups. As technology currently allows close to one million SNPs to be examined simultaneously, there is a need for fast, automated methods to test for association. As only a small minority of SNPs are expected to be associated with the disease, even a modest false positive rate could bury true associations beneath those occurring by chance.
Using Affymetrix 500 k GeneChips as an example, each of the 500,000+ SNPs is represented by a series of probes on a pair of arrays. Each probe is an oligonucleotide designed to bind to either the A or B allele. A subject's fluorescently labeled DNA is allowed to hybridize with these probes, and then a spectrometer measures the relative fluorescent levels between the A and B probes. Each genotyping algorithm (see Methods) summarizes the fluorescence information, or the likelihood a subject has allele A at that SNP, by its own statistic. In any population, these statistics usually cluster into three groups, corresponding to the three genotypes. Studies have noted that 1) The mean and variance of these clusters, or the shape of their distribution in general, varies by SNP and 2) For a single SNP, because of differences in processing or duration of storage, the shape of the statistic's distribution can differ between the case and control groups [1–3].
The current test of association requires calling, or assigning a genotype, to each SNP for each study subject and then comparing the called genotypes between the case and control groups. The majority of SNPs are easy to call and any of the available methods will call them correctly. Unfortunately, there is a difficult minority that cannot be easily clustered into three distinct groups. Because there can be as many as 500,000 SNPs, this minority can greatly inflate the type I error rate and cause the large, characteristic, deviations from the x = y line in the qq-plots of test statistics. Most studies assume these consequences are the unavoidable results of population substructure and poor data. In this paper, we dispel the myth that these are the sole issues. In fact, the inflated type I error rate and general misbehavior of the test statistic may also result from the act of genotype assignment and a poor choice of statistical methodology.
The goal of this manuscript is two-fold. First, our primary goal is to show that genotyping overlapping clusters can lead to potential problems that we have yet to see fully acknowledged in the literature. The proportions of each called genotype need not equal their true proportions in the population, even as the number of subjects grows infinitely large. As we compare genotype calls, p-values from tests of association will be anti-conservative when the distribution of the summary statistic differs between cases and controls. Moreover, the called genotypes for two subjects need not be independent, even when their true genotypes are independent. Therefore, p-values from tests of association can be anti-conservative, even when the distributions of the summary statistic for the cases and controls are identical, a fact we believe has yet to be fully demonstrated. Although previous studies have examined the effects of genotyping error on tests of association [4–6], studies has neither fully explored the effects caused by case/control differences in distributions nor dependence of error. Second, we discuss two new tests that can circumvent these potential problems. One test compares calls made from a genotyping algorithm designed to minimize the type I error. The second test compares the fluorescence distributions, instead of the called genotypes. We start the Methods section by discussing currently available methods for genotyping SNPs and testing for association. Concurrently, we introduce logiCALL, our new genotyping algorithm, and the likelihood ratio-based test of association. In the Results and Discussion section, we start by showing that the proportion of genotypes called AA, AB, and BB need not converge to the true population proportions. Then we discuss how the called genotypes can be dependent. We conclude the section by comparing the proposed tests of association with the current standards through both simulation studies and real data analysis. Then a short Conclusions section summarizes the key points.
Methods
Calling Genotypes
Programs available for genotyping SNPs.
Name | Summary Statistic | MM^{1} | Data^{2} | Data^{3} | Notes |
---|---|---|---|---|---|
RLMM | {Θ_{ A }, Θ_{ B }} | No | T | M | |
BRLMM | $\frac{1}{asinh(4)}asinh\left(\frac{4({I}_{A}-{I}_{B})}{{I}_{A}-{I}_{B}}\right)$ | No | E-U | M | Assumes genotypes in "training" data are known. "Training" data only uses high quality SNPs. Incorporates info from other SNPs as a B ayesian Prior. |
CRLMM | {Θ_{A+}- Θ_{B+}, Θ_{A-}- Θ_{B-}} | No | T | L | C orrects for the effect of total intensity level and probe length on {S_{ ij }} through more complex method, and allows corrections to vary by array. |
CHIAMO | $\left\{\frac{1}{np}{\displaystyle {\sum}_{k=1}^{{n}_{p}}{Y}_{Ak},\frac{1}{np}{\displaystyle {\sum}_{k=1}^{{n}_{p}}{Y}_{Ak}}}\right\}$ | Yes | E-L, T* | W | CHIAMO is a Bayesian hierarchical mixture model and is greatly simplified by this brief summary |
SNiPer-HD | $\begin{array}{l}\{{R}_{1},\mathrm{...},{R}_{{n}_{p}}\},\text{where}\\ {R}_{k}=({I}_{P{M}_{A}}^{ijk})/({I}_{P{M}_{A}}^{ijk}+{I}_{P{M}_{B}}^{ijk})\end{array}$ | No | E-U | W | Assumes genotypes in "training" data are unknown and requires the EM algorithm. "Training" data should only use high quality SNPs. |
Moorhead | $\frac{1}{sinh(2)}sinh\left(\frac{{I}_{P{M}_{A}}+{I}_{P{M}_{B}}}{{I}_{P{M}_{B}}}\right)$ | N/A | E-U | W | Originally for MIP, but applicable to Affymetrix. Plagnol demonstrated how to link genotype probabilities between cases and controls. |
logiCALL | $\left\{\frac{1}{sinh(2)}sinh\left(\frac{{I}_{P{M}_{A}}+{I}_{P{M}_{B}}}{{I}_{P{M}_{B}}}\right),{I}_{P{M}_{A}}+{I}_{P{M}_{B}}\right\}$ | No | E-L | W-F | Designed to lower false positive rate and assigns calls based on cumulative distribution, not density functions. |
where ψ(·) is the multivariate normal density. The three mean vectors, $\mu \equiv \{{\overrightarrow{\mu}}_{AA},{\overrightarrow{\mu}}_{AB},{\overrightarrow{\mu}}_{BB}\}$, variance matrices, Σ ≡ {Σ_{ AA }, Σ_{ AB }, Σ_{ BB }}, and probabilities, p ≡ {p_{ AA }, p_{ AB }, p_{ BB }}, correspond to the three possible genotypes, AA, AB, and BB. Define $\theta \equiv \{\overrightarrow{\mu},\Sigma ,\overrightarrow{p}\}$ and later, we let Φ(·) and Φ^{ M }(·) be the cumulative distribution for a normal variable and a mixture of normals. The third step is to estimate θ. Some algorithms, such as RLMM, use a training data set, where the genotypes are known. Other algorithms, such as SNiPer-HD, use no training data, and find the best estimates that describe their experimental values. The fourth step is to assign a genotype, ${\stackrel{\u02c6}{G}}_{ij}$, to SNP j in subject i. Often, for a given value of S_{ i }, the assigned genotype maximizes a similarity function: ${\stackrel{\u02c6}{G}}_{i}$ = argmax_{ g }D(g, S_{ i }|θ). The similarity function is usually a modified version of one of the following three quantities: Mahalanobis distance: $-({S}_{i}-{\overrightarrow{\mu}}_{g}){\Sigma}_{g}^{-1}{({S}_{i}-{\overrightarrow{\mu}}_{g})}^{t}$, Unweighted Likelihood: $\varphi ({S}_{i}|{\overrightarrow{\mu}}_{g},{\Sigma}_{g})$, or the Weighted Likelihood: p_{ g }$\varphi ({S}_{i}|{\overrightarrow{\mu}}_{g},{\Sigma}_{g})$. The similarity function is also modified to ensure monotonicity of assignment. When we let S_{ i }= {M_{ i }, A_{ i }}, we force D(AB, S_{ i }|θ) = D(BB, S_{ i }|θ) = 0 if M_{ i }≤ μ_{ AA }, D(BB, S_{ i }|θ) = 0 if μ_{ AA }≤ M_{ i }≤ μ_{ AB }, D(AA, S_{ i }|θ) = 0 if μ_{ AB }≤ M_{ i }≤ · μ_{ BB }, and D(AB, S_{ i }|θ) = D(AA, S_{ i }|θ) = 0 if M_{ i }≥ μ_{ BB }, where μ_{ g }, in this case, is the mean of M_{ i }when G_{ i }= g. This modification is standard in calling algorithms. As we do not know the true value of the parameters in experiments, we replace D(g, S_{ i }|θ) by D(g, S_{ i }|$\stackrel{\u02c6}{\theta}$). A subject's SNP may not be called, or assigned a missing value, if the difference or ratio between D(${\stackrel{\u02c6}{G}}_{i}$, S_{ i }|$\stackrel{\u02c6}{\theta}$) and D(g_{2i}, S_{ i }|$\stackrel{\u02c6}{\theta}$) is not large enough, where g_{2i}is the genotype with the second largest value of D(g, S_{ i }|$\stackrel{\u02c6}{\theta}$). A SNP may be omitted from further study if too many values were set to missing. Table 1 describes the details of the four steps for popular methods. For many purposes or to understand the details of the method, especially in handling rare alleles, this table will seem an oversimplification. For our purposes here, it highlights the features of interest.
Tests of Association
The current tests of association start by calling genotypes for a given SNP j in a group of subjects with the disease and in a group of controls. They then compare the resulting proportions, ${\stackrel{\u02c6}{P}}^{A}\equiv \{{\stackrel{\u02c6}{P}}_{AA}^{A},{\stackrel{\u02c6}{P}}_{AB}^{A},{\stackrel{\u02c6}{P}}_{BB}^{A}\}$ and ${\stackrel{\u02c6}{P}}^{U}\equiv \{{\stackrel{\u02c6}{P}}_{AA}^{U},{\stackrel{\u02c6}{P}}_{AB}^{U},{\stackrel{\u02c6}{P}}_{BB}^{U}\}$, from these Affected and Unaffected groups using either a Cochran-Armitage test or logistic regression. Here ${\stackrel{\u02c6}{p}}_{g}^{q}\equiv \frac{1}{nq}{\displaystyle {\sum}_{i:{Q}_{i=q}}1({\stackrel{\u02c6}{G}}_{i}=g)}$, where the indicator function is defined by 1(x) = 1 if x is true, 0 otherwise, Q_{ i }∈ {A, U} is the disease status for individual i, and n_{ q }is the number of subjects with disease status q. In this manuscript, any 'p-value' from a genotype-based association test will be calculated using ANOVA on the logistic regression model with Q_{ i }and genotype (unordered grouping) as the dependent and independent variables.
Standard tests tend to err anti-conservatively as we will discuss below. We will propose four alterations that can reduce type I error rate, with only a minimal decrease in power. These are the four differences that separate logiCALL from standard methods. The first is based on the observation, which is discussed later, that the likelihood profile of φ^{ M }(S_{ i }|θ) will have multiple local maxima near the overall maximum. When estimating θ, the EM algorithm converges to multiple solutions. For many of those solutions, the resulting parameter set, ${\stackrel{\u02c6}{\theta}}_{lm}$, satisfies ${\prod}_{i}{\varphi}^{M}({S}_{i}|\stackrel{\u02c6}{\theta})\le}{\displaystyle {\prod}_{i}{\varphi}^{M}({S}_{i}|{\stackrel{\u02c6}{\theta}}_{lm})}+\tau n$. For each parameter set satisfying this inequality, we will make a new group of genotype assignments, $\left\{{\stackrel{\u02c6}{G}}_{i}^{lm}\right\}$. If more than 10% of such assignments disagree with ${\stackrel{\u02c6}{G}}_{i}$ (τ = 0.06), we label that subject's call as questionable. We also continue the practice of marking calls with small values of D(${\stackrel{\u02c6}{G}}_{i}$, S_{ i }|$\stackrel{\u02c6}{\theta}$)/D(g_{2i}, S_{ i }|$\stackrel{\u02c6}{\theta}$) as questionable. The second alteration is that we do not discard questionable calls, an act which can create false positives. Instead, we assign questionable S_{ i }so the proportions of genotypes in the cases and controls are as similar as possible, which is defined as minimizing ${\sum}_{g}|{\stackrel{\u02c6}{p}}_{g}^{U}-{\stackrel{\u02c6}{p}}_{g}^{A}|$, with the restriction that the final call for subject i must be either ${\stackrel{\u02c6}{G}}_{i}$ or g_{2i}. Here, we let g_{2i}be the genotype which is either the runner-up in terms of distance or the most common genotype among the dissenting calls, depending on why the genotype was labeled as questionable. The third alteration, which is already incorporated into other programs is to perform the EM algorithm under the null hypothesis that the genotype proportions in the two populations are identical [3]. The fourth is the use of a weighted Mahalonobis distance, which is defined later. Given these changes, logiCALL then compares the estimated genotype proportions in cases and controls using logistic regression. Note that none of the changes affect calls for the vast majority of SNPs.
Clearly, the restricted parameter space, $\{{\Omega}_{R}:{p}_{AA}^{A},{p}_{AA}^{U},{p}_{AB}^{A}={p}_{AB}^{U},{p}_{BB}^{A}={p}_{BB}^{U}\}$ is a subset of the unrestricted parameter space, Ω. In an ideal scenario, the distribution of 2log(LR) would converge to a chi-squared distribution, ${F}_{{\chi}_{2}^{2}}$, with 2 degrees of freedom. Therefore, the 'p-value' from a likelihood ratio-based test will be calculated as 1 - ${F}_{{\chi}_{2}^{2}}$(2log(LR)).
Data Source
To demonstrate the problems of genotyping and compare the genotype- and likelihood ratio-based tests of association, we use three types of data. First, for discussion, we may assume a hypothetical study measuring a one dimensional summary statistic, S_{ i }, for a SNP j with only two possible genotypes, G_{ i }∈ {0, 1}. Furthermore, to show problems can exist even under the best conditions, where model and truth coincide, we assume that S_{ i }follows a normal distribution given Q_{ i }and G_{ ij }, and that the full distribution can be described by ${\varphi}^{M}({S}_{i}|\theta )={p}_{0}\varphi ({S}_{i}|{\mu}_{0},{\sigma}_{0}^{2})+{p}_{1}\varphi ({S}_{i}|{\mu}_{1},{\sigma}_{1}^{2})$.
We compare our two new tests of association to a standard method using simulated data. The standard method mimics the general Bead-Studio approach by a) fitting parameters with the EM algorithm; b) calling genotypes based on the Mahalanobis Distance; c) removing all calls where D(${\stackrel{\u02c6}{G}}_{i}$, S_{ i }|$\stackrel{\u02c6}{\theta}$)/D(g_{2}, S_{ i }|$\stackrel{\u02c6}{\theta}$) > 0.5; and d) comparing the two sets of resulting estimates, $\{{\stackrel{\u02c6}{P}}_{AA}^{A},{\stackrel{\u02c6}{P}}_{AB}^{A},{\stackrel{\u02c6}{P}}_{BB}^{A}\}$ and $\{{\stackrel{\u02c6}{P}}_{AA}^{U},{\stackrel{\u02c6}{P}}_{AB}^{U},{\stackrel{\u02c6}{P}}_{BB}^{U}\}$. We generated 10 simulated datasets, containing 1000 subjects (500 cases, 500 controls) and 303,100 SNPs for each of 18 scenarios. For each gene j and each subject i, we generated a 2D summary statistic (M_{ ji }, A_{ ji }). The distribution of M_{ ji }depended on genotype. If G_{ ji }= AA, then M_{ ji }~ 2X - 1, and if G_{ ji }= BB, then M_{ ji }~1 - 2X, where X ~ beta(α = 3, β = 30). If G_{ ji }= AB, then M_{ ji }also followed a beta distribution, but the parameters varied by SNP, disease status, and scenario. For all SNPs and all subjects, A_{ ji }~ N (10, 1.5). We generated three types of SNPs, background, shifted, and influential. First, 300,000 background SNPs were generated and included in all 10 × 18 = 180 data sets. For each SNP, a single minor allele frequency was generated from a uniform(0.2, 0.4) distribution and genotype probabilities $({\overrightarrow{p}}^{U}={\overrightarrow{p}}^{A})$ were generated assuming Hardy-Weinberg Equilibrium. Here, E [M_{ ji }|G_{ ji }= AB] = 0 and was independent of disease status. These SNPs, which formed three distinct clusters, can be easily identified and represent a well-behaved group. For 3,000 shifted SNPs, MAF ~uniform(0.2, 0.4) and genotype probabilities $({\overrightarrow{p}}^{U}={\overrightarrow{p}}^{A})$ were generated assuming Hardy-Weinberg Equilibrium. Here, E [M_{ ji }|G_{ ji }= AB] ∈ {-0.739 + 0.2, - 0.739 + 0.3 - 0.739 + 0.5}, where we note E [M_{ ji }|G_{ ji }= AA] = -0.739 and E [M_{ ji }|G_{ ji }= AB, Q_{ i }= A] - E [M_{ ji }|G_{ ji }= AB, Q_{ i }= U] ∈ {0, 0.2}. This group represents difficult to call SNPs. For 100 influential SNPs, MAF ~uniform(0.2, 0.4) and genotype probabilities $({\overrightarrow{p}}^{U}\ne {\overrightarrow{p}}^{A})$ were chosen so that, under a disease prevalence of 0.01 and a model of additive effects, the genotype relative risk for subjects homogeneous for the minor allele, P(Q_{ i }= A|BB)/P (Q_{ i }= A|AA) ∈ {1.5, 2.0, 2.5}. Combing the degree of shift for poor quality SNPs and the effect size of truly associated SNPs, we have a total of 18 scenarios used in our simulation.
The next set of data is from a recent GWAS of Inflammatory Bowel Disease (IBD) that compared 983 subjects with IBD to 1004 subjects without the disease. Using Illumina microarrays, 308,330 SNPs on the autosomal chromosomes were tested. Jewish and non-Jewish cohorts, approximately equal in size, were analyzed separately, a practice continued here. Details have been previously published [17, 18]. Because the overwhelming majority of the SNPs are easy to genotype, as any of the summary statistics neatly divide into three clusters, we chose a 3137 difficult SNP subset where at least two clusters overlap (definition below). Because association was tested separately in Jewish and non-Jewish cohorts, a total of 3137 × 2 = 6274 tests were possible. To demonstrate called genotype dependency, we bootstrapped 40 samples, ignoring case/control status, of 500 subjects for each SNP. A sample would be discarded, and replaced by another random selection, if one or both of the groups were lacking an AA (S_{ i }<-0.7) or a BB (S_{ i }> 0.7) genotype.
We defined difficult SNPs as follows. For SNP j, we first estimated the density, $\stackrel{\u02c6}{f}$(M_{j.}) nonparametrically using the R function 'density(adjust = 0.3)'. In theory, $\stackrel{\u02c6}{f}$(M_{j.}) is a mixture of three normal distributions, corresponding to the three genotypes. If the SNP is well-behaved, then the three underlying densities will not overlap, and the empirical density $\stackrel{\u02c6}{f}$(M_{j.}) will attain minima near 0 in the valleys between μ_{ jAA }and μ_{ jAB }and between μ_{ jAB }and μ_{ jBB }. If either of these minima exceeded 0.2, then at least two of underlying densities overlapped, and that SNP was defined as difficult. To speed the process, we found that approximating the center of the peaks (i.e. μ_{ jAA }, μ_{ jAB }, and μ_{ jBB }) by the median values of M_{j.}in the three windows, {M_{j.}≤ -0.6; -0.3 ≤ M_{j.}≤ 0.3, M_{j.}≥ 0.5}, worked well.
Results and discussion
Two Genotype Example: Parameters
We choose to use the hypothetical, two-genotype, study, to highlight that the estimated parameters can be inconsistent even in the simplest scenario, where the summary statistic is distributed normally and our fitted model is correct. When dealing with only a single population, we define the parameter p_{0} (p_{1}) to be the probability that a subject's true genotype is 0 (1).
p_{0} ≡ P (G_{ i }= 0) and p_{1} ≡ P(G_{ i }= 1) (4)
Probabilities of called genotypes implicitly depend on the number of subjects in the sample. We then define the parameter c* to be the point which is equidistant to the two genotype groups when θ is known.
Therefore, μ_{0} ≤ c* ≤ μ_{1} is defined to be a solution to equation 6
D(G_{ i }= 0, S_{ i }= c*|θ) = D(G_{ i }= 1, S_{ i }= c*|θ) (6)
Therefore, by convergence of the MLE, we know that
lim_{n → ∞}C →_{ p }c* (12)
which will useful for the next section.
Two Genotype Example: ${p}_{0}^{*}\ne {p}_{0}$
Here β is a measure of the power to reject ${H}_{0}^{*}$ and we assume β > α. Therefore, the current method of rejecting H_{0} whenever T* > ${t}_{\alpha}^{*}$ is actually anti-conservative if the stated p-value is α
Two Genotype Example: Inconsistency
for these scenarios.
Return of Consistency: Modifying the Mahalanobis Distance
With this change, our estimate ${\stackrel{\u02c6}{p}}_{0}\equiv \frac{1}{n}{\displaystyle {\sum}_{i}1({\stackrel{\u02c6}{G}}_{i}=0)}$ would now be consistent.
Dependence/Correlation of Called Genotypes
The central limit theorem allows us to be confident that we have an α-level test because $\sqrt{n}({\tilde{p}}_{g}^{q}-{p}_{g}^{q})\approx N(0,{p}_{g}^{q}(1-{p}_{g}^{q}))$ when {G_{1},..., G_{ n }is a vector of independent Bernoulli random variables. Again, we have returned to the two genotype scenario to simplify our discussion.
This proof, which uses Jensen's inequality, clearly shows that the two variables, ${\stackrel{\u02c6}{G}}_{{i}_{1}}$ and ${\stackrel{\u02c6}{G}}_{{i}_{2}}$, are not independent as that would have implied
$P({\stackrel{\u02c6}{G}}_{{i}_{1}}={\stackrel{\u02c6}{G}}_{{i}_{2}})=P({\stackrel{\u02c6}{G}}_{{i}_{1}}=0,{\stackrel{\u02c6}{G}}_{{i}_{2}}=0)+P({\stackrel{\u02c6}{G}}_{{i}_{1}}=1,{\stackrel{\u02c6}{G}}_{{i}_{2}}=1)={({p}_{0}^{n*})}^{2}+{({p}_{1}^{n*})}^{2}$. The consequence of this dependence is that $\sqrt{n}({\stackrel{\u02c6}{p}}_{g}^{q}-{p}_{g}^{q})$ does not follow a $N(0,{p}_{g}^{q}(1-{p}_{g}^{q}))$ distribution, and in turn, that $T({\stackrel{\u02c6}{p}}_{0}^{A},{\stackrel{\u02c6}{p}}_{0}^{U})$ neither follows a ${\chi}_{2}^{2}$ distribution nor has P($T({\stackrel{\u02c6}{p}}_{0}^{A},{\stackrel{\u02c6}{p}}_{0}^{U})$ > t_{ α }) = α.
The first term, $nE[var(\frac{{\displaystyle {\sum}_{i=1}^{n}1({\stackrel{\u02c6}{G}}_{i}=0)}}{n}|C)]$, represents the uncertainty in the true number of subjects with genotype 0, and can be roughly approximated by ${\stackrel{\u02c6}{p}}_{0}-{({\stackrel{\u02c6}{p}}_{0})}^{2}$. The second term, $nvar(E[\frac{{\displaystyle {\sum}_{i=1}^{n}1({\stackrel{\u02c6}{G}}_{i}=0)}}{n}|C])$ reflects that there will be a subpopulation, of a random size ~n|Φ^{ M }(C) - Φ^{ M }(E[C])|, that is assigned the 'non-ideal' genotype, where a call is labeled 'non-ideal' if it would have been different had the threshold been E [C]. We can approximate this second term, the overall increase in the variance, by nvar(Φ^{ M }(C)) if P(${\stackrel{\u02c6}{G}}_{i}$ = 0|C) ≈ Φ^{ M }(C) and the cor(${\stackrel{\u02c6}{G}}_{{i}_{1}}$, ${\stackrel{\u02c6}{G}}_{{i}_{2}}$|C) ≈ 0. In experiments, we can estimate nvar(Φ^{ M }(C)) by bootstrapping samples of C or ${\stackrel{\u02c6}{\Phi}}^{M}$(C).
Note that ${\stackrel{\u02c6}{p}}_{0}\equiv {\stackrel{\u02c6}{\Phi}}^{M}(C)$ and ${p}_{0}^{n*}\equiv E[{\stackrel{\u02c6}{\Phi}}^{M}(C)]$. The appropriate multiple of the first term, n(${\stackrel{\u02c6}{\Phi}}^{M}$(C) - Φ^{ M }(C)), should be well approximated by X-E [C], where X ~binomial(n, E [C]). From our own experience, we have seen that the third term, (Φ^{ M }(E[C]) - E${\stackrel{\u02c6}{\Phi}}^{M}$(C)]), is a constant close to 0. The second term, (Φ^{ M }(C) - Φ^{ M }(E[C])) is the variable which causes deviation from normality. Again, we could approximate the distribution of this term by bootstrapping C.
Next, we offer an example to demonstrate the effect of dependence among called genotypes. Specifically, using the IBD data, we show that a $N(0,E[{\stackrel{\u02c6}{p}}_{AB}](1-E[{\stackrel{\u02c6}{p}}_{AB}]))$ is a poor approximation for the distribution of $\sqrt{n}({\stackrel{\u02c6}{p}}_{AB}-E[{\stackrel{\u02c6}{p}}_{AB}])$. Because we will use real data, we have purposely chosen to discuss $\sqrt{n}({\stackrel{\u02c6}{p}}_{AB}-E[{\stackrel{\u02c6}{p}}_{AB}])/\sqrt{E[{\stackrel{\u02c6}{p}}_{AB}]-{(E[{\stackrel{\u02c6}{p}}_{AB}])}^{2}}$ instead of T . There are many reasons that T may not follow a ${\chi}_{2}^{2}$ distribution, including ${p}_{g}^{*}$ being a poor estimate of p_{ g }, the distributions of S_{ i }being far from normal and population substructure. However, for large n, the only reason that $\sqrt{n}({\stackrel{\u02c6}{p}}_{AB}-E[{\stackrel{\u02c6}{p}}_{AB}])/\sqrt{E[{\stackrel{\u02c6}{p}}_{AB}]-{(E[{\stackrel{\u02c6}{p}}_{AB}])}^{2}}$ will not be approximately normal is if $\{{\stackrel{\u02c6}{G}}_{1},\mathrm{...},{\stackrel{\u02c6}{G}}_{n}\}$ are not independent. Also, we chose to focus on the AB genotype as this is certain to be one of the genotypes with an overlapping cluster.
Comparing Tests of Association
The percentage of influential genes among the top 100 most significant SNPs, as ranked by LogiCALL, a likelihood ratio test, and a standard test.
RR = 1.5 | RR = 2.0 | RR = 2.5 | ||||||||
---|---|---|---|---|---|---|---|---|---|---|
Shift | Difference | LogiCALL | LR | Standard | LogiCALL | LR | Standard | LogiCALL | LR | Standard |
0.5 | 0 | 4.4 | 4.6 | 4.5 | 41.8 | 41.8 | 41.3 | 78.8 | 79.2 | 79.3 |
0.2 | 4.4 | 4.6 | 4.5 | 42.0 | 41.8 | 41.3 | 78.8 | 79.3 | 79.4 | |
0.3 | 0 | 4.4 | 4.6 | 4.6 | 42.0 | 41.9 | 41.2 | 78.8 | 79.2 | 79.4 |
0.2 | 4.4 | 1.9 | 0.2 | 41.6 | 29.5 | 14.4 | 78.8 | 67.3 | 42.7 | |
0.2 | 0 | 4.4 | 4.4 | 2.1 | 42.0 | 41.4 | 25.9 | 78.8 | 78.9 | 47.1 |
0.2 | 3.7 | 0.1 | 0.0 | 38.9 | 3.4 | 0.0 | 75.3 | 21.3 | 0.0 |
The percentage of 'p-values' less than traditional^{α}-levels (0.005,0.001,0.005) are listed for four tests of association.
Method | n | p < 0.005 | p < 0.001 | p < 0.0005 |
---|---|---|---|---|
BeadStudio | 3487 | 0.015 | 0.005 | 0.004 |
logiCALL | 5533 | 0.004 | 0.002 | 0.002 |
LR | 5533 | 0.087 | 0.061 | 0.052 |
LR-(same) | 3014 | 0.006 | 0.002 | 0.001 |
When genotyping all SNPs and all subjects, the proportion of Bead-Studio 'p-values' below the three thresholds far exceeded 0.005, 0.001, and 0.0005. Even after removing low quality SNPs and allowing missing calls, the proportion of 'p-values' below the three thresholds were 0.015, 0.005, and 0.004. In contrast, LogiCALL eliminated nearly all false positives. The proportion below the three thresholds were 0.004, 0.002, and 0.002. If the majority of these SNPs are presumed to be null associations, logiCALL appears to be the superior method, so long as the power loss is minimal. Assigning conservative 'p-values' to problematic SNPs is nearly equivalent to removing them. However, because of the tendency for there to be multiple, nearly equivalent, maximum likelihood estimates, the relative distances to the AA, AB, and BB genotypes using the single set of maximum likelihood estimates may not be adequate in identifying questionable calls. Therefore, logiCALL gains an advantage by combining two methods for identifying suspect calls. Additionally, it avoids false positives caused by differential bias, where the proportion of missing calls differs between cases and controls. This new method simplifies testing by requiring no preprocessing and testing all SNPs. The power loss from logiCALL depends on the quality of the data. When the statistic for a SNP cleanly separates into an AA, AB, and BB group, there is no power loss. In our IBD example, the 'p-values' reported for rs2066843 and rs2076756, the two SNPs that are believed to be truly associated with IBD were similar for the two methods, Bead-Studio (2.9 × 10^{-9} and 5.1 × 10^{-10}) and logiCALL (1.5 × 10^{-8} and 1.6 × 10^{-9}). Among those subjects meeting the 96% Bead-Studio call rate, logiCALL found no questionable calls.
The likelihood ratio method had mixed results. Clearly, when the distribution of the summary statistic is a mixture of normals, the estimated genotype proportions are asymptotically unbiased. Unfortunately, this method still resulted in an increased number of false positives. However, if we removed those SNPs where at least one call changed when switching from $\stackrel{\u02c6}{\theta}$ to ${\stackrel{\u02c6}{\theta}}_{R}$, the false positive rate decreased to the expected level. In theory, all calls should be identical and only the resulting likelihoods should differ. When assigning genotypes, the cost, in likelihood, incurred from forcing the vectors of genotype proportions to be equal should be far less than the cost of switching calls. When the reverse is true, and calls switch, the statistic for the three groups cannot be well separated, and the p-value is suspect.
Conclusion
In genome-wide association tests, under the null hypothesis, the test statistic rarely follows the expected chi-squared distribution. This deviation tends to result in an excess of false positives. Unfortunately, the investigation into the origin of this deviation has yet to be completed. The problems associated with poor signal quality and population substructure have been thoroughly explored. However, the overlap of fluorescent signals has only been identified as a serious problem, and has yet to be fully explained. In this paper, we have provided two reasons, parameter inconsistency and called genotype dependency, that help explain how overlap causes this deviant behavior. Furthermore, we propose two methods, logiCALL and a method based on the likelihood ratio statistic that better handle the problems of inconsistency and dependency. These methods will perform similarly to the common, genotype-based, test statistics for the well-behaved SNPs and appear to create fewer false positives for the difficult-to-call SNPs. We have also identified a new characteristic of some false positives, that the call differs when using ${\stackrel{\u02c6}{\theta}}_{R}^{j}$ vs. ${\stackrel{\u02c6}{\theta}}^{j}$, that will help distinguish which low p-values represent significant disease/marker association.
We have demonstrated that increasing sample size alone will not eliminate type I error, as genotyping, in its current form, leads to inconsistent estimates of population parameters. To alleviate this inconsistency, the distance measure used for assignment would need to be switched to ${p}_{g}{d}_{g}^{\u2020}(s)$, defined in the Results and Discussion Section. Moreover, we have proven that the called genotypes can be dependent under certain conditions, and that tests based on called genotypes need to account for the increased variance caused by dependence. Finally, we illustrated that the likelihood profile of the data can be relatively flat near the MLEs. Therefore, judging the quality of calls from distances based only on the MLE may not provide an adequate means to identify questionable calls. Hence logiCALL, which looks at all locally-maximal likelihood estimates of the parameters can reduce the type I error rate. Testing association by the likelihood ratio statistic is another promising method for addressing the problems associated with overlapping signals.
Availability and requirements
Computer programs are available on author's website, http://bioinformatics.med.yale.edu/group/josh/index.html.
Declarations
Acknowledgements
We thank Dr. Judy Cho for generously sharing her IBD data. This project was supported by Dr. Zhao's NIH grant GM 50507 and Dr. Sampson's Ruth L Kirschstein post-doctoral fellowship.
Authors’ Affiliations
References
- Plagnol V, Cooper J, Todd J, Clayton D: A method to Address Differential Bias in Genotyping in Large-Scale Association Studies. PLoS Genet. 2007, 3 (5): 759-767.View ArticleGoogle Scholar
- Clayton D, Walker N, Smyth D, Pask R, Cooper J, Maier L, Smink L, Lam A, Ovington N, Stevens H, Nutlad S, Howson J, Faham M, Moorhead M, Jones H, Falkowski M, Hardenbol P, Willis T, Todd J: Population Structure, Differential Bias, and Genomic Control in a Large-Scale, Case-Control Association Study. Nature Genetics. 2005, 37: 1243-1246.View ArticlePubMedGoogle Scholar
- Moorhead M, Hardenbol P, Siddiqui F, Falkowski M, Bruckner C, Ireland J, Jones H, Jain M, Willis T, Faham M: Optimal Genotype Determination in Highly Multiplexed SNP Data. Eur J Hum Genet. 2006, 14: 207-15.View ArticlePubMedGoogle Scholar
- Miller M, Schwander K, Rao D: Genotyping Errors and Their Impact on Genetic Analysis. Advances in Genetics. 2008, 60: 141-152.View ArticlePubMedGoogle Scholar
- Sobel E, Papp J, K L: Detection and Integration of Genotyping Errors in Statistical Genetics. American Journal of Human Genetics. 2002, 70: 496-508.PubMed CentralView ArticlePubMedGoogle Scholar
- Rice K, Holmans P: Allowing for Genotyping Error in Analysis of Unmatched Case-Control Studies. Annals of Human Genetics. 2003, 67: 165-174.View ArticlePubMedGoogle Scholar
- Rabbee N, Speed TP: A genotype calling algorithm for affymetrix SNP arrays. Bioinformatics. 2006, 22: 7-12.View ArticlePubMedGoogle Scholar
- Affymetrix: BRLMM: an improved Genotype Calling Method for the GeneChip Human Mapping 500 K Array Set. [http://www.affymetrix.com/support/technical/whitepapers/brlmm whitepaper.pdf]
- Carvalho B, Bengtsson H, Speed TP, Irizarry RA: Exploration, normalization, and genotype calls of high-density oligonucleotide SNP array data. Biostatistics. 2007, 8 (2): 485-499.View ArticlePubMedGoogle Scholar
- Consortium TWTCC: Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature. 2007, 447: 661-678.View ArticleGoogle Scholar
- Hua J, Craig DW, Brun M, Webster J, Zismann V, Tembe W, Joshipura K, Huentelman MJ, Dougherty ER, Stephan DA: SNiPer-HD: improved genotype calling accuracy by an expectation-maximization algorithm for high-density SNP arrays. Bioinformatics. 2007, 23: 57-63.View ArticlePubMedGoogle Scholar
- Xiao Y, Segal MR, Yang Y, Yeh RF: A multi-array multi-SNP genotyping algorithm for Affymetrix SNP microarrays. Bioinformatics. 2007, 23 (12): 1459-1467.View ArticlePubMedGoogle Scholar
- Teo YY, Inouye M, Small KS, Gwilliam R, Deloukas P, Kwiatkowski DP, Clark TG: A genotype calling algorithm for the Illumina BeadArray platform. Bioinformatics. 2007, 23 (20): 2741-2746.PubMed CentralView ArticlePubMedGoogle Scholar
- Dunning MJ, Smith ML, Ritchie ME, Tavare S: beadarray: R classes and methods for Illumina bead-based data. Bioinformatics. 2007, 23 (16): 2183-2184.View ArticlePubMedGoogle Scholar
- Di X, Matsuzaki H, Webster TA, Hubbell E, Liu G, Dong S, Bartell D, Huang J, Chiles R, Yang G, Shen Mm, Kulp D, Kennedy GC, Mei R, Jones KW, Cawley S: Dynamic model based algorithms for screening and genotyping over 100 K SNPs on oligonucleotide microarrays. Bioinformatics. 2005, 21 (9): 1958-1963.View ArticlePubMedGoogle Scholar
- Murphy EA, Bolling DR: Testing of single locus hypotheses where there is incomplete separation of the phenotypes. American Journal of Human Genetics. 1967, 19: 322-334.PubMed CentralPubMedGoogle Scholar
- Duerr RH, Taylor KD, Brant SR, Rioux JD, Silverberg MS, Daly MJ, Steinhart AH, Abraham C, Regueiro M, Griffths A, Dassopoulos T, Bitton A, Yang H, Targan S, Datta LW, Kistner EO, Schumm LP, Lee AT, Gregersen PK, Barmada MM, Rotter JI, Nicolae DL, Cho JH: A Genome-Wide Association Study Identifies IL23R as an Inflammatory Bowel Disease Gene. Science. 2006, 314 (5804): 1461-1463.PubMed CentralView ArticlePubMedGoogle Scholar
- Rioux JD, Xavier R, Taylor KD, Silverberg MS, Goyette P, Huett A, Green T, Kuballa P, Barmada MM, Datta LW, Shugart YY, Griffths A, Targan S, Ippoliti AF, Bernard EJ, Mei L, Nicolae DL, Regueiro M, Schumm LP, Steinhart AH, Rotter J, Duerr RH, Cho JH, Daly MJ, Brant SR: Genome-Wide Association Study Identifies New Susceptibility Loci for Crohn Disease and Implicates Autophagy in Disease Pathogenesis. Nature Genetics. 2007, 39: 596-604.PubMed CentralView ArticlePubMedGoogle Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.