Open Access

Robust joint analysis allowing for model uncertainty in two-stage genetic association studies

BMC Bioinformatics201112:9

DOI: 10.1186/1471-2105-12-9

Received: 29 June 2010

Accepted: 7 January 2011

Published: 7 January 2011

Abstract

Background

The cost efficient two-stage design is often used in genome-wide association studies (GWASs) in searching for genetic loci underlying the susceptibility for complex diseases. Replication-based analysis, which considers data from each stage separately, often suffers from loss of efficiency. Joint test that combines data from both stages has been proposed and widely used to improve efficiency. However, existing joint analyses are based on test statistics derived under an assumed genetic model, and thus might not have robust performance when the assumed genetic model is not appropriate.

Results

In this paper, we propose joint analyses based on two robust tests, MERT and MAX3, for GWASs under a two-stage design. We developed computationally efficient procedures and formulas for significant level evaluation and power calculation. The performances of the proposed approaches are investigated through the extensive simulation studies and a real example. Numerical results show that the joint analysis based on the MAX3 test statistic has the best overall performance.

Conclusions

MAX3 joint analysis is the most robust procedure among the considered joint analyses, and we recommend using it in a two-stage genome-wide association study.

Background

The two-stage design is often adopted in genome-wide association studies (GWASs) to search for genetic variants underlying susceptibility for complex diseases. The advantages of the two-stage design have been investigated extensively (see e.g., [112]). In a typical two-stage design for GWASs, a proportion of the available samples are genotyped at the initial stage on a large number of single nucleotide polymorphisms (SNPs) using a commercial genotyping platform. Based on association test results obtained at this stage, a small percentage of SNPs are selected and further genotyped on the remaining samples in the second stage. To analyze data generated from such a two-stage design, the joint analysis strategy has been recommended, which combines the test statistics from both stages as the final test statistic, and is shown to be more powerful than the replication-based analysis that only utilizes the second stage data [12].

The efficiency of joint analysis based on the allele-frequency-difference-based test (AFDT) was evaluated in detail in comparison to the replication-based analysis [12]. It is commonly adopted as a single marker test in GWASs. The AFDT is valid when Hardy-Weinberg equilibrium (HWE) holds in the target population, and is powerful when the underlying genetic models are additive or multiplicative. The Cochran-Armitage trend test (CATT) [13, 14] derived under the additive (in log scale) genetic risk model is also used in single-maker analysis, which is optimal when the underlying additive genetic model is true. However, both tests are not so powerful compared with other methods such as MAX3 [15] when the underlying genetic model is not additive. Since in most cases there is no evidence suggesting that the additive risk model is most appropriate for the underlying disease model, especially in the typical GWASs where we most likely evaluate only the tagging SNPs, but not the causal SNPs directly. Thus, it is advantageous to adopt a more robust single marker test that has a relatively good performance under all possible disease models. To this end, two types of such robust tests, the MERT (maximin efficiency robust test) [15, 16] and MAX3 (the maximum values of CATTs under recessive, additive and dominant models) have been recently considered [15, 17]. Nevertheless, their performances under the two-stage design have not been thoroughly investigated.

In this report we propose two types of joint test statistics for the two-stage design based on the two robust tests, MERT and MAX3. We derive closed-form formula to calculate the power of the MERT-based joint analysis, and propose a computationally efficient Monte Carlo procedure to evaluate the significance level of the MAX3-based joint analysis. Facilitated by these two procedures, we evaluate the performances of the two robust test based joint analyses, in comparison with the ones based on AFDT, under various two-stage design setups and disease models.

Methods

Notations

Suppose that r cases and s controls are randomly sampled from the source population in a GWAS. Denote the number of SNPs genotyped and the proportion of the subjects in Stage 1 by m and π, respectively. Throughout, we only consider biallelic SNPs with two alleles G and g, with G being the risk allele. Then there are three genotypes: gg, Gg, and GG. Using the disease risk at gg as the baseline, we define the relative risks of Gg and GG as λ1 = f1/f0 and λ2 = f2/f0, respectively, where f0 = Pr(case|gg) > 0, f1 = Pr(case|Gg), f2 = Pr(case|GG) are the penetrances. Let K = Pr(case) be the disease prevalence. Denote the genotype frequencies in case population as p0 = Pr(gg|case) = Pr(gg)f0/K, p1 = Pr(Gg|case) = Pr(Gg)f1/K, p2 = Pr(GG|case) = Pr(GG)f2/K and in control population as q0 = Pr(gg|control) = Pr(gg)(1-f0)/(1-K), q1 = Pr(Gg|control) = Pr(Gg)(1-f1)/(1-K), q2 = Pr(GG|control) = Pr(GG)(1-f2)/(1-K). Then the null hypothesis of no association is H0 : p i = q i , i = 0, 1, 2, which is equivalent to H0 : λ1 = λ2 = 1. The alternative hypothesis is H1 : λ2λ1 ≥ 1 with λ2 > 1. The commonly used three genetic models, recessive, additive and dominant models are corresponding to λ2 >λ1 = 1, 2λ1 = λ2 +1 and λ1 = λ2 > 1, respectively. We assume that SNPs with p-values less than γ in Stage 1 will be further investigated in Stage 2 and α be the whole genome-wide type I error.

The notations for genotype frequencies in case population and control population of Stage 1 and Stage 2 are given in Table 1. It should be noted that p1i= p2iand q1i= q2ifor i = 0, 1, 2 in the table using the first subscript on behalf of Stage 1 or Stage 2 since they are the population parameters. However, the estimates of p1iand q1ifor i = 0, 1, 2 based on the data of Stage 1 and those of p2iand q2ifor i = 0, 1, 2 based on the data of Stage 2 might be different although the data of Stage 1 and Stage 2 are drawn from the same source population.
Table 1

Genotype frequencies in case population and control population for both stages

 

cases

controls

 

gg

Gg

GG

gg

Gg

GG

Stage 1

p 10

p 11

p 12

q 10

q 11

q 12

Stage 2

p 20

p 21

p 22

q 20

q 21

q 22

Allele-Frequency-Difference-Based Joint Analysis

Denote the risk allele frequencies in case population and control population by θ and ϖ, respectively. Let θ ^ 1 and ϖ ^ 1 be their maximum likelihood estimates in Stage 1, respectively. Then the test statistic for Stage 1 is
Z 1 = θ ^ 1 ϖ ^ 1 1 2 r π + 1 2 s π × 1 [ θ ^ 1 ξ + ϖ ^ 1 ( 1 ξ ) ] [ 1 θ ^ 1 ξ ϖ ^ 1 ( 1 ξ ) ] ,  where  ξ = r r + s .
The threshold for selecting SNPs in Stage 1 is b1 = Φ-1(1-γ/2), where Φ(·) is the cumulative standard normal distribution function. Similarly, we can get the maximum likelihood estimates of the risk allele frequencies in case population and control population using the data from Stage 2, denoted by θ ^ 2 and ϖ ^ 2 . Then the test statistic for Stage 2 can be written as
Z 2 = θ ^ 2 ϖ ^ 2 1 2 r ( 1 π ) + 1 2 s ( 1 π ) × 1 [ θ ^ 2 ξ + ϖ ^ 2 ( 1 ξ ) ] [ 1 θ ^ 2 ξ ϖ ^ 2 ( 1 ξ ) ] .

The joint statistic is Z J = π Z 1 + 1 π Z 2 . The Bonferroni correction threshold (b J ) for Z J is the solution of the equation Pr H 0 ( | Z 1 | > b 1 , | Z J | > b J ) = α / m ,

where ( Z 1 , Z J ) ' | H 0 ~ N 2 ( ( 0 , 0 ) ' , Γ Γ ' ) , Γ = ( 1 0 π 1 π ) . So the power of the joint test under the alternative hypothesis is given by Pr H 1 ( | Z 1 | > b 1 , | Z J | > b J ) , where ( Z 1 Z J ) | H 1 ~ N 2 ( ( μ 1 π μ 1 + 1 π μ 2 ) , Γ Δ 1 Γ ' ) , with μ 1 = θ 1 ϖ 1 1 2 r π + 1 2 s π × 1 [ θ 1 ξ + ϖ 1 ( 1 ξ ) ] [ 1 θ 1 ξ ϖ 1 ( 1 ξ ) ] , μ 2 = θ 2 ϖ 2 1 2 r ( 1 π ) + 1 2 s ( 1 π ) × 1 [ θ 2 ξ + ϖ 2 ( 1 ξ ) ] [ 1 θ 2 ξ ϖ 2 ( 1 ξ ) ] , Δ 1 = ( δ 1 0 0 δ 2 ) , δ 1 = ( 1 ξ ) θ 1 ( 1 θ 1 ) + ξ ϖ 1 ( 1 ϖ 1 ) [ θ 1 ξ + ϖ 1 ( 1 ξ ) ] [ 1 θ 1 ξ ϖ 1 ( 1 ξ ) ] and δ 2 = ( 1 ξ ) θ 2 ( 1 θ 2 ) + ξ ϖ 2 ( 1 ϖ 2 ) [ θ 2 ξ + ϖ 2 ( 1 ξ ) ] [ 1 θ 2 ξ ϖ 2 ( 1 ξ ) ] .

The calculation of Pr H 1 ( | Z 1 | > b 1 , | Z J | > b J ) is based on two-fold integration which can be computed using the built-in function, "pmvnorm", in the R package "mvtnorm" [1820].

The above approach is slightly different from the one considered in [12], where the authors constructed the test statistics by estimating the variance of the differences of allele frequency between case population and control population using the cases and controls separately under the null hypothesis. In our joint analysis, we estimated the variance using the combined data of case sample and control sample. Results (not show here) show that the two approaches have very similar performance.

Cochran-Armitage Trend Test under the Additive Model-Based Joint Analysis

Cochran-Armitage trend test under the additive model (CATTA) (see e.g., [13, 15]) is often used in the genetic association studies including GWASs. Denote CATTA for both stages by T 1 A and T 2 A , respectively. Then the threshold for selecting SNPs in Stage 1 is d1 = Φ-1(1-γ/2). The joint test statistic is T J A = π T 1 A + 1 π T 2 A . The threshold (d J ) for T J A can be obtained by solving the equation Pr H 0 ( | T 1 A | > d 1 , | T J A | > d J ) = α / m . The power of the joint analysis is Pr H 1 ( | T 1 A | > d 1 , | T J A | > d J ) , which can be calculated again using the R package "mvtnorm". The joint distributions of ( T 1 A , T J A ) ' under the null and alternative hypotheses are given in Appendix A in Additional file 1.

MERT-Based Joint Analysis

MERT was originally proposed in [16] to find robust test statistic in situations when multiple alternative models are plausible. It was used to define a robust test for single-marker analysis [15]. Here we apply the test to two-stage design. Similar to T 1 A and T 2 A , we can obtain CATTs T 1 R and T 2 R under the recessive model and CATTs T 1 D and T 2 D under the dominant model for both stages. So MERT for both stages are T 1 m e r t = T 1 R + T 1 D [ 2 ( 1 + ρ 1 R D ) ] 1 / 2 and T 2 m e r t = T 2 R + T 2 D [ 2 ( 1 + ρ 2 R D ) ] 1 / 2 , respectively, where ρ 1 R D and ρ 2 R D are the correlation coefficients of T 1 R and T 1 D , and T 2 A and T 2 D under the null hypothesis, respectively, which are shown in Appendix B in Additional file 1. The joint analysis based on MERT can be defined as T J m e r t = π T 1 m e r t + 1 π T 2 m e r t . The threshold for selecting SNPs in Stage 1 is u1 = Φ-1(1-γ/2). To control the false positive rate of the joint analysis, we can obtain the threshold u J , which is the solution to the equation
Pr H 0 ( | T 1 m e r t | > u 1 , | T J m e r t | > u J ) = α / m .

The power of the test is given by Pr H 1 ( | T 1 m e r t | > u 1 , | T J m e r t | > u J ) , whose numerical values can be calculated using the R package "mvtnorm". The joint distributions of ( T 1 m e r t , T J m e r t ) ' under the null and alternative hypotheses are derived in Appendix B in Additional file 1.

MAX3-Based Joint Analysis

MAX3, the maximal value of CATT under three genetic models, is another commonly used robust test in the current GWASs (see e.g., [7, 15, 17]). Once we have ( T 1 R , T 1 A , T 1 D ) and ( T 2 R , T 2 A , T 2 D ) , the test statistic in Stage 1 is T 1 max = max { | T 1 R | , | T 1 A | , | T 1 D | } and the joint analysis based on MAX3 can be defined as T J max = max { | T J R | , | T J A | , | T J D | } , where T J R = π T 1 R + 1 π T 2 R , T J A = π T 1 A + 1 π T 2 A , and T J D = π T 1 D + 1 π T 2 D . For a given significance level γ in Stage 1, the threshold (v1) can be obtained by solving the equation Pr H 0 ( max { | T 1 R | , | T 1 A | , | T 1 D | } > v 1 ) = γ .

According to Chapter 6 of [21], we have T 1 A = ω 11 T 1 R + ω 12 T 1 D , where ω 11 = ρ 1 R A ρ 1 R D ρ 1 A D 1 ( ρ 1 R D ) 2 and ω 12 = ρ 1 A D ρ 1 R D ρ 1 R A 1 ( ρ 1 R D ) 2 , with ρ 1 R A and ρ 1 A D given in Appendix C in Additional file 1. Because ( T 1 R T 1 D ) | H 0 ~ N 2 ( ( 0 0 ) , ( 1 ρ 1 R D ρ 1 R D 1 ) ) , we can obtain v1 using the R package "mvtnorm". After that, we use the following computationally efficient algorithm to approximate the threshold (v J ) for the joint analysis:
  1. 1)

    Generate B identical and independently distributed bivariate normal random variates ( T 11 R , T 11 D ) ' , ( T 12 R , T 12 D ) ' , , ( T 1 B R , T 1 B D ) ' ~ N 2 ( ( 0 0 ) , ( 1 ρ 1 R D ρ 1 R D 1 ) ) . Then calculate T 1 i A = ω 11 T 1 i R + ω 12 T 1 i D , and T 1 i max = max { | T 1 i R | , | T 1 i A | , | T 1 i D | } for i = 1, 2, , B. Without loss of generality, we assume T 1 i max > v 1 for i = 1, 2, , B 1 and T 1 i max v 1 for i = B 1 + 1, B 1 + 2, . B.

     
  1. 2)
    Generate B 1 identical and independently distributed bivariate normal random variates ( T 21 R , T 21 D ) ' , ( T 22 R , T 22 D ) ' , , ( T 2 B 1 R , T 2 B 1 D ) ' ~ N 2 ( ( 0 0 ) , ( 1 ρ 2 R D ρ 2 R D 1 ) ) . Then calculate T 2 i A = ω 21 T 2 i R + ω 22 T 2 i D , ω 21 = ρ 2 R A ρ 2 R D ρ 2 A D 1 ( ρ 2 R D ) 2 and ω 22 = ρ 2 A D ρ 2 R D ρ 2 R A 1 ( ρ 2 R D ) 2 , with ρ 2 R A and ρ 2 A D given in Appendix C in Additional file 1. For i = 1, 2, , B 1, calculate
    T J i max = max { | π T 1 i R + 1 π T 2 i R | , | π T 1 i A + 1 π T 2 i A | , | π T 1 i D + 1 π T 2 i D | } .
     
  2. 3)

    Find v J that yields min | # { T J i max > v J , i = 1 , 2 , , B 1 } B 1 α m γ | with # { T J i max > v J , i = 1 , 2 , , B 1 } B 1 α m γ and v J { T J i max , i = 1 , 2 , , B 1 } .

     

Once we have v1 and v J , we generate the data under the alternative hypothesis to calculate the power empirically. In the simulation studies, we generate 10,000 data sets under the alternative hypothesis. For the ith data set (i = 1, 2. , 10000), we calculate T 1 max and T J max , denote them again by T 1 i max and T J i max , respectively. Then the empirical power is # { T 1 i max > v 1 , T J i max > v J ; i = 1 , 2 , , 10000 } 10000 .

Results

Simulation Setup

In order to mimic the real GWAS, we choose the simulation parameters similar to [12, 22]. In a typical GWAS, there are thousands of individuals randomly chosen from the source population and the number of SNPs being examined in Stage 1 is usually from 0.1 million to 1 million. Based on the results of Stage 1 (p-values), the number of SNPs to be genotyped in Stage 2 is in tens or hundreds. For example, in a diabetes mellitus GWAS [7], there were 392,935 SNPs genotyped on 1,363 subjects in Stage 1, and 57 SNPs were genotyped in Stage 2 after removing those SNPs with p-values greater than 0.0001 based on the data of Stage 1. In a GWAS, the significance level in the whole genome is often set to be 0.05, and the Bonferroni-correction is often used to adjust for multiple comparisons and to control the false positive rate. So, in our simulation studies, we set the number of SNPs at Stage 1 m = 500,000 and the p-value threshold for significant SNPs to be 0.05/m = 1 ×10-7. The proportion of subjects genotyped in Stage 1 is set to be 0.5, 0.4 and 0.3, and the p-value threshold for SNPs selection at the end of Stage 1 be 0.0001 and 0.0002. The disease prevalence is set to be K = 0.1. Throughout our simulation procedures, we assume that Hardy-Weinberg equilibrium (HWE) holds in the general population. Furthermore, the risk allele is assumed to be the minor allele, with frequency (MAF) equal to 0.15, 0.25, 0.35 and 0.45. The considered genetic models are the recessive, additive, and dominant models. We specified different genotype relative risks λ1 and λ2 for the three genetic models (see details in Table 2, 3, 4 and 5). The critical values for MAX3 joint analysis are simulated, while thresholds for other three joint analysis are exactly calculated based on their asymptotic distributions under the null hypothesis where the genotype probabilities (p0, p1, p2) for cases and (q0, q1, q2) for controls are calculated by p0 = q0 = Pr(gg) = (1-MAF)2, p1 = q1 = Pr(Gg) = 2 × MAF ×(1-MAF) and p2 = q2 = Pr(GG) = MAF2. Under the alternative hypothesis, the genotype frequencies can be obtained using the formulas given in the Notations Subsection and f0 = K/[Pr(gg) + λ1 Pr(Gg) + λ2 Pr(GG)]. More details could be referred to [23] and [24]. The genotype counts in case sample and control sample were generated from a multinomial distribution.
Table 2

Power comparison for MAF = 0.15 (K = 0.1, α = 0.05, m = 5 × 105)

 

π

γ

ALLEJ

CATAJ

MERTJ

MAX3J

Recessive Model

0.5

0.0001

0.070

0.058

0.385

0.759

r = s = 5000

 

0.0002

0.076

0.064

0.420

0.798

λ1 = 1, λ2 = 2

0.4

0.0001

0.049

0.042

0.273

0.583

  

0.0002

0.058

0.048

0.317

0.646

 

0.3

0.0001

0.029

0.025

0.155

0.346

  

0.0002

0.036

0.031

0.193

0.423

Additive Model

0.5

0.0001

0.601

0.613

0.440

0.555

r = s = 2000

 

0.0002

0.643

0.655

0.477

0.599

λ1 = 1.4, λ2 = 1.8

0.4

0.0001

0.450

0.460

0.317

0.406

  

0.0002

0.507

0.517

0.364

0.451

 

0.3

0.0001

0.271

0.277

0.183

0.226

  

0.0002

0.326

0.334

0.226

0.281

Dominant Model

0.5

0.0001

0.679

0.711

0.356

0.726

r = s = 2000

 

0.0002

0.720

0.752

0.388

0.768

λ1 = λ2 = 1.5

0.4

0.0001

0.520

0.551

0.254

0.552

  

0.0002

0.579

0.611

0.293

0.621

 

0.3

0.0001

0.322

0.345

0.146

0.339

  

0.0002

0.383

0.408

0.181

0.400

Table 3

Power comparison for MAF = 0.25 (K = 0.1, α = 0.05, m = 5 × 105)

 

π

γ

ALLEJ

CATAJ

MERTJ

MAX3J

Recessive Model

0.5

0.0001

0.075

0.066

0.220

0.517

r = s = 5000

 

0.0002

0.083

0.073

0.242

0.546

λ1 = 1, λ2 = 1.5

0.4

0.0001

0.053

0.047

0.154

0.365

  

0.0002

0.062

0.055

0.180

0.408

 

0.3

0.0001

0.031

0.028

0.087

0.197

  

0.0002

0.039

0.035

0.110

0.254

Additive Model

0.5

0.0001

0.835

0.846

0.782

0.799

r = s = 2000

 

0.0002

0.868

0.878

0.820

0.838

λ1 = 1.4, λ2 = 1.8

0.4

0.0001

0.687

0.700

0.625

0.639

  

0.0002

0.742

0.754

0.683

0.700

 

0.3

0.0001

0.462

0.474

0.405

0.413

  

0.0002

0.530

0.542

0.472

0.476

Dominant Model

0.5

0.0001

0.717

0.757

0.511

0.826

r = s = 2000

 

0.0002

0.758

0.796

0.551

0.853

λ1 = λ2 = 1.5

0.4

0.0001

0.557

0.597

0.375

0.651

  

0.0002

0.617

0.656

0.427

0.726

 

0.3

0.0001

0.350

0.382

0.221

0.425

  

0.0002

0.413

0.447

0.270

0.495

Table 4

Power comparison for MAF = 0.35 (K = 0.1, α = 0.05, m = 5 × 105)

 

π

γ

ALLEJ

CATAJ

MERTJ

MAX3J

Recessive Model

0.5

0.0001

0.420

0.384

0.536

0.824

r = s = 4000

 

0.0002

0.456

0.418

0.578

0.860

λ1 = 1, λ2 = 1.5

0.4

0.0001

0.302

0.274

0.393

0.657

  

0.0002

0.348

0.317

0.447

0.717

 

0.3

0.0001

0.175

0.158

0.231

0.436

  

0.0002

0.216

0.196

0.282

0.492

Additive Model

0.5

0.0001

0.891

0.900

0.882

0.864

r = s = 2000

 

0.0002

0.916

0.925

0.909

0.895

λ1 = 1.4, λ2 = 1.8

0.4

0.0001

0.760

0.773

0.747

0.715

  

0.0002

0.809

0.821

0.797

0.767

 

0.3

0.0001

0.537

0.551

0.522

0.481

  

0.0002

0.604

0.618

0.590

0.548

Dominant Model

0.5

0.0001

0.558

0.607

0.464

0.758

r = s = 2000

 

0.0002

0.600

0.649

0.502

0.806

λ1 = λ2 = 1.5

0.4

0.0001

0.413

0.455

0.337

0.599

  

0.0002

0.468

0.512

0.385

0.660

 

0.3

0.0001

0.246

0.274

0.197

0.374

  

0.0002

0.298

0.330

0.242

0.437

Table 5

Power comparison for MAF = 0.45 (K = 0.1, α = 0.05, m = 5 × 105)

 

π

γ

ALLEJ

CATAJ

MERTJ

MAX3J

Recessive Model

0.5

0.0001

0.282

0.253

0.263

0.542

r = s = 2000

 

0.0002

0.308

0.277

0.288

0.572

λ1 = 1, λ2 = 1.5

0.4

0.0001

0.199

0.178

0.184

0.380

  

0.0002

0.231

0.207

0.215

0.442

 

0.3

0.0001

0.114

0.101

0.104

0.220

  

0.0002

0.142

0.127

0.131

0.263

Additive Model

0.5

0.0001

0.886

0.896

0.894

0.854

r = s = 2000

 

0.0002

0.912

0.921

0.919

0.881

λ1 = 1.4, λ2 = 1.8

0.4

0.0001

0.753

0.767

0.765

0.701

  

0.0002

0.803

0.815

0.813

0.760

 

0.3

0.0001

0.529

0.543

0.542

0.473

  

0.0002

0.597

0.610

0.609

0.545

Dominant Model

0.5

0.0001

0.279

0.317

0.302

0.590

r = s = 2000

 

0.0002

0.305

0.346

0.329

0.626

λ1 = λ2 = 1.5

0.4

0.0001

0.197

0.225

0.214

0.428

  

0.0002

0.229

0.260

0.248

0.483

 

0.3

0.0001

0.112

0.128

0.123

0.241

  

0.0002

0.140

0.160

0.153

0.291

Simulation Results

For convenience, we refer to the aforementioned four joint analysis approaches as, respectively, ALLEJ (allele-frequency-difference-based joint analysis), CATAJ (Cochran-Armitage trend test under the additive model-based joint analysis), MERTJ (MERT-based joint analysis), MAX3J (MAX3-based joint analysis). Table 2, 3, 4 and 5 report powers of the four joint analysis methods corresponding to MAF equal to 0.15, 0.25, 0.35 and 0.45, respectively. From these tables, we have the following observations. Under the recessive model, MERTJ and MAX3J are more powerful than ALLEJ and CATAJ, with MAX3J being most powerful among the four methods under consideration. In some cases, the advantage of MAX3J is quite impressive. For example, in Table 2, with π = 0.5, γ = 0.0001, the powers of ALLEJ, CATAJ, MERTJ and MAX3J are 0.070, 0.058, 0.385 and 0.759, respectively. Under the additive model, CATAJ and ALLEJ have comparable power and are more powerful than the other two tests. However, the power difference between CATAJ and MAX3J is mostly at the level of 6.6%, with the largest discrepancy of 7%. Under the dominant model, CATAJ and MAX3J are more powerful than ALLEJ and MERTJ. Both tests have comparable power when MAF = 0.15, and MAX3J is much more powerful than CATAJ when MAF = 0.25, 0.35 and 0.45. In summary, it appears that MAX3J has the best overall performance.

A Real Example: Type 2 Diabetes Mellitus

Type 2 diabetes mellitus is one of the most common diseases, and has been found to be associated with environmental factors and genetic variants. A two-stage GWAS for type 2 diabetes mellitus was reported in [7]. In this study, 392,935 SNPs were genotyped on 1,363 subjects in Stage 1. Based on the statistical significance level of 1 × 10-4, 57 SNPs were selected and further screened on 2,617 cases and 2,894 controls in Stage 2. We applied the above four considered methods to two SNPs, rs1005316 and rs2876711, which were not reported in their Table 1, but were shown in their Appendix. Table 6 gives the genotype counts and p-values of these two SNPs. We found a genome-wide significant association between rs2876711 and the outcome. Although the association between rs2876711 and type 2 diabetes mellitus has not been reported by [7], our results show that we should be concerned with this SNP and its neighborhood area. Additional experiments should be further conducted to validate this association.
Table 6

Genotype counts and p-values of SNPs rs1005316 and rs2876711 for type 2 diabetes mellitus

SNP ID

 

r0

r1

r2

s0

s1

s2

ALLEJ

CATAJ

MERTJ

MAX3J

rs1005316

Stage 1

13

224

457

44

211

399

6.13 × 10-5

7.78 × 10-6

3.87 × 10-6

8.12 × 10-7

 

Stage 2

89

669

1708

89

913

1856

    

rs2876711

Stage 1

99

322

272

121

351

182

2.92 × 10-7

2.07 × 10-8

5.97 × 10-8

3.10 × 10-8

 

Stage 2

389

1191

989

484

1404

987

    

Note: r0, r1, and r2 denote the number of individuals carrying genotype gg, Gg, and GG in case sample, respectively; s0, s1, and s2 denote the number of individuals carrying genotype gg, Gg, and GG in control sample, respectively.

Discussion and Conclusions

In genetic association studies, the underlying genetic inheritance model is often unknown, and thus hinders the use of methods such as CATT, which has to be derived under an assumed genetic model. Robust tests, such as MERT and MAX3, had been proposed to relax the dependence on the underlying genetic models. Extending these tests to a two-stage setting, we construct two robust joint analyses based on MERT and MAX3. Numerical results show that MAX3J has the best overall performance among the four considered joint analysis approaches. For type 2 diabetes mellitus, based on MAX3J, we found that SNP rs2876711 was significantly associated with type 2 diabetes mellitus besides their findings.

Pearson Chi-square test is a robust test that was used in genetic association studies (see e.g., [25]). Recently, a comprehensive power comparison between MAX3 and Pearson Chi-square test and Cochran-Armitage trend test under the additive model was conducted in [17]. They reported that MAX3 has the most robust performances. The proposed joint analysis combing the test statistics of both stages considers the between-stage heterogeneity. It is intractable for Pearson Chi-square test to consider the relative risk heterogeneity of both stages, especially when the relative risk in Stage 1 is larger than one and that in Stage 2 is less than one.

Recently, a joint analysis based on genetic model selection [26] to overcome the genetic model uncertainty was proposed in [22]. Based on the data in Stage 1, they used Hardy-Weinberg disequilibrium trend test studied in [27] to determine a score that corresponds to a genetic model. This score was then used to construct the trend test based on the data of Stage 2. Results (not shown here) show that the proposed joint analysis has comparable power. Therefore, the proposed MAX3J can be used as an alternative procedure in two-stage genome-wide association studies.

Abbreviations

GWAS: 

genome-wide association study

SNP: 

single nucleotide polymorphism

MAF: 

minor allele frequency

AFDT: 

allele-frequency-difference-based test

CATT: 

Cochran-Armitage trend test

MERT: 

maximin efficiency robust test

MAX3: 

maximum values of Cochran-Armitage trend tests under recessive, additive and dominant models

ALLEJ: 

allele-frequency-difference-based joint analysis

CATAJ: 

Cochran-Armitage trend test under the additive model-based joint analysis

MERTJ: 

MERT-based joint analysis

MAX3J: 

MAX3-based joint analysis.

Declarations

Acknowledgements

We would like to thank the editor and three anonymous reviewers for their very constructive comments and suggestions, which significantly improved our presentation. This work is partially support by the National Young Science Foundation of China, No. 10901155.

Authors’ Affiliations

(1)
Department of Statistics, Yunnan University
(2)
Academy of Mathematics and Systems Science, Chinese Academy of Sciences
(3)
Biostatistics and Bioinformatics Branch, Eunice Kennedy Shriver National Institute of Child Health and Human Development
(4)
Biostatistics Branch, Division of Cancer Epidemiology and Genetics, National Cancer Institute

References

  1. Zuo YJ, Zou GH, Zhao HY: Two-stage designs in case-control association analysis. Genetics 2006, 173: 1747–1760. 10.1534/genetics.105.042648PubMed CentralView ArticlePubMedGoogle Scholar
  2. Goll A, Bauer P: Two-stage designs applying methods differing in costs. Bioinformatics 2007, 23: 1519–1526. 10.1093/bioinformatics/btm140View ArticlePubMedGoogle Scholar
  3. Muller HH, Pahl R, Schafer H: Including sampling and phenotyping costs into the optimization of two-stage designs for genome-wide association studies. Genetic Epidemiology 2007, 31: 844–852. 10.1002/gepi.20245View ArticlePubMedGoogle Scholar
  4. Satagopan JM, Elston RC: Optimal two-stage genotyping in population-based association studies. Genetic Epidemiology 2003, 25: 149–157. 10.1002/gepi.10260View ArticlePubMedGoogle Scholar
  5. Satagopan JM, Venkatraman ES, Begg CB: Two-stage designs for gene-disease association studies with sample size constraints. Biometrics 2004, 60: 589–597. 10.1111/j.0006-341X.2004.00207.xView ArticlePubMedGoogle Scholar
  6. Skol AD, Scott LJ, Abecasis GR, Boehnke M: Optimal designs for two-stage genome-wide association studies. Genetic Epidemiology 2007, 31: 776–788. 10.1002/gepi.20240View ArticlePubMedGoogle Scholar
  7. Sladek R, Rocheleau G, Rung J, Dina C, Shen L, Serre D, Boutin P, Vincent D, Belisle A, Hadjadj S, Balkau B, Heude B, Charpentier G, Hudson TJ, Montpetit A, Pshezhetsky AV, Prentki M, Posner BI, Balding DJ, Meyre D, Polychronakos C, Froguel P: A genome-wide association study identifies novel risk loci for type 2 diabetes. Nature 2007, 445: 881–885. 10.1038/nature05616View ArticlePubMedGoogle Scholar
  8. Thomas D, Xie R, Gebregziabher M: Two-stage sampling designs for gene association studies. Genetic Epidemiology 2004, 27: 401–414. 10.1002/gepi.20047View ArticlePubMedGoogle Scholar
  9. Wang H, Thomas DC, Pe'er I, Stram DO: Optimal two-stage designs for genome-wide association scans. Genetic Epidemiology 2006, 30: 356–368. 10.1002/gepi.20150View ArticlePubMedGoogle Scholar
  10. Yu K, Chatterjee N, Wheeler W, Li Q, Wang S, Rothman N, Wacholder S: Flexible design for following up positive findings. American Journal of Human Genetics 2007, 81: 540–551. 10.1086/520678PubMed CentralView ArticlePubMedGoogle Scholar
  11. Zheng G, Meyer M, Li W, Yang Y: Comparison of two-phase analyses for case-control genetic association studies. Statistics in Medicine 2008, 27: 5054–5075. 10.1002/sim.3336View ArticlePubMedGoogle Scholar
  12. Skol AD, Scott LJ, Abecasis GR, Boehnke M: Joint analysis is more efficient than replication-based analysis for two-stage genome-wide association studies. Nature Genetics 2006, 38: 209–213. 10.1038/ng1706View ArticlePubMedGoogle Scholar
  13. Sasieni PD: From genotypes to genes: Doubling the sample size. Biometrics 1997, 53: 1253–1261. 10.2307/2533494View ArticlePubMedGoogle Scholar
  14. Zheng G, Gastwirth JL: On estimation of the variance in Cochran-Armitage trend tests for genetic association using case-control studies. Statistics in Medicine 2006, 25: 3150–3159. 10.1002/sim.2250View ArticlePubMedGoogle Scholar
  15. Zheng G, Friedlin B, Gastwirth JL: Comparison of robust tests for genetic association using case-control studies. In Optimality: The Second Erich L. Lehmann Symposium. Edited by: Rojo J. Beachwood: Institute of Mathematical Statistics; 2006:253–265. [DasGupta A (Series Editor): Lecture Notes-Monograph Series, vol 49.] [DasGupta A (Series Editor): Lecture Notes-Monograph Series, vol 49.]View ArticleGoogle Scholar
  16. Gastwirth JL: The use of maximin efficiency robust tests in combining contingency tables and survival analysis. Journal of the American Statistical Association 1985, 80: 380–384. 10.2307/2287901View ArticleGoogle Scholar
  17. Li Q, Zheng G, Liang X, Yu K: Robust tests for single-marker analysis in case-control genetic association studies. Annals of Human Genetics 2009, 73: 245–252. 10.1111/j.1469-1809.2009.00506.xView ArticlePubMedGoogle Scholar
  18. Genz A: Numerical computation of multivariate normal probabilities. Journal of Computational and Graphical Statistics 1992, 1: 141–150. 10.2307/1390838Google Scholar
  19. Genz A: Comparison of methods for the computation of multivariate normal probabilities. Computing Science and Statistics 1993, 25: 400–405.Google Scholar
  20. Tong YL: The multivariate normal distribution. New York: Springer-Verlag; 1990.View ArticleGoogle Scholar
  21. Zheng G, Yang Y, Zhu X, Elston RC: Case-control studies of genetic association. New York: Springer; 2010.Google Scholar
  22. Kwak M, Joo J, Zheng G: A robust test for two-stage design in genome-wide association studies. Biometrics 2009, 65: 1288–1295. 10.1111/j.1541-0420.2008.01187.xView ArticlePubMedGoogle Scholar
  23. Schaid DJ, Sommer SS: Genotype relative risks: methods for design and analysis of candidate-gene association studies. American Journal of Human Genetics 1993, 53: 1114–1126.PubMed CentralPubMedGoogle Scholar
  24. Terwilliger JD, Ott J: Handbook of human genetic linkage. Baltimore: Johns Hopkins University Press; 1994.Google Scholar
  25. Yeager M, Orr N, Hayes RB, Jacobs KB, Kraft P, Wacholder S, Minichiello MJ, Fearnhead P, Yu K, Chatterjee N, Wang Z, Welch R, Staats BJ, Calle EE, Feigelson HS, Thun MJ, Rodriguez C, Albanes D, Virtamo J, Weinstein S, Schumacher FR, Giovannucci E, Willett WC, Cancel-Tassin G, Cussenot O, Valeri A, Andriole GL, Gelmann EP, Tucker M, Gerhard DS, Fraumeni JF Jr, Hoover R, Hunter DJ, Chanock SJ, Thomas G: Genome-wide association study of prostate cancer identifies a second risk locus at 8q24. Nature Genetics 2007, 39: 645–649. 10.1038/ng2022View ArticlePubMedGoogle Scholar
  26. Zheng G, Ng HKT: Genetic model selection in two-phase analysis for case control association studies. Biostatistics 2008, 9: 391–399. 10.1093/biostatistics/kxm039PubMed CentralView ArticlePubMedGoogle Scholar
  27. Song K, Elson RC: A powerful method of combining measures of association and Hardy-Weinberg disequilibrium for fine-mapping in case-control studies. Statistics in Medicine 2006, 25: 105–126. 10.1002/sim.2350View ArticlePubMedGoogle Scholar

Copyright

© Pan et al; licensee BioMed Central Ltd. 2011

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.