A two-phase procedure for non-normal quantitative trait genetic association study

Background The nonparametric trend test (NPT) is well suitable for identifying the genetic variants associated with quantitative traits when the trait values do not satisfy the normal distribution assumption. If the genetic model, defined according to the mode of inheritance, is known, the NPT derived under the given genetic model is optimal. However, in practice, the genetic model is often unknown beforehand. The NPT derived from an uncorrected model might result in loss of power. When the underlying genetic model is unknown, a robust test is preferred to maintain satisfactory power. Results We propose a two-phase procedure to handle the uncertainty of the genetic model for non-normal quantitative trait genetic association study. First, a model selection procedure is employed to help choose the genetic model. Then the optimal test derived under the selected model is constructed to test for possible association. To control the type I error rate, we derive the joint distribution of the test statistics developed in the two phases and obtain the proper size. Conclusions The proposed method is more robust than existing methods through the simulation results and application to gene DNAH9 from the Genetic Analysis Workshop 16 for associated with Anti-cyclic citrullinated peptide antibody further demonstrate its performance. Electronic supplementary material The online version of this article (doi:10.1186/s12859-016-0888-x) contains supplementary material, which is available to authorized users.


Derivations of ρ R , ρ A and ρ D under the null hypothesis
The covariance of Z 1 and Z x under the null hypothesis can be expressed as Using the notations in the main text, we have, under the null hypothesis, and cov H 0 (f 01 − 1/2,f 12 − 1/2 ) .

The numerators of cov
Hence, asymptotically,

Consistent estimators of ρ R , ρ A and ρ D under the null hypothesis
Because the distribution functions of Y 0 , Y 1 and Y 2 are unknown, we need to estimate the distribution functions F 0 , F 1 and F 2 in order to obtain the expressions of ρ R , ρ A and ρ D under H 0 . We point out that the direct empirical estimates of these correlations using the observed data are biased. So we modify the sample to make that the means of the three sets { y 1 , y 2 , · · · , y n 0 } , { y n 0 +1 , y n 0 +2 , · · · , y n 0 +n 1 } , and { y n 0 +n 1 +1 , y n 0 +n 1 +2 , · · · , y n } are equal. This adjustment will make sure the estimation procedure of ρ R , ρ A and ρ D being calculated under the null hypothesis. First, we calculate the sample medians of the above three sets, denoted them by a 0 , a 1 and a 2 , respectively, where a 0 = ( ∑ n 0 i=1 y i )/n 0 , a 1 = ( ∑ n 0 +n 1 j=n 0 +1 y j )/n 1 and a 2 = ( ∑ n k=n 0 +n 1 +1 y k )/n 2 . Then we change the sample sets a 2 , y n 0 +n 1 +2 − a 2 , · · · , y n − a 2 } , respectively. Denote the corresponding transformed sam- var H 0 (f 01 ), var H 0 (f 12 ) and cov H 0 (f 01 − 1/2,f 12 − 1/2) under the null hypothesis can be estimated by σ 2 01 , σ 2 12 , and σ 2 01,12 , respectively, which results in var H 0 (f 01 −f 12 ) = σ 2 01 − 2 σ 2 01,12 + σ 2 12 , where σ 2 01 , σ 2 12 , and σ 2 01,12 are given by The estimates of the variancef R ,f A andf D are given by Next we will give the estimate of cov can be written as Asymptotically, the expression of cov H 0 .
By now, we obtain the estimates of the correlations ρ

Additional simulation results for the model selection procedure
In this section, we show the performances of the proposed genetic model selection procedure. The simulation settings are the same as those in the main text. We consider the t-distributed error term with 3 degrees of freedom. Table S1-S3 shows the results of We compare the performances of four procedures: KW, Z A , MAX3, and TPP. Here, we consider the linear model Y = β 0 + Gβ 1 + ϵ, where Y denotes the phenotype value, G denotes the genotype value, and the error term ϵ follows a generalized extreme value distribution (a heavy-tailed distribution) with the shape parameter 0, the location parameter 0, and the scale parameter d (denoted as GEV(0, 0, d)).    Table S5. The empirical type I errors of KW, Z A , MAX3, and TPP when the error term follows a GEV distribution. The total number of the subjects is n = 1, 000. The left panel is for the significant level α = 0.05 and d=2 and the right panel is for the significant level α = 0.001 and d = 1.  Figure S1. The powers of KW, Z A , MAX3, and TPP with the GEV-distributed error term under three different models. The first column is for the nominal level α = 0.05 and The second column is for the nominal level α = 0.001. The total number of the subjects is n = 1, 000.

Simulation results for the error term following the centralized t distribution.
We compare the performances of four procedures: KW, Z A , MAX3, and TPP. Here, we consider the linear model Y = β 0 + Gβ 1 + ϵ, where Y denotes the phenotype value, G denotes the genotype value, and the error term ϵ follows a centralized t distribution with b degrees of freedom (denoted as t(b)). Let b = 3 and the sample size is 1,000.    Table S7. The empirical type I errors of KW, Z A , MAX3, and TPP when the error term follows the t-distribution with 3 degrees of freedom. The sample size is 1,000. The left panel is for the significant level α = 0.05 and the right panel is for the significant level α = 0.001.  Figure S2. The powers of KW, Z A , MAX3, and TPP with the t-distribution error under three genetic models. The first column is for the nominal level α = 0.05 and the second column is for the nominal level α = 0.001. The total number of the subjects is n = 1, 000.

Simulation results for the model with covariates.
Here, we simulated the data from the linear model with covariates: Y = β 0 +Xγ +Gβ 1 +ϵ, where Y denotes the phenotype value, G denotes the genotype value at a SNP locus, X denotes the covariate value which follows a standard normal distribution, and ϵ follows a truncated generalized extreme value distribution (a heavy-tailed distribution, denoted as tGEV(0, 0, d, 0)) with the shape parameter 0, the location parameter 0, the scale parameter d, and the truncated point 0. We set β 0 = 0.5, γ = 0.5, β 1 = {0.25, 0.50}, d = 5, and the MAF p ∈ {0.05, 0.1, · · · , 0.5}. The total sample size is 1,500. And we still use ξ = Φ −1 (0.90).   Table S9 show that all of the five tests could control the type I error correctly  Table S9. The empirical type I errors of KW, Z R , Z A , MAX3, and TPP for the model with covariates.
The error term follows tGEV(0,0,5,0). The sample size is 1,500. The left panel is for the significant level α = 0.05 and the right panel is for the significant level α = 0.001.