Skip to main content
  • Methodology Article
  • Open access
  • Published:

A two-phase procedure for non-normal quantitative trait genetic association study

Abstract

Background

The nonparametric trend test (NPT) is well suitable for identifying the genetic variants associated with quantitative traits when the trait values do not satisfy the normal distribution assumption. If the genetic model, defined according to the mode of inheritance, is known, the NPT derived under the given genetic model is optimal. However, in practice, the genetic model is often unknown beforehand. The NPT derived from an uncorrected model might result in loss of power. When the underlying genetic model is unknown, a robust test is preferred to maintain satisfactory power.

Results

We propose a two-phase procedure to handle the uncertainty of the genetic model for non-normal quantitative trait genetic association study. First, a model selection procedure is employed to help choose the genetic model. Then the optimal test derived under the selected model is constructed to test for possible association. To control the type I error rate, we derive the joint distribution of the test statistics developed in the two phases and obtain the proper size.

Conclusions

The proposed method is more robust than existing methods through the simulation results and application to gene DNAH9 from the Genetic Analysis Workshop 16 for associated with Anti-cyclic citrullinated peptide antibody further demonstrate its performance.

Background

The past decades have witnessed many biological and epidemiological discoveries through the experimental design of genetic association studies based on the development of biological technology. Many variants have been identified to be associated with the quantitative traits. For example, in studying genetic loci in association with various phenotypes, 180 were reported to be associated with human height [1], 106 were associated with age at menarche [2], 97 were identified to be associated with body mass index [3], and the single-nucleotide polymorphism (SNP) rs4702 was associated with both diastolic and systolic blood pressure levels [4]. A standard approach to conduct an association test in a quantitative trait genetic study is to fit a linear model based on the assumption that the original or transformed trait values follow a normal distribution. However, the normal assumption is often violated for many traits even though some transformations such as the Log-transformation are carried out. For example, the number of tumors per subject in mouse follows a negative binomial distribution [5] and the survival time of a person follows a truncated distribution [6]. A good alternative to address this issue is to use the nonparametric tests.

Although there are various nonparametric tests in the literature, the most commonly used ones in genetic studies are the Kruskal-Wallis test (denote it by KW) [7] and the Jonckheere-Tepstra test (denote it by JT) [8, 9]. Originally, the KW was designed to detect the differences of the response variable in the medians of three groups and it was a nonparametric version of one-way analysis of variance based on ranking. The JT was also a rank-based test for an ordered alternative hypothesis which was particularly sensitive to the genetic mode of inheritance. Recently, Zhang and Li [10] defined the nonparametric risk and nonparametric odds and proposed a nonparametric trend test (NPT) that has been shown to be more powerful than KW and JT under a given genetic model. These methods, however, would suffer from loss of power when the underlying genetic model is misspecified.

In the present paper, we propose a two-phase robust procedure to test the genetic-phenotypic association. We first construct a test to classify the genetic model in a nonparametric way. We find that the test statistic tends to be positive when the genetic model is dominant, and negative when the model is recessive. Then based on the chosen model, the association test is conducted. We derive the correlation coefficient of the test used for choosing the genetic model and that for doing association study and obtain the proper size for a given nominal significance level. Extensive simulation studies are conducted to show the new approach to have empirical size less than the nominal level, and to compare this new approach with KW and MAX3, the maximum value of three NPTs. The results show that the proposed two-phase procedure is more robust than MAX3 and KW in the sense that its minimum power in a set of plausible models is the highest among the tests under consideration. Finally, a real data analysis is used for further illustration.

Methods

Notations and genetic models

Consider a biallelic marker whose genotype is coded as 0,1, and 2, corresponding to the count of a certain candidate risk allele or a minor allele. Suppose that there are n subjects that are independently sampled from a source population in a quantitative trait genetic association study. Let (y i ,g i ),i=1,2,,n be the observed sample, where y i is the trait value and g i denotes the genotype value of the ith subject, i=1,2,,n. For brevity, let the first n 0 subjects have genotype 0, the second n 1 subjects have genotype 1, and the last n 2 subjects possess genotype 2. Denote f ij =Pr(Y i <Y j ),i,j=0,1,2, where Y 0,Y 1 and Y 2 are the random variables that take values in three sets \(\{y_{1},y_{2},\cdots,y_{n_{0\phantom {\dot {i}\!}}}\}\), \(\{y_{n_{0}+1},y_{n_{0}+2},\cdots,y_{n_{0}+n_{1}} \}\) and \(\{y_{n_{0}+n_{1}+1}, y_{n_{0}+n_{1}+2}, \cdots, y_{n}\},\) respectively. The null hypothesis of no association is given by H 0:f 01=f 02=1/2. The alternative hypothesis is H 1:f 02f 01≥1/2 and f 02>1/2.

A genetic model specifies the mode of inheritance. The three genetic models are: recessive model (REC) if f 01=1/2 and f 12=f 02>1/2, additive model (ADD) if f 01=f 12>1/2 and f 02>1/2, and dominant model (DOM) if \(f_{01}=f_{02} > \frac {1}{2}\) and f 12=1/2.

Model selection

Denote Δ 1=f 01−1/2, Δ 2=f 12−1/2. We find that Δ 1Δ 2 tends to be negative value under the recessive model and take positive under the dominant model. The signs of (Δ 1,Δ 2) under the three genetic models are plotted in Fig. 1, where the line corresponding to the additive model is the straight line with a slope of 1 at the point C, C=(1/2,1/2)τ and τ denotes the transpose of a vector or a matrix, and the other two lines are for the recessive and dominant models, respectively. The recessive and dominant models form the boundaries of the space under the alternative hypothesis. The vertex C corresponds to the null hypothesis. Denote

$$\begin{aligned} {}\hat{f}_{01}&=\frac{1}{n_{0}n_{1}}\sum\limits_{i=1}^{n_{0}}\sum\limits_{j=n_{0}+1}^{n_{0}+n_{1}}I(y_{i}<y_{j}),\\ {}\hat{f}_{12}&=\frac{1}{n_{1}n_{2}}\sum\limits_{j=n_{0}+1}^{n_{0}+n_{1}}\sum\limits_{k=n_{0}+n_{1}+1}^{n}I(y_{j}<y_{k}),\\ {}\widehat\sigma_{01}^{2}&=\frac{n_{1}-1}{{n_{0}^{2}}n_{1}}\sum\limits_{i=1}^{n_{0}}\left[\frac{1}{n_{1}} \sum\limits_{j=n_{0}+1}^{n_{0}+n_{1}}I(y_{i}<y_{j})-1/2\right]^{2}\\ {}&\quad+\!\frac{n_{0}\,-\,1}{n_{0}{n_{1}^{2}}}\sum\limits_{j=n_{0}+1}^{n_{0}+n_{1}}\left[\frac{1}{n_{0}}\!\sum\limits_{i=1}^{n_{0}}I(y_{i}\!<\!y_{j})\,-\,1/2\right]^{2} \,+\,\frac{1}{4n_{0}n_{1}},\\ {}\widehat\sigma_{01,12}^{2}&=\frac{1}{{n_{1}^{2}}}\sum\limits_{j=n_{0}+1}^{n_{0}+n_{1}}\left[\frac{1}{n_{0}}\sum\limits_{i=1}^{n_{0}} I(y_{i}<y_{j})-1/2\right]\\ &\quad\left[\frac{1}{n_{2}}\sum\limits_{k=n_{0}+n_{1}+1}^{n}I(y_{j}<y_{k})-1/2\right], \end{aligned} $$

and

$$\begin{aligned} {} \widehat\sigma_{12}^{2}&=\frac{n_{2}-1}{{n_{1}^{2}}n_{2}}\sum\limits_{j=n_{0}+1}^{n_{0}+n_{1}} \left[\frac{1}{n_{2}}\sum\limits_{k=n_{0}+n_{1}+1}^{n}I(y_{i}<y_{k})-1/2\right]^{2}\\ &\quad+\frac{n_{1}-1}{n_{1}{n_{2}^{2}}}\sum\limits_{k=n_{0}+n_{1}+1}^{n}\left[\frac{1}{n_{1}}\sum\limits_{j=n_{0}+1}^{n_{0}+n_{1}}I(y_{j}<y_{k})-1/2\right]^{2}\\ &\quad+\displaystyle\frac{1}{4n_{1}n_{2}}. \end{aligned} $$
Fig. 1
figure 1

The common three genetic models in the genetic model space. The point C=(1/2,1/2) corresponds to the null hypothesis

Then \(\hat f_{01}\) and \(\hat f_{12}\) are the consistent estimators of f 01 and f 12, respectively, \(\widehat \sigma _{01}^{2}\) and \(\widehat \sigma _{12}^{2}\) are, respectively, the consistent estimators of the variances of \(\hat f_{01}\) and \(\hat f_{12}\), and \(\widehat \sigma _{01,12}^{2}\) is the consistent estimator of the covariance between \(\hat f_{01}\) and \(\hat f_{12}\). Define a test statistic for genetic model selection as

$$\begin{array}{*{20}l} Z_{1}=\frac{\hat f_{01}-\hat f_{12}}{\sqrt{\hat\sigma_{01}^{2}-2\hat\sigma_{01,12}^{2}+\hat\sigma_{12}^{2}}}. \end{array} $$

Under the null hypothesis, Z 1 asymptotically follows the standard normal distribution. So the genetic models can be determined as follows: i) if Z 1>ξ (>0), then the genetic model is dominant; ii) if Z 1<−ξ, then the genetic model is recessive; otherwise, the additive model is claimed. Here, ξ is set to be the 90 % quantile of the standard normal distribution.

The nonparametric test under a given genetic model

Denote

$$\begin{aligned} {}\hat{f}_{02}&=\frac{1}{n_{0}n_{2}}\sum\limits_{i=1}^{n_{0}}\sum\limits_{k=n_{0}+n_{1}+1}^{n}I(y_{i}<y_{k}),\\ {}\hat{f}_{R}&=\frac{1}{(n_{0}+n_{1})n_{2}}\sum\limits_{i=1}^{n_{0}+n_{1}}\sum_{k=n_{0}+n_{1}+1}^{n}I\{y_{i}<y_{k}\}\\ &=\frac{n_{0}}{n_{0}+n_{1}}\hat f_{02} + \frac{n_{1}}{n_{0}+n_{1}}\hat f_{12},\\ {}\hat\sigma_{02}^{2}&=\frac{n_{2}-1}{{n_{0}^{2}}n_{2}}\sum\limits_{i=1}^{n_{0}}\left[\frac{1}{n_{2}}\sum\limits_{k=n_{0}+n_{1}+1}^{n}I(y_{i}<y_{k})-1/2\right]^{2}\\ &\quad\!\,+\,\frac{n_{0}\,-\,1}{n_{0}{n_{2}^{2}}}\sum\limits_{k=n_{0}+n_{1}+1}^{n}\!\left[\!\frac{1}{n_{0}}\sum\limits_{i=1}^{n_{0}}I\!(y_{i}\!\!<\!\!y_{k})\,-\,\!1/2\!\right]^{2}\!\,+\,\!\frac{1}{4n_{0}n_{2}},\\ {}\widehat\sigma_{02,12}^{2}&=\frac{1}{{n_{2}^{2}}}\sum\limits_{k=n_{0}+n_{1}+1}^{n}\left[\frac{1}{n_{0}}\sum\limits_{i=1}^{n_{0}}I(y_{i}<y_{k})-1/2\right]\\ &\quad\left[\frac{1}{n_{1}}\sum\limits_{j=n_{0}+1}^{n_{0}+n_{1}}I(y_{j}<y_{k})-1/2\right], \end{aligned} $$

and

$${}{\widehat\sigma_{R}^{2}} =\frac{{n_{0}^{2}}}{(n_{0}+n_{1})^{2}}\widehat{\sigma}_{02}^{2} + \frac{2n_{0}n_{1}}{(n_{0}+n_{1})^{2}}\widehat{\sigma}_{02,12}^{2} + \frac{{n_{1}^{2}}}{(n_{0}+n_{1})^{2}}\widehat{\sigma}_{12}^{2}. $$

Then the NPT under the recessive model can be given by \(Z_{R} = (\hat f_{R}-1/2)/\hat \sigma _{R}\).

Let

$$\begin{aligned} {}\hat{f}_{01}&\,=\,\frac{1}{n_{0}n_{1}}\sum\limits_{i=1}^{n_{0}}\sum\limits_{j=n_{0}+1}^{n_{0}+n_{1}}I(y_{i}<y_{j}),\\ {}{w}_{1}^{*}&\,=\,\sqrt{\!(n_{0}\,+\,n_{1})\!\big/\!\left[\!(n\,+\,n_{1}\!)\widehat \sigma_{01}^{2}\!\right]}, w_{2}^{*}\,=\,\sqrt{\!(n_{1}\,+\,n_{2})\!\big/\!\left[\!(n\,+\,n_{1}\!)\widehat \sigma_{12}^{2}\right]},\\ {}w_{1}&\,=\,\frac{w_{1}^{*}}{w_{1}^{*}+w_{2}^{*}}, w_{2}=\frac{w_{2}^{*}}{w_{1}^{*}+w_{2}^{*}} \end{aligned} $$

and

$$\begin{array}{*{20}l} {}{\hat{f}}_{A}=w_{1}\hat f_{01}+w_{2}\hat f_{12}, {\widehat\sigma^{2}_{A}}= {w_{1}^{2}}\widehat\sigma_{01}^{2} + 2w_{1}w_{2}\widehat\sigma_{01,12}^{2} + {w_{2}^{2}}\widehat\sigma_{12}^{2}. \end{array} $$

Then, the NPT under the additive genetic model is \(Z_{A} =(\hat f_{A}-1/2)/\hat \sigma _{A}\).

Similarly, denote

$${\fontsize{8.1pt}{9.6pt}{\begin{aligned} {}\widehat\sigma_{01}^{2}&=\frac{n_{1}-1}{{n_{0}^{2}}n_{1}}\sum\limits_{i=1}^{n_{0}}\left[\frac{1}{n_{1}}\sum\limits_{j=n_{0}+1}^{n_{0}+n_{1}} I(y_{i}<y_{j})-1/2\right]^{2}\\ &\quad+\!\frac{n_{0}\,-\,1}{n_{0}{n_{1}^{2}}}\sum\limits_{j=n_{0}+1}^{n_{0}+n_{1}}\left[\frac{1}{n_{0}}\sum\limits_{i=1}^{n_{0}}I(y_{i}\!<\!y_{j})\,-\,1/2\!\right]^{2} \,+\,\frac{1}{4n_{0}n_{1}},\\ {}\widehat{\sigma}_{01,02}^{2}&=\frac{1}{{n_{0}^{2}}}\sum\limits_{i=1}^{n_{0}}\left[\frac{1}{n_{1}}\sum\limits_{j=n_{0}+1}^{n_{0}+n_{1}} I(y_{i}<y_{j})-1/2 \right]\\ &\quad\left[\frac{1}{n_{2}}\sum\limits_{k=n_{0}+n_{1}+1}^{n}I(y_{i}<y_{k})-1/2\right].\\ {}\hat{f}_{D}&=\frac{1}{n_{0}(n_{1}+n_{2})}\sum\limits_{i=1}^{n_{0}}\sum_{j=n_{0}+1}^{n}I\{y_{i}\!<y_{j}\}=\frac{n_{1}}{n_{1}+n_{2}}\hat f_{01} + \frac{n_{2}}{n_{1}+n_{2}}\hat f_{02}, \end{aligned}}} $$

and

$$\begin{array}{*{20}l} {}{\widehat\sigma_{D}^{2}}=\frac{{n_{1}^{2}}}{(n_{1}+n_{2})^{2}}\widehat{\sigma}_{01}^{2} + \frac{2n_{1}n_{2}}{(n_{1}+n_{2})^{2}}\widehat{\sigma}_{01,02}^{2} + \frac{{n_{2}^{2}}}{(n_{1}+n_{2})^{2}}\widehat{\sigma}_{02}^{2}. \end{array} $$

Then the NPT under the dominant model is \(Z_{D} = (\hat f_{D}-1/2)/\hat \sigma _{D}\). Under the null hypothesis, Z R , Z A and Z D follow the standard normal distribution.

Two-phase procedure

We propose a two-phase procedure (TPP) for the quantitative trait association study by first determining the underlying genetic model in the first phase, followed by testing the association with the corresponding NPT for the selected model in the second phase. In details, the two-phase procedure can be described by the following two steps:

Step 1. Determine the genetic model using Z 1. If Z 1<−ξ, the recessive model is used, else if Z 1>ξ, we use the dominant model, otherwise, the additive model is used.

Step 2. We choose the association test statistic based on the chosen model in Step 1 and do the association study.

Size adjustment

To adjust the size of the two-phase procedure for a given overall nominal significance level, we need to derive the joint distribution of Z 1 and Z x , x{R,A,D}. From the Additional file 1, under the null hypothesis, (Z 1,Z x )τ asymptotically follows a bivariate normal distribution with mean (0,0) and Λ x , where

$$\Lambda_{x}=\left(\begin{array}{cc} 1&\rho_{x}\\ \rho_{x}&1 \end{array} \right),~~~x\in\{R, A, D\}. $$

Denote the cumulative distribution function of Y 0, Y 1 and Y 2 by F 0, F 1 and F 2, respectively. Then ρ R , ρ A and ρ D are functions of F 0,F 1,F 2 and p (the minor allele frequency, or MAF, for short), which can be estimated empirically based on the observed data. The consistent estimates can be obtained under the situation that the means of the trait values in the groups with different genotypes being equal. The technical details of derivations for ρ R , ρ A and ρ D under the null hypothesis are presented in the Additional file 1. Suppose that the null hypothesis is rejected at the level of α in the second phase. Then, to control the overall level at a given α, we have \(\alpha =\text {P}_{H_{0}}\left (Z_{1} < -\xi, |Z_{R}|>z(1-\alpha ^{*}/2)\right)+ \text {P}_{H_{0}}\left (|Z_{1}| < \xi, |Z_{A}|>z(1-\alpha ^{*}/2)\right) + \text {P}_{H_{0}}\left (Z_{1} > \xi, |Z_{D}|>\right. \left.z(1-\alpha ^{*}/2)\right)\), where z(α) is the α quantile of the standard normal distribution. So, this relation can be written as

$${\fontsize{8.1pt}{9.6pt}{\begin{aligned} {}\alpha&\,=\,\displaystyle\int_{\Omega_{R}}\left\{\Phi\left(\frac{\!-z(1\,-\,\alpha^{*}/2)\,-\,\rho_{R}u}{(1\,-\,{\rho_{R}^{2}})^{1/2}}\right)\,+\, \Phi\!\left(\frac{-z(1\,-\,\alpha^{*}/2)\,+\,\rho_{R}u}{(1\,-\,{\rho_{R}^{2}})^{1/2}}\right)\!\right\}\!\mathrm{d}\Phi(u) \\ {}&\quad+\displaystyle\int_{\Omega_{A}}\!\!\left\{\!\Phi\!\left(\!\frac{-z(1\,-\,\alpha^{*}/2)\,-\,\rho_{A}u}{(1\,-\,{\rho_{A}^{2}})^{1/2}}\!\right)\,+\, \Phi\!\left(\!\frac{-z(1\,-\,\alpha^{*}\!/2)\,+\,\rho_{A}u}{(1\,-\,{\rho_{A}^{2}})^{1/2}}\!\right)\!\right\}\!\mathrm{d}\Phi(u)\\ {}&\quad+\!\! \displaystyle \int_{\Omega_{D}}\!\left\{\!\Phi\!\left(\frac{\!-z(1\,-\,\alpha^{*}\!/2)\,-\,\rho_{D}u}{(1\,-\,{\rho_{D}^{2}})^{1/2}}\!\right)\!+ \!\Phi\!\left(\frac{\!-z(1\,-\,\alpha^{*}\!/2)\,+\,\rho_{D}u}{(1\,-\,{\rho_{D}^{2}})^{1/2}}\!\right)\!\right\}\!\mathrm{d}\Phi(u), \end{aligned}}} $$

where Ω R ={u:u<−ξ}, Ω A ={u:−ξuξ}, Ω D ={u:u>ξ}, and Φ(·) is the cumulative distribution function of the standard normal distribution. Under the null hypothesis, we can numerically calculate the adjusted significant level for the association test statistic in the second phase. Table 2 shows the mean and standard error of α with the nominal level of 0.05 and 0.001 based on 1,000 and 50,000 replicates, respectively. It indicates that α is more likely to be smaller than α, and also α is less vulnerable to the MAF. For example, when MAF=0.25, the adjusted levels for the nominal α=0.05 and α=0.001 are 0.0360 and 0.00065, and the corresponding standard error are 0.0003 and 0.000013, respectively.

Results

The performance of model selection procedure

We conduct simulation studies to explore the performance of the model selection procedure. We generate data considering three genetic models. Consider the linear model Y=β 0+G β 1+ε, where Y denotes the phenotype value, G denotes the genotype value at a SNP locus, and ε follows a truncated generalized extreme value distribution (a heavy-tailed distribution, denoted as tGEV(0, 0, d, 0)) with the shape parameter 0, the location parameter 0, the scale parameter d, and the truncated point 0. Here we specify β 0=0.50, β 1=0.50, d=5, and the MAF p{0.05,0.10,,0.50}. The total sample size is 1,500. 10,000 replicates are conducted to compute the true selection rate (TSR) under different scenarios. Table 1 shows the results for ξ=Φ −1(0.90)=1.282. The other results for ξ=Φ −1(0.80)=0.841, ξ=Φ −1(0.85)=1.036 and Φ −1(0.95)=1.645 are given in the Additional file 1. From Table 1, we can see that the TSR increases as MAF increases. For example, if the recessive model is true, the TSR is 19.48 % for MAF of 0.05, while it is 86.21 % for MAF of 0.50. It makes sense since the expected number of subjects with genotype 2 is increasing with the MAF increasing. We also find that the TSR for additive model is satisfactory with the TSR being around 80 %. For example, the TSR are 79.23 % and 80.09 % corresponding to MAF of 0.05 and 0.50, respectively. Besides this, we also conduct simulations with covariates considering Y=β 0+X γ+G β 1+ε, where X is a covariate. The detailed results are available in the Additional file 1.

Table 1 The true selecting rate (%) of genetic model using Z 1 with ξ=Φ −1(0.9) when the error follows tGEV(0,0,5,1)

The adjusted significant level

Table 2 shows the adjusted α of the TPP under the null hypothesis. The parameter setting is the same as above. When the nominal level is 0.05, we calculate the mean and standard deviation (SD) based on 2,000 replicates. And 50,000 replicates are conducted for the nominal level of 0.001. The results indicate that the adjusted level is always less than the nominal significant level α. For example, when MAF =0.25, the adjusted levels for the nominal level α=0.05 and α=0.001 are 0.0310 and 0.00059, respectively. And the value of α is relatively stable because its standard deviation can be omitted compared with the means. For example, when MAF =0.1, the adjusted levels for the nominal level α=0.05 and α=0.001 are 0.0335 and 0.00063, and the corresponding standard deviations are 0.00169 and 0.000039, respectively.

Table 2 The adjusted level α for the nominal significant level α of 0.05 and 0.001

Type I error rate

We evaluate the empirical type I error rates of five tests: KW, Z R , Z A , MAX3, and TPP. The simulation settings are similar as above. The sample size is 1,500. Here we use ξ=Φ −1(0.90), β 0=0.50, and p{0.05,0.10,,0.50}. 2,000 replicates are conducted for the nominal significant level of 0.05 and 50,000 replicates are conducted for the nominal significant level of 0.001. Table 3 shows the empirical type I errors of the five tests under the significant level of 0.05 and 0.001. The results show that all of the five tests could control the type I error correctly with the empirical values being close to the nominal significance level. For example, when MAF =0.20, the empirical type I error rates of KW, Z R , Z A , MAX3, and TPP test are 0.046, 0.048, 0.051, 0.045, and 0.041, respectively, under the significant level of 0.05. When MAF =0.35 and the nominal significant level is 0.001, the empirical type I error rates of KW, Z R , Z A , MAX3, and TPP test are 0.00090, 0.00086, 0.00098, 0.00090, and 0.00080, respectively.

Table 3 The empirical type I errors of KW, Z R , Z A , MAX3, and TPP when the error term follows tGEV(0,0,5,0)

Power

We compare the power among KW, Z R , Z A , MAX3 and TPP under the similar settings to those described above. Figures 2 and 3 report the power results for the nominal level of 0.05 and 0.001, respectively, under the recessive, additive, and dominant models. In order to make the power comparable, when the nominal level is 0.001, we specify d=3 for β 1=0.25 and d=5 for β 1=0.50, and when the nominal level is 0.05, we set d=5 and β 1={0.25,0.50}. The results indicate that, except the NPT test under the true genetic model, the proposed TPP is always more powerful than KW and MAX3. this makes sense because that NPT under a given model (Z R , Z A ) is the most powerful under that model, and the model selection procedure always gives a large probability of TSR. TPP is more powerful than KW, Z A , and MAX3 under the recessive model and in most scenarios under the dominant model. In some cases, there are 6 % power increase. For example, when MAF is 0.20, β 1=0.50, α=0.05 and the genetic model is recessive, the empirical powers of KW, Z A , MAX3, and TPP are 0.335, 0.202, 0.418, and 0.473, respectively. The performance of TPP is superior than the other three test KW, Z R and MAX3 when the true model is additive or dominant. For example, when MAF is 0.30 and the genetic model is additive, β 1=0.50, α=0.001, the empirical powers of KW, Z R , MAX3, and TPP are 0.321, 0.128, 0.407, and 0.431, respectively. Furthermore, using Z R under the additive or dominant model can result in substantial loss of power. The TPP has higher robustness against the genetic model than other four tests. For example, when α=0.05 and β 1=0.50, the minimum value of power for TPP over MAF from 0.10 to 0.50 is 0.137 under the recessive, additive and dominant model, which is larger than those of KW (0.099), Z R (0.103), Z A (0.070), and MAX3 (0.112).

Fig. 2
figure 2

The powers of KW, Z R , Z A , MAX3, and TPP with tGEV(0,0,5,0) error under three genetic models. The nominal level is 0.05. The first column is for β 1=0.25 and the second column is for β 1=0.50. The total number of the subjects is n=1,500

Fig. 3
figure 3

The powers of KW, Z R , Z A , MAX3, and TPP with tGEV(0,0,d,0) error under three genetic models. The nominal level is 0.001. The first column is for β 1=0.25,d=3 and the second column is for β 1=0.50,d=5. The total number of the subjects is n=1,500

Application to gene DNAH9 associated with anti-CCP measure

We apply KW, Z A , MAX3 and TPP to identify the deleterious SNPs in the gene DNAH9 [11] for the association with the anti-CCP measure using the data from Genetic Workshop 16 [12, 13]. The anti-CCP is present in the blood of the majority of patients with rheumatoid arthritis (RA). The data includes 867 cases (with anti-CCP) and 1,195 controls (without anti-CCP). Here we impute them with the minimum value of the anti-CCP values in cases, which is 20.053 following Zheng et al. (2012)[14]. We remove the effect of population stratification using four principal coordinates [15] following Zhang and Li [10] and take the residuals as the new outcome. There are 92 SNPs in gene DNAH9 on Chromosome 17. We calculate the p-values of these SNPs using the KW, Z A , MAX3 and TPP approaches. There are six SNPs in gene DNAH9 whose proportions of the missing genotype value are more than 15 %, so we only show the p-value of the remaining 86 SNPs. In the main text, we shows the results of the SNPs whose p-values are relatively small (almost less than 0.05) in Table 4 and the p-values of the other SNPs are summarized in Table S10 in the Additional file 1. We find that the SNP rs11655963 has the minimum p-value of 2.72×10−5 using the TPP. The corresponding p-values using KW, Z A , and MAX3 are 1.18×10−4, 8.40×10−5 and 7.44×10−5, respectively. Burton et al.(2007)[16] proposed to use the p-value threshold of 5×10−5 as the moderate association at the genome-wide level. Because the p-values of KW, Z A and MAX are all larger than 5×10−5, there are no moderate genome-wide associations. However, for the TPP, we calculate the adjusted p-value threshold for 5×10−5 and it is 3.64×10−5. This indicates that using the TPP can give the moderate-strong effect.

Table 4 The p-values of 17 SNPs in gene DNAH9 for the association with Anti-CCP Measure

Discussion and Conclusion

With the developments of biological technology, more and more data on quantitative traits and genotypes are generated and deposited in public database such as The National Center for Biotechnology Information database. It is urgent to develop new methods to excavate useful information to help understand the etiology of human complex diseases. A nonparametric two-phase procedure is proposed here to test the association between a di-allelic SNP and a non-normal distributed quantitative trait when the genetic model is unknown. Simulation results show that the proposed TPP is more robust than the existing methods.

If there are covariates needed to be adjusted for, we can first regress on the covariates and use the residuals as the new outcome and then employ TPP to conduct the association study. The detailed simulation results are presented in Additional file 1. Besides the truncated generalized extreme value distributional (a heavy-tailed distribution) error term with the truncation point 0, we also consider the error term following the centralized t distribution and general generalized extreme value distribution, respectively. The results are given in Additional file 1, where the similar results are observed.

References

  1. Lango AH, Estrada K, Lettre G, Berndt SI, Weedon MN, Rivadeneira F, et al. Hundreds of variats clustered in genomic loci and biological pathways affect human height. Nature. 2010; 467:832–8.

    Article  Google Scholar 

  2. Perry JR, Day F, Elks CE, Sulem P, Thompson DJ, Ferreira T, et al. Parent-of-origin-specific allelic associations among 106 genomic loci for age at menarche. Nature. 2014; 514(7520):92–7.

    Article  PubMed  CAS  PubMed Central  Google Scholar 

  3. Locke AE, Kahali B, Berndt SI, Justice AE, Pers TH, Day FR, et al. Genetic studies of body mass index yield new insights for obesity biology. Nature. 2015; 518(7538):197–206.

    Article  PubMed  CAS  PubMed Central  Google Scholar 

  4. Turpeinen H, Seppälä I, Lyytikäinen LP, Raitoharju E, Hutri-Kähönen N, Levula M, et al. A genome-wide expression quantitative trait loci analysis of proprotein convertase subtilisin/kexin enzymes identifies a novel regulatory gene variant for FURIN expression and blood pressure. Hum Genet. 2015; 134:627–636.

    Article  PubMed  CAS  Google Scholar 

  5. Drinkwater NR, Klotz JH. Statistical methods for the analysis of tumor multiplicity data. Cancer Res. 1981; 41:113–9.

    PubMed  CAS  Google Scholar 

  6. Chen H, Lumley T, Brody J, Heard-Costa NL, Fox CS, Cupples LA, Dupuis J. Sequence kernel association test for survival traits. Genet Epidemiol. 2014; 38:191–7.

    Article  PubMed  PubMed Central  Google Scholar 

  7. Kruskal WH, Wallis WA. Use of ranks in one-criterion variance analysis. J Am Stat Assoc. 1952; 47:583–621.

    Article  Google Scholar 

  8. Jonckheere A. A distribution-free k-sample test against ordered alternatives. Biometrika. 1954; 41:133–45.

    Article  Google Scholar 

  9. Terpstra TJ. The asymptotic normality and consistency of Kendalls test against trend, when ties are present in one ranking. Indagationes Mathematicae. 1952; 14:327–33.

    Article  Google Scholar 

  10. Zhang W, Li Q. Nonparametric risk and nonparametric odds in quantitative genetic association studies. Sci Rep-UK. 2015; 5:12105.

    Article  CAS  Google Scholar 

  11. Lin Y, Zhang M, Wang L, Pungpapong V, Fleet JC, Zhang D. Simultaneous genome-wide association studies of anti-cyclic citrullinated peptide in rheumatoid arthritis using penalized orthogonal-components regression. BMC Proc. 2009; 3. Suppl 7:S20.

  12. Black MH, Watanabe RM. A principal-components-based clustering method to identify multiple variants associated with rheumatoid arthritis and arthritis-related autoantibodies. BMC Proc. 2009; 3. Suppl 7:S129.

  13. Amos CI, Chen WV, Seldin MF, Remmers EF, Taylor KE, Criswell LA, et al. Data for Genetic Analysis Workshop 16 Problem 1, association analysis of rheumatoid arthritis data. BMC Proc. 2009; 3. Suppl 7:S2.

  14. Zheng G, Wu CO, Kwak M, Jiang W, Joo J, Lima JAC. Joint analysis of binary and quantitative traits with data sharing and outcome-dependent sampling. Genet Epidemiol. 2012; 36:263–73.

    Article  PubMed  Google Scholar 

  15. Li Q, Yu K. Improved correction for population stratification in genome-wide association studies by identifying hidden population structures. Genet Epidemiol. 2008; 32(3):215–26.

    Article  PubMed  CAS  Google Scholar 

  16. Burton PR, Clayton DG, Cardon LR, Craddock N, Deloukas P, Duncanson A, et al. Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature. 2007; 447:661–78.

    Article  CAS  Google Scholar 

Download references

Acknowledgments

Q. Li was supported in part by the National Science Foundation of China, Grant No. (11371353, 61134013) and the Breakthrough Project of Strategic Priority Program of the Chinese Academy of Sciences, Grant No. XDB13040600. The authors thank Dr. Aiyi Liu of National Institute of Child Health and Human Development (NICHD), National Institutes of Health (NIH), for his helpful comments. We also thank the Editor, Associate Editor and three anonymous reviewers for their careful reading and insightful comments, which greatly improved our manuscript.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Huiyun Li.

Additional information

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

WZ contributed to the design of the study and performed the analysis; HL and QL conceived the idea and drafted the manuscript; all authors participated in data interpretation, read and approved the final manuscript.

Additional file

Additional file 1

The derivations of ρ R , ρ A and ρ D under the null hypothesis. Consistent estimators of ρ R , ρ A and ρ D under the null hypothesis. Additional simulation results for the model selection procedure. Simulation results for the error term following the generalized extreme distribution. Simulation results for the error term following the centralized t distribution. Simulation results for the model with covariates. Additional p-value results of the SNPs in gene DNAH9 for the associated with Anti-CCP Measure. (PDF 179 kb)

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License(http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Zhang, W., Li, H., Li, Z. et al. A two-phase procedure for non-normal quantitative trait genetic association study. BMC Bioinformatics 17, 52 (2016). https://doi.org/10.1186/s12859-016-0888-x

Download citation

  • Received:

  • Accepted:

  • Published:

  • DOI: https://doi.org/10.1186/s12859-016-0888-x

Keywords