A two-phase procedure for non-normal quantitative trait genetic association study
- Wei Zhang^{1},
- Huiyun Li^{2}Email author,
- Zhaohai Li^{3} and
- Qizhai Li^{1}
https://doi.org/10.1186/s12859-016-0888-x
© Zhang et al. 2016
Received: 14 May 2015
Accepted: 6 January 2016
Published: 28 January 2016
Abstract
Background
The nonparametric trend test (NPT) is well suitable for identifying the genetic variants associated with quantitative traits when the trait values do not satisfy the normal distribution assumption. If the genetic model, defined according to the mode of inheritance, is known, the NPT derived under the given genetic model is optimal. However, in practice, the genetic model is often unknown beforehand. The NPT derived from an uncorrected model might result in loss of power. When the underlying genetic model is unknown, a robust test is preferred to maintain satisfactory power.
Results
We propose a two-phase procedure to handle the uncertainty of the genetic model for non-normal quantitative trait genetic association study. First, a model selection procedure is employed to help choose the genetic model. Then the optimal test derived under the selected model is constructed to test for possible association. To control the type I error rate, we derive the joint distribution of the test statistics developed in the two phases and obtain the proper size.
Conclusions
The proposed method is more robust than existing methods through the simulation results and application to gene DNAH9 from the Genetic Analysis Workshop 16 for associated with Anti-cyclic citrullinated peptide antibody further demonstrate its performance.
Keywords
Background
The past decades have witnessed many biological and epidemiological discoveries through the experimental design of genetic association studies based on the development of biological technology. Many variants have been identified to be associated with the quantitative traits. For example, in studying genetic loci in association with various phenotypes, 180 were reported to be associated with human height [1], 106 were associated with age at menarche [2], 97 were identified to be associated with body mass index [3], and the single-nucleotide polymorphism (SNP) rs4702 was associated with both diastolic and systolic blood pressure levels [4]. A standard approach to conduct an association test in a quantitative trait genetic study is to fit a linear model based on the assumption that the original or transformed trait values follow a normal distribution. However, the normal assumption is often violated for many traits even though some transformations such as the Log-transformation are carried out. For example, the number of tumors per subject in mouse follows a negative binomial distribution [5] and the survival time of a person follows a truncated distribution [6]. A good alternative to address this issue is to use the nonparametric tests.
Although there are various nonparametric tests in the literature, the most commonly used ones in genetic studies are the Kruskal-Wallis test (denote it by KW) [7] and the Jonckheere-Tepstra test (denote it by JT) [8, 9]. Originally, the KW was designed to detect the differences of the response variable in the medians of three groups and it was a nonparametric version of one-way analysis of variance based on ranking. The JT was also a rank-based test for an ordered alternative hypothesis which was particularly sensitive to the genetic mode of inheritance. Recently, Zhang and Li [10] defined the nonparametric risk and nonparametric odds and proposed a nonparametric trend test (NPT) that has been shown to be more powerful than KW and JT under a given genetic model. These methods, however, would suffer from loss of power when the underlying genetic model is misspecified.
In the present paper, we propose a two-phase robust procedure to test the genetic-phenotypic association. We first construct a test to classify the genetic model in a nonparametric way. We find that the test statistic tends to be positive when the genetic model is dominant, and negative when the model is recessive. Then based on the chosen model, the association test is conducted. We derive the correlation coefficient of the test used for choosing the genetic model and that for doing association study and obtain the proper size for a given nominal significance level. Extensive simulation studies are conducted to show the new approach to have empirical size less than the nominal level, and to compare this new approach with KW and MAX3, the maximum value of three NPTs. The results show that the proposed two-phase procedure is more robust than MAX3 and KW in the sense that its minimum power in a set of plausible models is the highest among the tests under consideration. Finally, a real data analysis is used for further illustration.
Methods
Notations and genetic models
Consider a biallelic marker whose genotype is coded as 0,1, and 2, corresponding to the count of a certain candidate risk allele or a minor allele. Suppose that there are n subjects that are independently sampled from a source population in a quantitative trait genetic association study. Let (y _{ i },g _{ i }),i=1,2,⋯,n be the observed sample, where y _{ i } is the trait value and g _{ i } denotes the genotype value of the ith subject, i=1,2,⋯,n. For brevity, let the first n _{0} subjects have genotype 0, the second n _{1} subjects have genotype 1, and the last n _{2} subjects possess genotype 2. Denote f _{ ij }=Pr(Y _{ i }<Y _{ j }),i,j=0,1,2, where Y _{0},Y _{1} and Y _{2} are the random variables that take values in three sets \(\{y_{1},y_{2},\cdots,y_{n_{0\phantom {\dot {i}\!}}}\}\), \(\{y_{n_{0}+1},y_{n_{0}+2},\cdots,y_{n_{0}+n_{1}} \}\) and \(\{y_{n_{0}+n_{1}+1}, y_{n_{0}+n_{1}+2}, \cdots, y_{n}\},\) respectively. The null hypothesis of no association is given by H _{0}:f _{01}=f _{02}=1/2. The alternative hypothesis is H _{1}:f _{02}≥f _{01}≥1/2 and f _{02}>1/2.
A genetic model specifies the mode of inheritance. The three genetic models are: recessive model (REC) if f _{01}=1/2 and f _{12}=f _{02}>1/2, additive model (ADD) if f _{01}=f _{12}>1/2 and f _{02}>1/2, and dominant model (DOM) if \(f_{01}=f_{02} > \frac {1}{2}\) and f _{12}=1/2.
Model selection
Under the null hypothesis, Z _{1} asymptotically follows the standard normal distribution. So the genetic models can be determined as follows: i) if Z _{1}>ξ (>0), then the genetic model is dominant; ii) if Z _{1}<−ξ, then the genetic model is recessive; otherwise, the additive model is claimed. Here, ξ is set to be the 90 % quantile of the standard normal distribution.
The nonparametric test under a given genetic model
Then the NPT under the recessive model can be given by \(Z_{R} = (\hat f_{R}-1/2)/\hat \sigma _{R}\).
Then, the NPT under the additive genetic model is \(Z_{A} =(\hat f_{A}-1/2)/\hat \sigma _{A}\).
Then the NPT under the dominant model is \(Z_{D} = (\hat f_{D}-1/2)/\hat \sigma _{D}\). Under the null hypothesis, Z _{ R }, Z _{ A } and Z _{ D } follow the standard normal distribution.
Two-phase procedure
We propose a two-phase procedure (TPP) for the quantitative trait association study by first determining the underlying genetic model in the first phase, followed by testing the association with the corresponding NPT for the selected model in the second phase. In details, the two-phase procedure can be described by the following two steps:
Step 1. Determine the genetic model using Z _{1}. If Z _{1}<−ξ, the recessive model is used, else if Z _{1}>ξ, we use the dominant model, otherwise, the additive model is used.
Step 2. We choose the association test statistic based on the chosen model in Step 1 and do the association study.
Size adjustment
Results
The performance of model selection procedure
The true selecting rate (%) of genetic model using Z _{1} with ξ=Φ ^{−1}(0.9) when the error follows tGEV(0,0,5,1)
True model | REC | ADD | DOM | |||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
MAF ∖Selection rate | REC | ADD | DOM | REC | ADD | DOM | REC | ADD | DOM | |||
0.05 | 19.48 | 75.59 | 4.93 | 8.21 | 79.23 | 12.56 | 2.40 | 73.34 | 24.26 | |||
0.10 | 34.80 | 63.67 | 1.53 | 8.85 | 80.21 | 10.94 | 1.37 | 64.52 | 34.11 | |||
0.15 | 50.25 | 49.30 | 0.45 | 8.96 | 81.14 | 9.90 | 0.59 | 52.69 | 46.72 | |||
0.20 | 61.19 | 38.60 | 0.21 | 9.63 | 80.22 | 10.15 | 0.27 | 39.68 | 60.05 | |||
0.25 | 71.12 | 28.84 | 0.04 | 9.44 | 80.69 | 9.87 | 0.08 | 30.44 | 69.48 | |||
0.30 | 77.44 | 22.54 | 0.02 | 9.62 | 80.33 | 10.05 | 0.05 | 22.96 | 76.99 | |||
0.35 | 81.94 | 18.04 | 0.02 | 10.00 | 80.37 | 9.63 | 0.04 | 18.00 | 81.96 | |||
0.40 | 84.64 | 15.34 | 0.02 | 9.56 | 80.45 | 9.99 | 0.02 | 15.00 | 84.98 | |||
0.45 | 85.69 | 14.30 | 0.01 | 9.85 | 80.33 | 9.82 | 0.00 | 13.91 | 86.09 | |||
0.50 | 86.21 | 13.75 | 0.04 | 10.16 | 80.09 | 9.75 | 0.02 | 14.14 | 85.84 |
The adjusted significant level
The adjusted level α ^{∗} for the nominal significant level α of 0.05 and 0.001
MAF | 0.05 | 0.10 | 0.15 | 0.20 | 0.25 | 0.30 | 0.35 | 0.40 | 0.45 | 0.50 | |
---|---|---|---|---|---|---|---|---|---|---|---|
α=0.05 | mean | 0.0364 | 0.0335 | 0.0327 | 0.0318 | 0.0310 | 0.0303 | 0.0297 | 0.0293 | 0.0290 | 0.0290 |
sd | 0.00689 | 0.00169 | 0.00173 | 0.00148 | 0.00120 | 0.00093 | 0.00074 | 0.00062 | 0.00058 | 0.00057 | |
α=0.001 | mean | 0.00071 | 0.00063 | 0.00063 | 0.00061 | 0.00059 | 0.00058 | 0.00057 | 0.00056 | 0.00056 | 0.00056 |
sd | 0.000151 | 0.000039 | 0.000034 | 0.000032 | 0.000024 | 0.000016 | 0.000010 | 0.000005 | 0.000003 | 0.000002 |
Type I error rate
The empirical type I errors of KW, Z _{ R }, Z _{ A }, MAX3, and TPP when the error term follows tGEV(0,0,5,0)
α=0.05 | α=0.001 | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|
MAF | KW | Z _{ R } | Z _{ A } | MAX3 | TPP | KW | Z _{ R } | Z _{ A } | MAX3 | TPP | |
0.05 | 0.049 | 0.031 | 0.057 | 0.032 | 0.043 | 0.00064 | 0.00035 | 0.00114 | 0.00078 | 0.00082 | |
0.10 | 0.051 | 0.055 | 0.047 | 0.045 | 0.047 | 0.00076 | 0.00040 | 0.00092 | 0.00062 | 0.00060 | |
0.15 | 0.049 | 0.055 | 0.051 | 0.057 | 0.050 | 0.00098 | 0.00062 | 0.00092 | 0.00070 | 0.00080 | |
0.20 | 0.046 | 0.048 | 0.051 | 0.045 | 0.041 | 0.00098 | 0.00088 | 0.00094 | 0.00102 | 0.00092 | |
0.25 | 0.058 | 0.049 | 0.052 | 0.058 | 0.050 | 0.00090 | 0.00074 | 0.00076 | 0.00088 | 0.00068 | |
0.30 | 0.057 | 0.054 | 0.049 | 0.058 | 0.044 | 0.00112 | 0.00084 | 0.00088 | 0.00114 | 0.00086 | |
0.35 | 0.056 | 0.052 | 0.051 | 0.055 | 0.047 | 0.00090 | 0.00086 | 0.00098 | 0.00090 | 0.00080 | |
0.40 | 0.052 | 0.048 | 0.045 | 0.050 | 0.038 | 0.00114 | 0.00100 | 0.00082 | 0.00106 | 0.00070 | |
0.45 | 0.048 | 0.049 | 0.054 | 0.057 | 0.043 | 0.00090 | 0.00086 | 0.00090 | 0.00108 | 0.00074 | |
0.50 | 0.052 | 0.050 | 0.044 | 0.044 | 0.034 | 0.00078 | 0.00080 | 0.00068 | 0.00090 | 0.00064 |
Power
Application to gene DNAH9 associated with anti-CCP measure
The p-values of 17 SNPs in gene DNAH9 for the association with Anti-CCP Measure
snpid | KW | Z _{ A } | MAX3 | TPP | Genetic model | α ^{∗} |
---|---|---|---|---|---|---|
rs9896319 | 0.0024 | 0.0542 | 0.0060 | 0.0031 | REC | 3.16×10^{−5} |
rs736626 | 0.1008 | 0.0337 | 0.0720 | 0.0337 | ADD | 2.98×10^{−5} |
rs4791473 | 0.0621 | 0.0214 | 0.0436 | 0.0214 | ADD | 2.99×10^{−5} |
rs12946617 | 0.1182 | 0.0459 | 0.0962 | 0.0459 | ADD | 3.75×10^{−5} |
rs7223160 | 0.0894 | 0.0289 | 0.0624 | 0.0289 | ADD | 2.98×10^{−5} |
rs7207282 | 0.1039 | 0.0345 | 0.0738 | 0.0345 | ADD | 2.98×10^{−5} |
rs11657375 | 0.1359 | 0.0412 | 0.0880 | 0.0412 | ADD | 3.29×10^{−5} |
rs11651010 | 0.0002 | 0.0001 | 0.0001 | 0.0001 | ADD | 3.03×10^{−5} |
rs3744580 | 0.0804 | 0.0390 | 0.0561 | 0.0390 | ADD | 2.99×10^{−5} |
rs11655963 | 1.18×10^{−4} | 8.40×10^{−5} | 7.44×10^{−5} | 2.72×10^{−5} | REC | 3.64×10^{−5} |
rs12936861 | 0.0529 | 0.0594 | 0.0336 | 0.0146 | DOM | 2.87×10^{−5} |
rs9896309 | 0.0356 | 0.1863 | 0.0446 | 0.0205 | DOM | 2.86×10^{−5} |
rs7215021 | 0.1275 | 0.0384 | 0.0715 | 0.0384 | ADD | 3.48×10^{−5} |
rs9303041 | 0.0507 | 0.1855 | 0.0327 | 0.0149 | DOM | 3.16×10^{−5} |
rs10445247 | 0.0481 | 0.0320 | 0.0491 | 0.0320 | ADD | 2.92×10^{−5} |
rs3764845 | 0.0719 | 0.0325 | 0.0698 | 0.0325 | ADD | 2.91×10^{−5} |
rs1990236 | 0.0622 | 0.0217 | 0.0365 | 0.0217 | ADD | 3.27×10^{−5} |
Discussion and Conclusion
With the developments of biological technology, more and more data on quantitative traits and genotypes are generated and deposited in public database such as The National Center for Biotechnology Information database. It is urgent to develop new methods to excavate useful information to help understand the etiology of human complex diseases. A nonparametric two-phase procedure is proposed here to test the association between a di-allelic SNP and a non-normal distributed quantitative trait when the genetic model is unknown. Simulation results show that the proposed TPP is more robust than the existing methods.
If there are covariates needed to be adjusted for, we can first regress on the covariates and use the residuals as the new outcome and then employ TPP to conduct the association study. The detailed simulation results are presented in Additional file 1. Besides the truncated generalized extreme value distributional (a heavy-tailed distribution) error term with the truncation point 0, we also consider the error term following the centralized t distribution and general generalized extreme value distribution, respectively. The results are given in Additional file 1, where the similar results are observed.
Declarations
Acknowledgments
Q. Li was supported in part by the National Science Foundation of China, Grant No. (11371353, 61134013) and the Breakthrough Project of Strategic Priority Program of the Chinese Academy of Sciences, Grant No. XDB13040600. The authors thank Dr. Aiyi Liu of National Institute of Child Health and Human Development (NICHD), National Institutes of Health (NIH), for his helpful comments. We also thank the Editor, Associate Editor and three anonymous reviewers for their careful reading and insightful comments, which greatly improved our manuscript.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License(http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
Authors’ Affiliations
References
- Lango AH, Estrada K, Lettre G, Berndt SI, Weedon MN, Rivadeneira F, et al. Hundreds of variats clustered in genomic loci and biological pathways affect human height. Nature. 2010; 467:832–8.View ArticleGoogle Scholar
- Perry JR, Day F, Elks CE, Sulem P, Thompson DJ, Ferreira T, et al. Parent-of-origin-specific allelic associations among 106 genomic loci for age at menarche. Nature. 2014; 514(7520):92–7.PubMedPubMed CentralView ArticleGoogle Scholar
- Locke AE, Kahali B, Berndt SI, Justice AE, Pers TH, Day FR, et al. Genetic studies of body mass index yield new insights for obesity biology. Nature. 2015; 518(7538):197–206.PubMedPubMed CentralView ArticleGoogle Scholar
- Turpeinen H, Seppälä I, Lyytikäinen LP, Raitoharju E, Hutri-Kähönen N, Levula M, et al. A genome-wide expression quantitative trait loci analysis of proprotein convertase subtilisin/kexin enzymes identifies a novel regulatory gene variant for FURIN expression and blood pressure. Hum Genet. 2015; 134:627–636.PubMedView ArticleGoogle Scholar
- Drinkwater NR, Klotz JH. Statistical methods for the analysis of tumor multiplicity data. Cancer Res. 1981; 41:113–9.PubMedGoogle Scholar
- Chen H, Lumley T, Brody J, Heard-Costa NL, Fox CS, Cupples LA, Dupuis J. Sequence kernel association test for survival traits. Genet Epidemiol. 2014; 38:191–7.PubMedPubMed CentralView ArticleGoogle Scholar
- Kruskal WH, Wallis WA. Use of ranks in one-criterion variance analysis. J Am Stat Assoc. 1952; 47:583–621.View ArticleGoogle Scholar
- Jonckheere A. A distribution-free k-sample test against ordered alternatives. Biometrika. 1954; 41:133–45.View ArticleGoogle Scholar
- Terpstra TJ. The asymptotic normality and consistency of Kendalls test against trend, when ties are present in one ranking. Indagationes Mathematicae. 1952; 14:327–33.View ArticleGoogle Scholar
- Zhang W, Li Q. Nonparametric risk and nonparametric odds in quantitative genetic association studies. Sci Rep-UK. 2015; 5:12105.View ArticleGoogle Scholar
- Lin Y, Zhang M, Wang L, Pungpapong V, Fleet JC, Zhang D. Simultaneous genome-wide association studies of anti-cyclic citrullinated peptide in rheumatoid arthritis using penalized orthogonal-components regression. BMC Proc. 2009; 3. Suppl 7:S20.Google Scholar
- Black MH, Watanabe RM. A principal-components-based clustering method to identify multiple variants associated with rheumatoid arthritis and arthritis-related autoantibodies. BMC Proc. 2009; 3. Suppl 7:S129.Google Scholar
- Amos CI, Chen WV, Seldin MF, Remmers EF, Taylor KE, Criswell LA, et al. Data for Genetic Analysis Workshop 16 Problem 1, association analysis of rheumatoid arthritis data. BMC Proc. 2009; 3. Suppl 7:S2.Google Scholar
- Zheng G, Wu CO, Kwak M, Jiang W, Joo J, Lima JAC. Joint analysis of binary and quantitative traits with data sharing and outcome-dependent sampling. Genet Epidemiol. 2012; 36:263–73.PubMedView ArticleGoogle Scholar
- Li Q, Yu K. Improved correction for population stratification in genome-wide association studies by identifying hidden population structures. Genet Epidemiol. 2008; 32(3):215–26.PubMedView ArticleGoogle Scholar
- Burton PR, Clayton DG, Cardon LR, Craddock N, Deloukas P, Duncanson A, et al. Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature. 2007; 447:661–78.View ArticleGoogle Scholar