 Methodology Article
 Open Access
 Published:
An efficient genomewide association test for multivariate phenotypes based on the Fisher combination function
BMC Bioinformatics volume 17, Article number: 19 (2016)
Abstract
Background
In genomewide association studies (GWAS) for complex diseases, the association between a SNP and each phenotype is usually weak. Combining multiple related phenotypic traits can increase the power of gene search and thus is a practically important area that requires methodology work. This study provides a comprehensive review of existing methods for conducting GWAS on complex diseases with multiple phenotypes including the multivariate analysis of variance (MANOVA), the principal component analysis (PCA), the generalizing estimating equations (GEE), the traitbased association test involving the extended Simes procedure (TATES), and the classical Fisher combination test. We propose a new method that relaxes the unrealistic independence assumption of the classical Fisher combination test and is computationally efficient. To demonstrate applications of the proposed method, we also present the results of statistical analysis on the Study of Addiction: Genetics and Environment (SAGE) data.
Results
Our simulation study shows that the proposed method has higher power than existing methods while controlling for the type I error rate. The GEE and the classical Fisher combination test, on the other hand, do not control the type I error rate and thus are not recommended. In general, the power of the competing methods decreases as the correlation between phenotypes increases. All the methods tend to have lower power when the multivariate phenotypes come from long tailed distributions. The real data analysis also demonstrates that the proposed method allows us to compare the marginal results with the multivariate results and specify which SNPs are specific to a particular phenotype or contribute to the common construct.
Conclusions
The proposed method outperforms existing methods in most settings and also has great applications in GWAS on complex diseases with multiple phenotypes such as the substance abuse disorders.
Background
In the past decade, genomewide association studies (GWAS) have produced rich singlenucleotide polymorphism (SNP) data available to researchers. Among them, the large scale studies including the HapMap project [1] and the 1000 Genomes project [2] have provided publicly accessible databases of reference ancestral populations for imputation and quality control purposes. The idea of GWAS is to conduct fast SNPbased association tests to scan the whole genome using casecontrol samples. Yet, many complex diseases such as mental health disorders may have multiple phenotypic traits with continuous outcomes [3]. This pleiotropy in complex traits [4] provides several potential advantages to the direct modeling of pleiotropic associations. First, a model search for loci that are simultaneously associated with multiple phenotypes would likely have higher power than a model search that only considers each phenotype individually. Second, more exact modeling may yield more accurate prediction of either or both phenotypes. Third, pleiotropic genes may tend to have a more central role in the relevant functional pathways.
Existing statistical methods for complex diseases with multivariate phenotypes can be categorized into three types of approaches. The first approach is to conduct a GWAS for each marginal phenotype and then aggregate the results. The major issue with this approach is that it does not make use of the correlation structure among phenotypes. The second approach is to summarize multiple phenotypic traits into a composite score and then conduct a GWAS on the score. This approach, however, may have difficulty in identifying proper summary scores. The third approach involves multiple phenotypic traits simultaneously. Thus, it may gain power as well as avoid the issue of multiple testing. However, it is based on stronger assumptions that may not be satisfied in some practical settings.
In this study, we provide a comprehensive review of existing statistical methods for conducting GWAS on complex diseases with multiple phenotypic traits. We also propose a new statistical method based on the Fisher combination function. The performance of competing methods is evaluated by a simulation study. In order to demonstrate applications of the proposed method, we conduct statistical analysis on the database of the Study of Addiction: Genetics and Environment (SAGE).
Methods
Let X _{ i }(=0,1,2) be the number of reference alleles corresponding to a candidate SNP and Y _{ i }=(Y _{ i1},…,Y _{ im })^{′} be the measures of multiple phenotypes for the individual i. In this study, we conduct a comprehensive review of existing statistical methods that can be used to test the association between X _{ i } and Y _{ i }.
Existing methods
Multivariate analysis of variance (MANOVA)
When the phenotype is univariate, we can use the oneway analysis of variance (ANOVA) with three levels of the genotype for GWAS. When we have correlated multivariate phenotypic traits, the natural extension of the oneway ANOVA is the oneway multivariate analysis of variance (MANOVA) [5]. Similar to ANOVA, MANOVA tests the equality of mean phenotypic vectors by comparing the within genotypes and between genotypes variancecovariance matrices. The strength of MANOVA is that the multivariate normal distribution provides many good statistical properties for testing and estimation [6]. However, in practice, multivariate phenotype data are very unlikely to meet the multivariate normal assumption. Furthermore, MANOVA is most powerful when the phenotypes are negatively correlated and yet this situation is also unlikely in practice, especially when the number of phenotypes is larger than 2. With respect to its relevant applications, this method has been used in GWAS on doseresponse [7] and facial morphology [8].
Principal component analysis (PCA)
The principal component analysis (PCA) [9, 10] is another classical statistical method for multivariate analysis. The primary objective of PCA is to find a small set of linear combinations of the original variables (i.e. principal components) that account for the most variability in the original variables. Thus, it can be employed to reduce the dimension of multivariate phenotypes. The PCA has been used in genebased studies to increase the power of statistical testing [11, 12]. Furthermore, He et al. [13] has used PCA to combine four highly correlated obesity phenotypes for a whole genome linkage scan. When the phenotypes are highly correlated, the first principal component (corresponding to the largest eigenvalue) contains most information about the phenotype data. Thus, testing the association between a SNP and the first principal component is a commonly adopted approach to effectively change the multivariate setting associated with multiple phenotypes in GWAS to the univariate setting (e.g. Zhang et al. [14] and Karasik et al. [15]). In this study, we investigate the statistical properties of this approach.
Generalized estimating equations (GEE)
The method of generalized estimating equations (GEE) [16] was developed for analyzing correlated multivariate outcomes primarily from longitudinal studies. It can be applied to test the association between a candidate SNP and multivariate phenotypes. The GEE only requires specification of the link function and the working correlation matrix. The former depends on the measurement scale of the outcomes (e.g. the identify link for continuous outcomes). The latter assumes the correlation structure among multivariate outcomes. The estimation of GEE is usually robust against this assumption. GEE was widely used in GWAS. For example, GEE was proposed as one of the multivariate approaches in Solovieff et al. [4]. For another example, Liu et al. [17] proposed to use GEE for bivariate association analyses for the mixture of continuous and binary traits. However, to the best of our knowledge, none of the existing studies have conducted simulations to investigate whether GEE can control the type I error rate when multivariate traits are involved in GWAS. To fill in this knowledge gap, we conduct a simulation study to examine the statistical properties of this approach. In this study, we only consider the identity link because we are mainly interested in continuous phenotypic traits. We also assume the working correlation matrix to be compound symmetry because it only requires us to estimate one additional parameter.
Traitbased association test involving the extended Simes procedure
Recently, van der Sluis et al. [18] developed a traitbased association test involving the extended Simes procedure (TATES). The TATES calculates a global pvalue based on individual pvalues of association tests for marginal phenotypes. Specifically, for mvariate phenotypic traits, one can conduct m tests of the association between a candidate SNP and each marginal phenotypic trait and derive m pvalues: p _{1},…p _{ m }. Let p _{(1)},…p _{(m)} be the ordered pvalues from the smallest to the largest. The Simes multiple procedure declares significance between a SNP and multivariate phenotypic traits at the α level if any of the pvalues satisfy p _{(j)}<j α/m [19]. Hence, the global pvalue based on the Simes procedure is p _{traits}= min{m p _{(j)}/j,j=1,…,m}. The TATES improves this procedure by replacing m and j with the effective number of independent traits, m _{ e } and j _{ e }, which are estimated from the eigenvalues of the correlation matrix [20, 21]. Since m _{ e }≤m, this new adjusted global pvalue, defined as p _{traits}= min{m _{ e } p _{(j)}/j _{ e },j=1,…,m} is smaller than the Simes global pvalue. Therefore, the TATES is more powerful than the Simes. A simulation study also showed that the TATES is more powerful than MANOVA. In this study, we conduct a comprehensive simulation study to compare this method with not only the classical methods reviewed above but also the proposed methods.
Proposed methods
The methods based on the Fisher combination function
Combining independent tests of significance to form a join statistic has been used as an alternative approach to tackling complex multivariate location problems [22]. This approach is quite popular in practice because it is much easier to develop a univariate association test statistic than a multivariate association test statistic. Birnbaum [23] discussed various combination functions among which the Fisher combination function has been proven to be asymptotically Bahadur optimal [24, 25]. Thus, we focus on the Fisher combination function in this study.
Fisher combination test with the independence assumption
Based on the Fisher combination test [22], to test the association between a SNP and multivariate phenotypic traits, we only need to test the association between the SNP and each marginal phenotypic trait individually. Thus, for mmultivariate phenotypes, we have m marginal pvalues: p _{1},…,p _{ m }. The Fisher combination statistic is defined as
T is used to infer the association between the SNP and multivariate phenotypic traits. When the marginal pvalues are independent, the statistic T follows a chisquared distribution with 2m degrees of freedom so the pvalue of T can be obtained straightforwardly. In reality, however, the phenotypic traits are always correlated so the chisquared distribution with 2m degrees of freedom tends to underestimate the variance of the T statistic. The resulting chisquared test is, therefore, too liberal.
The permutation method
Because of the negative consequence of the independence assumption, it is desirable to conduct the Fisher combination test without the assumption. Ideally, we could calculate the exact pvalue of the T statistic in Eq. (1) using the permutation method, which does not require the unrealistic assumption and also controls for the type I error [26]. Yet, the permutation method is a very timeconsuming procedure, particularly in a genomewide context. Thus, an improvement in computational efficiency is warranted.
The proposed efficient method
For correlated phenotypic traits, T is the sum of dependent chisquared statistics. Brown [27] and Yang [28] have shown that, under the null hypothesis of no association between a SNP and multivariate phenotypic traits, the distribution of T statistic follows a scale chisquared distribution \((\gamma {\chi _{v}^{2}})\), or equivalently, a gamma distribution with the shape parameter v/2 and the scale parameter 2γ. Therefore, to calculate the global pvalue of T statistic, we only need to estimate the parameters v and γ. Suppose that the mean of T is μ and the variance of T is σ ^{2}. Using the first and second moments of T, the values of v and γ can be calculated as v=2μ ^{2}/σ ^{2} and γ=σ ^{2}/(2μ). The following are technical details of the derivation of the mean and variance of T statistic when the marginal pvalues are based on twosided tests (see Brown [27] and Yang [28] for the case of onesided marginal tests):
Without loss of generality, we assume that the association test statistic for the jth phenotypic trait is z _{ j } where j=1,…,m. The corresponding twosided pvalue is defined as p _{ j }=2Φ(−z _{ j }), where Φ is the standard Gaussian distribution function. Under the null hypothesis of no association between a SNP and multivariate phenotypic traits, the distribution of T statistic is approximated by a Gaussian distribution with the mean of T as
and the variance of T as
Therefore, in order to calculate the variance of T, we need to calculate the covariance for each pair (j,k) which can be expressed as
where F is the standard bivariate Gaussian distribution. Let
Thus, δ _{ jk } is a function of the correlation between z _{ j } and z _{ k }: ρ _{ j,k }. We explore the relationship between δ _{ jk } and ρ _{ j,k } by calculating δ _{ jk } numerically for the values of ρ _{ j,k } from 0.99 to 0.99 with the step of 0.01. The results are shown in Fig. 1. Since the curve of Fig. 1 is a convex curve symmetric about the yaxis, we can approximate the relationship between δ _{ jk } and ρ _{ j,k } using a tenthorder polynomial:
Using the adapt function in the R package fCopulae [29], we obtained the following estimates: c _{1}=3.9081,c _{2}=0.0313,c _{3}=0.1022,c _{4}=−0.1378 and c _{5}=0.0941; the maximum residual was less than 0.0001.
To estimate δ _{ jk } in Eq. (2) accurately, we have taken two steps to remove potential biases. First, since the sample correlation \(\hat {\rho }_{\text {\textit {j,k}}}\) is not an unbiased estimator of ρ _{ j,k } [30], we estimate ρ _{ j,k } by the biascorrected sample correlation \(\hat {r}_{\text {\textit {j,k}}}\):
where n is the samples size used to calculate \(\hat {\rho }_{\text {\textit {j,k}}}\). Now, let’s define the right hand side of Eq. (2) as
The estimate \(f(\hat {r}_{\text {\textit {j,k}}})\) is still a biased estimator of δ _{ jk }. Thus, we propose a second step to remove the bias. Using the Taylor series expansion, we can estimate the bias as
Therefore, the proposed unbiased estimator of δ _{ jk } is
Hence, based on Eqs. (3) and (4), the variance of T can be estimated as
Given the proposed estimators of μ and σ ^{2}, the global pvalue of T statistic can be computed efficiently using the gamma distribution function as follows:
where Γ(v/2,2γ) is the gamma distribution function with the shape parameter v/2 and the scale parameter 2γ.
In this study, we compare two alternative methods to calculate \(\hat {\rho }_{\text {\textit {j,k}}}\): the Pearson sample correlation coefficient and the rank correlation coefficient of Kendall’s τ. Kendall and Gibbons [31] have shown the relation between ρ and τ as
Thus, we can use Kendall’s τ to derive \(\hat {\rho }_{\text {\textit {j,k}}}\) as \(\sin \left (\frac {\pi \hat {\tau }_{\text {\textit {j,k}}}}{2}\right)\). In the simulation study, we evaluate the robustness of Kendall’s τ in comparison to Pearson’s sample correlation coefficient.
Results
Simulation study
We conducted a simulation study to evaluate the performance of the four methods reviewed (MANOVA, PCA, GEE, and TATES) as well as the methods based on the Fisher combination function. In the simulation, we adopted four different approaches to calculating the pvalue of the Fisher combination test:

1.
FC\(\chi _{2m}^{2}\): the chisquared distribution with 2m degrees of freedom under the independence assumption.

2.
FCPermutation: the permutation method based on 1000 permutes.

3.
FCPearson: the proposed method with the correlation \(\hat {\rho }_{\text {\textit {j,k}}}\) being estimated by the Pearson’s sample correlation coefficient.

4.
FCKendall: the proposed method with \(\hat {\rho }_{\text {\textit {j,k}}}\) being estimated by the Kendall’s τ.
Simulation configurations
For each subject, the relationship between the SNP, x, and the multivariate phenotypes, Y, was defined as
where β is the effect of the SNP on the phenotypes and ε is the error term. In this simulation study, we simulated x based on a minor allele frequency of 0.25. We evaluated the performance of competing methods under three different settings of effect sizes: Null hypothesis (no effects) β=(0,0,0,0,0)^{′}. Moderate equal effect sizes β=(0.3,0.3,0.3,0.3,0.3)^{′}. Varied effect sizes β=(0.1,0.2,0.3,0.4,0.5)^{′}.
For the error term, we considered two cases. In the first case, the error term was simulated from a multivariate normal distribution with the mean 0 and the variancecovariance Σ, which is a compound symmetry matrix with the value of 1 on the diagonal and the value of ϱ=0,0.25,0.5, or 0.75 on the offdiagonal. In the second case, the error term was simulated from a mixture of two multivariate normal distributions: 90 % from the same multivariate normal distribution in the first case and 10 % from the multivariate normal distribution with the mean 0 and the variancecovariance matrix 5Σ. The purpose of the second case was to simulate long tailed distributions of phenotypic traits which are common in real data. We generated simulated data of 100 subjects under each configuration. In addition, each configuration was repeated 10,000 times. The nominal type I error rate was set at 0.05 and the power was calculated as the proportion of pvalues less than 0.05.
Simulation results
Table 1 presents the simulation results when the multivariate phenotypes come from a multivariate normal distribution with the value of the correlation ϱ varied from 0 to 0.75. The numbers in each cell are the mean (standard deviation) of the indicator variable for pvalue <0.05 among the 10,000 replications. The top panel corresponds to the case of β=(0,0,0,0,0)^{′} (i.e. when the null hypothesis is true) and thus can be used to evaluate if each of the competing methods was able to control the type I error. The results indicate that GEE and the Fisher combination test with \(\chi _{2m}^{2}\) did not control the type I error rate when ϱ>0 while all the other methods did quite well under all values of ϱ. We, thus, did not find it meaningful to further compare these two methods with the other methods in terms of the statistical power.
The middle panel of Table 1 compares the power of competing methods under the situation that the SNP has the same level of association with each of the phenotypic traits: β=(0.3,0.3,0.3,0.3,0.3)^{′}. The power of MANOVA decreased rapidly as the correlation ϱ increased. When ϱ=0.75, for instance, all the other methods had the power of at least 0.4 but the power of MANOVA was only 0.2. Further, PCA and TATES had higher power than MANOVA when ϱ>0. Yet, none of these three methods can beat the three Fisher combination tests (FCPermutation, FCPearson, and FCKendall) that performed equally well.
The bottom panel of Table 1 evaluates the performance of competing methods under the situation that the strength of association between the SNP and phenotypic traits varies from 0.1 to 0.5. Similar to the previous situation, the three Fisher combination tests had almost identical performance and their power decreased as the correlation ϱ increased. Furthermore, the Fisher combination tests had higher power than the other three methods (MANOVA, PCA, and TATES) in most conditions. MANOVA only beat the Fisher when ϱ=0.75; TATES had higher power than the Fisher when ϱ≥0.5.
Table 2 shows the simulation results when the multivariate phenotypes come from a mixture of two multivariate normal distributions. In comparison to the corresponding settings under the multivariate normal distributions in Table 1, all the competing methods tended to have lower power under these long tailed distributions. Yet, Table 2 demonstrates similar patterns to the ones observed in Table 1 in general.
Real data analysis
The Study of Addiction: Genetics and Environment(SAGE)
The National Center for Biotechnology Information (NCBI) has been managing and distributing the large database of Genotypes and Phenotypes (dbGaP) for scientific investigation of various human diseases [32]. In order to demonstrate the application of the proposed method, we conducted statistical analysis on the Study of Addiction: Genetics and Environment (SAGE) data [33], http://www.ncbi.nlm.nih.gov/projects/gap/cgibin/study.cgi?study_id=phs000092.v1.p1. The institutional review board of the University of Michigan has approved this secondary data analysis project (HUM00084927). The SAGE is a casecontrol study that aggregated together the data from three large scale studies in the substance abuse field: the Collaborative Study on the Genetics of Alcoholism (COGA), the Family Study of Cocaine Dependence (FSCD), and the Collaborative Genetic Study of Nicotine Dependence (COGEND). The total number of individuals with individual level data available is 4121. Each individual was genotyped using the Illumina Human 1MDuo beadchip which contains over 1 million SNP markers.
We selected unrelated individuals that passed the quality control measures according to the Gene Environment Association Studies Initiative (GENEVA) quality control report. The final number of unrelated individuals is 3,741 (1,732 male, 2,079 female) and the total number of SNP markers is 917,694. Because the purpose of our analysis is to identify the genes that are associated with addiction, we used the symptomatology variables of four highly comorbid substance use disorders as the phenotype outcomes: the number of alcohol dependence symptoms endorsed (alc_sx_tot), the number of nicotine dependence symptoms endorsed (nic_sx_tot), the number of marijuana dependence symptoms endorsed (mj_sx_tot), and the number of cocaine dependence symptoms endorsed (coc_sx_tot).
SAGE data analysis results
The values of phenotype variables range from 0 to 7. Figure 2 shows the frequency distributions of the 4 phenotype variables. Since they are not normally distributed, we calculated their correlations using the Kendall rank correlation. Table 3 shows moderate correlations ranged from 0.34 to 0.51.
For each trait, we conducted a genomewide association test using the hurdle model [34] because of the discrete nature and excess zero values associated with the symptom counts. The hurdle regression model assumes the observed data are generated from two processes: one generates zero and the other generates positive values. Since our interest is in the severity of symptomatology, the pvalues from the positive component of the hurdle model were used for further analysis. For each phenotype of addiction, the estimated pvalues are summarized using QQplots in Fig. 3. The diagonal straight lines have the slope 1 and intercept 0. When the curve of the pvalues deviate far away from the diagonal line, it indicates that there are many SNPs significantly associated with the corresponding phenotype trait. By examining the 4 plots in Fig. 3, we obtain the following two findings: (1) For the nicotine symptoms, the pvalues fall on the diagonal line and this indicates that there is no SNP associated with nicotine symptoms; (2) For the symptoms of the other three substances, all the pvalues deviate from the diagonal lines (except for 0 and 1), with the pvalues of alcohol symptoms furthest away from the diagonal line and the pvalues of the remaining two symptoms closer to the diagonal line. Since the four symptomatology variables are moderately correlated and they all measure the common construct of addiction, we used them as the multivariate phenotype and applied the proposed Fisher combination approach (FCKendall) to identify the SNPs associated with it. The QQplot of the pvalues for this multivariate analysis is shown in Fig. 4 indicating that some SNPs are associated with addiction across substances.
To identify the SNPs associated with the phenotypes, we adopted the commonly used significance level for GWAS of 10^{−6} to account for multiplicity. Based on the results of the marginal tests, the numbers of SNPs identified to be associated with the individual phenotypes are 917 for alcohol dependence symptoms, 0 for nicotine symptoms, 9 for marijuana symptoms, and 0 for cocaine symptoms. Using the proposed Fisher combination method (FCKendall), on the other hand, we identified 6 SNPs associated with the multivariate phenotype of addiction. Among them, 5 SNPs were also identified by the marginal test for alcohol symptoms. This implies that if we ignore the correlations among the 4 phenotypic traits and conduct the marginal tests, we would identify many SNPs that may be specific to alcohol dependence. Thus, if our goal is to identify the genes associated with the construct of addiction that contributes to the 4 types of substance dependence symptomatology, the proposed method is a better approach.
Discussion
In GWAS for complex diseases, the association between a SNP and each phenotype is usually weak. Combining multiple related phenotypic traits can increase the power of gene search and thus is a practically important area that requires methodology work. This study provides a comprehensive review of existing methods for conducting GWAS on complex diseases with multiple phenotypes including MANOVA, PCA, GEE, TATES, and the classical Fisher combination test. Built upon the Fisher combination test, we proposed a new method that relaxes the unrealistic independence assumption and is also computationally efficient. Particularly, in an exploratory study where multiple sets of phenotypes may be of interest, when the set is changed, our proposed methods only require recalculation of the correlation between phenotypes and then the available marginal pvalues for each SNP can be reused. The competing methods which do not involve marginal pvalues such as the PCA, MANOVA, and GEE, on the other hand, would require a complete reanalysis.
We conducted a simulation study to compare the performance of the competing methods. The GEE and the Fisher combination test with the independence assumption did not control the type I error rate and thus are not recommended. In general, the power of the methods decreased as the correlation between phenotypes increased. Furthermore, all the competing methods tended to have lower power when the multivariate phenotypes come from long tailed distributions. The proposed method (with the correlation being estimated by the Pearson’s sample correlation coefficient or the Kendall’s τ) performed as well as the permutation method and yet only required 10^{−2} computational time. In most settings of the simulation, these three Fisher combination tests outperformed the other methods. The real data analysis also demonstrated that the Fisher combination tests allow us to compare the marginal results with the multivariate results and specify which SNPs are specific to a particular phenotype or contribute to the common construct.
In our simulation study, we only considered continuous multivariate phenotypes. Future studies may extend the methodology work to the case of correlated discrete phenotypes. For example, in the substance abuse field, many outcomes are zeroinflated count data [35] or ordinal data [36]. A future direction that is particularly challenging is how to analyze multivariate phenotypes with different measurement scales.
References
 1
The International HapMap Consortium. The international hapmap project. Nature. 2003; 426(6968):789–96.
 2
The 1000 Genomes Project Consortium. A map of human genome variation from populationscale sequencing. Nature. 2010; 467(7319):1061–73.
 3
McGue M, Zhang YW, Miller MB, Basu S, Vrieze S, Hicks B, et al. A genomewide association study of behavioral disinhibition. Behav Genet. 2013; 43(5):363–73.
 4
Solovieff N, Cotsapas C, Lee PH, Purcell SM, Smoller JW. Pleiotropy in complex traits: challenges and strategies. Nat Rev Genet. 2013; 14(7):483–95.
 5
Smith H, Gnanadesikan R, Hughes JB. Multivariate analysis of variance (manova). Biometrics. 1962; 18(1):22–41.
 6
Morrison DF. Multivariate Statistical Methods. McGrawHill series in probability and statistics. New York: McGrawHill; 1967.
 7
Brown CC, Havener TM, Medina MW, Krauss RM, McLeod HL, MotsingerReif AA. Multivariate methods and software for association mapping in doseresponse genomewide association studies. Biodata Mining. 2012; 5:21.
 8
Liu F, van der Lijn F, Schurmann C, Zhu G, Chakravarty MM, Hysi PG, et al. A genomewide association study identifies five loci influencing facial morphology in europeans. Plos Genetics. 2012; 8(9):1002932.
 9
Hotelling H. Analysis of a complex of statistical variables into principal components. J Educ Psychol. 1933; 24:417–41.
 10
Jolliffe IT. Principal Component Analysis, 2nd edn. New York: Springer; 2002.
 11
Brown CC, Havener TM, Medina MW, Krauss RM, McLeod HL, MotsingerReif AA. Principal components regression: Multivariate, genebased tests in imaging genomics. Biomedical Imaging: From Nano to Macro, 2011 IEEE International Symposium on. 2011;:289–293.
 12
Wang K, Abbott D. A principal components regression approach to multilocus genetic association studies. Genet Epidemiol. 2008; 32(2):108–18.
 13
He LN, Liu YJ, Xiao P, Zhang L, Guo Y, Yang TL, et al. Genomewide linkage scan for combined obesity phenotypes using principal component analysis. Ann Hum Genet. 2008; 72:319–26.
 14
Zeger SL, Zeger SL, Zhang F, Guo X, Wu S, Han J, et al. GenomeWide Pathway Association Studies of Multiple Correlated Quantitative Phenotypes Using Principle Component Analyses. Plos One. 2012; 7(12):e53320.
 15
Zeger SL, Karasik D, Cheung CL, Zhou Y, Cupples LA, Kiel DP, et al. J Bone Miner Res. 2012; 27(2):319–30.
 16
Zeger SL, Liang KY, Albert PS. Models for longitudinal data  a generalized estimating equation approach. Biometrics. 1988; 44(4):1049–60.
 17
Liu J, Pei Y, Papasian CJ, Deng H. Bivariate Association Analyses for the Mixture of Continuous and Binary Traits With the Use of Extended Generalized Estimating Equations. Genet Epidemiol. 2009; 33(3):217–27.
 18
van der Sluis S, Posthuma D, Dolan CV. Tates: Efficient multivariate genotypephenotype analysis for genomewide association studies. PLoS Genet. 2013; 9(1):1003235.
 19
Simes RJ. An improved bonferroni procedure for multiple tests of significance. Biometrika. 1986; 73(3):751–4.
 20
Galwey NW. A new measure of the effective number of tests, a practical tool for comparing families of nonindependent significance tests. Genet Epidemiol. 2009; 33(7):559–68.
 21
Li MX, Yeung JMY, Cherny SS, Sham PC. Evaluating the effective numbers of independent tests and significant pvalue thresholds in commercial genotyping arrays and public imputation reference datasets. Hum Genet. 2012; 131(5):747–56.
 22
Fisher RA. Statistical Methods for Research Workers. Edinburgh: Oliver & Boyd; 1932.
 23
Birnbaum A. Combining independent tests of significance. J Am Stat Assoc. 1954; 49(267):559–74.
 24
Littell RC, Folks JL. Asymptotic optimality of fishers method of combining independent tests. J Am Stat Assoc. 1971; 66(336):802.
 25
Littell RC, Folks JL. Asymptotic optimality of fishers method of combining independent tests ii. J Am Stat Assoc. 1973; 68(341):193–4.
 26
Good P. Permutation, Parametric and Bootstrap Tests of Hypotheses. 3rd edn, Springer Series in Statistics. New York: Springer; 2005.
 27
Brown MB. Method for combining nonindependent, onesided tests of significance. Biometrics. 1975; 31(4):987–92.
 28
Yang JJ. Distribution of fisher’s combination statistic when the tests are dependent. J Stat Comput Simul. 2010; 80(1–2):1–12.
 29
R Core Team. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing; 2014. R Foundation for Statistical Computing. http://www.Rproject.org.
 30
Olkin I, Pratt JW. Unbiased estimation of certain correlationcoefficients. Ann Math Stat. 1958; 29(1):201–11.
 31
Kendall M, Gibbons JD. Rank Correlation Methods, 5th edn. London: Oxford; 1990.
 32
Mailman MD, Feolo M, Jin Y, Kimura M, Tryka K, Bagoutdinov R, et al. The ncbi dbgap database of genotypes and phenotypes. Nat Genet. 2007; 39(10):1181–6.
 33
Bierut LJ, Agrawal A, Bucholz KK, Doheny KF, Laurie C, Pugh E, et al. A genomewide association study of alcohol dependence. Proc Natl Acad Sci U S A. 2010; 107(11):5082–7.
 34
Mullahy J. Specification and Testing of Some Modified Count Data Models. J Econ. 1986; 33(3):341–65.
 35
Buu A, Li RZ, Tan XM, Zucker RA. Statistical models for longitudinal zeroinflated count data with applications to the substance abuse field. Stat Med. 2012; 31(29):4074–86.
 36
Hedeker D, Mermelstein RJ. Analysis of longitudinal substance use outcomes using ordinal randomeffects regression models. Addiction. 2000; 95:381–94.
Acknowledgement
This research was supported by National Institutes of Health (NIH) grants, R01 AI079139 & R01 HL118267 to L. K. Williams, and K01 AA016591 & R01 DA035183 to A. Buu. The funding agencies had no role in study design, analysis, interpretation of results, decision to publish, or preparation of the manuscript. The content is solely the responsibility of the authors and does not necessarily represent the official views of the NIH. We also acknowledge the Study of Addiction: Genetics and Environment (SAGE), which is part of the Gene Environment Association Studies (GENEVA) initiative supported by the National Human Genome Research Institute (dbGaP study accession phs000092.v1.p1).
Author information
Additional information
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
JJY and JL conducted the literature review and simulation study; LWK and AB provided funding and obtained the approval of using the data; JJY, LKW, and AB performed real data analysis and interpreted the results; JJY and AB wrote the manuscript with input from the other authors. All authors read and approved the final manuscript.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Received
Accepted
Published
DOI
Keywords
 Genomewide association study
 Fisher combination function
 Multivariate permutation
 Principal component analysis