Genetic data collected in case-control or cohort studies of well-defined disease phenotypes can be used to identify genetic variants, typically single nucleotide polymorphisms (SNPs), associated with disease status. In a genetic association study, the data consist of genotypes and corresponding phenotypes from large numbers of individuals with the disease phenotype of interest (cases) and disease-free controls. A significant difference in the frequency of a genetic variant between the case and control groups is taken to be indicative of an association between that variant (or, more probably, a nearby causal variant correlated with the typed variant) and the phenotype of interest. A framework also exists for analysing continuous phenotypes. Once one or more associated variants have been identified, they can be included in predictive models for the estimation of disease risk in individuals for whom the (potentially future) disease status is unknown. Technological developments, including commercially available chips for typing millions of genetic variants simultaneously and next-generation sequencing technologies such as those used in the 1000 Genomes Project [1], are enabling the collection of large amounts of genetic data from large numbers of individuals. Analysis of contemporary genetic data sets therefore involves the study of high-dimensional data.

A number of statistical approaches have successfully been used to investigate the strength of association between genetic variants and a phenotype of interest. These methods include testing the significance of the association of each variant with the phenotype independently using standard univariate tests (such as the Cochran-Armitage test for trend [2] or Fisher's exact test). However, there are disadvantages to relying on univariate methods for the analysis of genetic data. The data from a genome-wide association study typically consist of thousands or millions of SNPs, and this large number of predictors introduces a multiple testing problem. Univariate methods therefore require stringent significance thresholds to prevent a high false-positive rate, owing to the large number of tests undertaken [3]. Further, univariate methods fail to take into account the combined effect of multiple SNPs. It is plausible that many genetic variants together contribute to the phenotype being studied [4], and one advantage of multivariate methods is that they allow for the study of the combined effect of multiple SNPs. Multivariate methods also allow for the control of confounding variables, as in the lung cancer replication study in this paper, where gender, smoking status and age were included as unpenalised covariates in the model.

Using multivariate methods, the regression model can be extended to include, for example, interaction or higher-order terms, and in such a case a penalised regression approach would be appropriate [5]. Penalised regression methods have been applied to genetic data [6, 7]. Among a number of regression approaches used for prediction in high-dimensional data, ridge regression has been shown to perform best in terms of prediction error [8]. Ridge regression has successfully been used to analyse genetic data in which SNPs were in high linkage disequilibrium (LD) [9]; it is the test of significance used by Malo, Libiger & Schork that we evaluate here.

To begin, we consider two regression models commonly used in the analysis of genetic data: the linear and the logistic regression models, as follows.

The standard linear regression model is given by

$$\mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon},$$

where **Y** is an (*n* × 1) vector of dependent variables *Y*_{i}, *i* = 1, ..., *n*, and **X** is an (*n* × *m*) matrix of predictors. **β** is an (*m* × 1) vector of regression coefficients *β*_{j}, *j* = 1, ..., *m*, and **ε** is an (*n* × 1) vector of normally distributed random errors, with *ε*_{i} ~ N(0, *σ*^{2}). An example would be a model of the relationship between a continuous phenotype (such as blood pressure or plasma lipid concentration) measured in *n* individuals, and the genotype of these *n* individuals at *m* SNPs.

The ordinary least squares (OLS) estimator for **β** is given by

$$\hat{\boldsymbol{\beta}} = (\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{Y}.$$

The significance of individual OLS regression coefficients $\hat{\beta}_{j}$ in a multiple regression model can be estimated using a Wald test. The test statistic is

$$T_{0} = \frac{\hat{\beta}_{j}}{\widehat{\mathrm{SE}}(\hat{\beta}_{j})},$$

where $\widehat{\mathrm{SE}}(\hat{\beta}_{j})$ is an estimate of the standard error of the *j*^{th} regression coefficient. Under the null hypothesis *β*_{j} = 0, *T*_{0} follows a Student *t* distribution with *n* − *m* degrees of freedom.
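As a minimal illustration of the test above (not code from the paper), the following sketch fits OLS on simulated data and computes the Wald *t*-statistic and two-sided *p*-value for each coefficient; all variable names and the simulation settings are our own choices.

```python
# Sketch: OLS estimation and Wald t-tests on simulated data (n > m).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, m = 200, 5
X = rng.normal(size=(n, m))
beta_true = np.array([1.0, 0.0, 0.5, 0.0, -0.8])  # two null coefficients
Y = X @ beta_true + rng.normal(size=n)

# OLS estimator: beta_hat = (X'X)^{-1} X'Y
XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ Y

# Residual variance estimate and standard errors of the coefficients
resid = Y - X @ beta_hat
sigma2_hat = resid @ resid / (n - m)
se = np.sqrt(sigma2_hat * np.diag(XtX_inv))

# Wald statistic T0 = beta_hat_j / SE(beta_hat_j); two-sided p-values
# from a Student t distribution with n - m degrees of freedom.
t0 = beta_hat / se
pvals = 2 * stats.t.sf(np.abs(t0), df=n - m)
```

With *n* = 200 observations, the truly associated predictors receive very small *p*-values while the null predictors do not, mirroring the single-coefficient test described above.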

Binary outcomes commonly arise in biomedical data where they may represent, for example, cases and controls. In the logistic regression model, **Y** is an *n*-dimensional vector of response variables taking values 0 (controls) or 1 (cases), and **X** the *n* × *m* matrix of explanatory variables, as before. For the *i*^{th} individual we denote

$$p_{i} = P(Y_{i} = 1 \mid \mathbf{X}_{i}).$$

The *i*^{th} response *Y*_{i} is a Bernoulli variable with probability of success equal to *p*_{i}. The logistic regression model relates the probability *p*_{i} that the *i*^{th} observation is a case to the predictor variables as

$$\log\left(\frac{p_{i}}{1 - p_{i}}\right) = \mathbf{X}_{i}\boldsymbol{\beta},$$

where **β** is a vector of parameters to be estimated.

The significance of individual logistic regression coefficients, $\hat{\beta}_{j}$, can be estimated using the test statistic

$$T_{0} = \frac{\hat{\beta}_{j}}{\widehat{\mathrm{SE}}(\hat{\beta}_{j})},$$

where $\widehat{\mathrm{SE}}(\hat{\beta}_{j})$ is an estimate of the standard error of the *j*^{th} regression coefficient. Under the null hypothesis that *β*_{j} = 0, *T*_{0} asymptotically follows a standard normal distribution.
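The logistic Wald test can be sketched as follows (again, illustrative code rather than the paper's implementation): the maximum likelihood estimate is obtained by Newton-Raphson iteration, and the standard errors come from the inverse of the Fisher information at the fitted coefficients.

```python
# Sketch: maximum-likelihood logistic regression via Newton-Raphson,
# with Wald z-tests on the fitted coefficients.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, m = 500, 3
X = rng.normal(size=(n, m))
beta_true = np.array([1.2, 0.0, -0.7])
p_true = 1 / (1 + np.exp(-(X @ beta_true)))
Y = rng.binomial(1, p_true)  # Bernoulli responses

beta_hat = np.zeros(m)
for _ in range(25):  # Newton-Raphson iterations
    p_i = 1 / (1 + np.exp(-(X @ beta_hat)))
    W = p_i * (1 - p_i)               # Bernoulli variances
    info = X.T @ (X * W[:, None])     # Fisher information
    score = X.T @ (Y - p_i)
    beta_hat = beta_hat + np.linalg.solve(info, score)

# Standard errors from the Fisher information at the final estimate;
# T0 = beta_hat_j / SE(beta_hat_j) is asymptotically N(0, 1) under H0.
p_i = 1 / (1 + np.exp(-(X @ beta_hat)))
W = p_i * (1 - p_i)
info = X.T @ (X * W[:, None])
se = np.sqrt(np.diag(np.linalg.inv(info)))
z = beta_hat / se
pvals = 2 * stats.norm.sf(np.abs(z))
```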

Genetic data sets often comprise more predictor variables, *m*, than observations, *n*. In this situation, unique maximum likelihood estimates of the regression parameters do not exist. Further, collinearity among the predictors, which arises from linkage disequilibrium (LD) in genetic data and typically increases with the density of available markers, results in unstable maximum likelihood estimates of the regression coefficients.

An extensive literature exists on the application of modified regression techniques to the analysis of high-dimensional data. Penalised regression constrains the magnitude of the estimated regression coefficients, allowing their estimation when ordinary least squares (OLS) estimates cannot be obtained. In a Bayesian context, these techniques are equivalent to the specification of a particular prior distribution on the coefficients. For example, Lasso regression [10] constrains the sum of the absolute value of the regression coefficients to be less than a constant. This is equivalent to imposing a double exponential prior centred at zero on the coefficients. Lasso regression can estimate some coefficients to be exactly zero, permitting dimension reduction in the model. Hoggart, Whittaker, De Iorio & Balding [6] considered a modified Lasso regression approach for the identification of causal SNPs in genome-wide or resequenced data, with the aim of identifying regions of association whilst considering all SNPs simultaneously. When SNPs are in high LD, their method offers improvement over both single-SNP analysis and Lasso regression in terms of the power to detect causal variants, and a notable improvement over single-SNP analysis in terms of false-positive rate.

Ridge regression [11] is another penalised regression approach, in which a penalty is applied to the sum of the squared parameter estimates. Ridge regression has been used in a number of large-scale data analysis scenarios, including marker-assisted selection [12], expression data analysis [13], and genetic association studies when SNPs are in high LD [14]. From a Bayesian viewpoint, ridge regression can be considered as standard multiple regression with the coefficient estimates having a prior distribution that is normal with mean zero and known variance [15]. In genetic epidemiology, it is desirable to estimate the strength of the association between a variant and a phenotype. This is problematic when using ridge regression which, unlike other penalised regression approaches, neither reduces the number of parameters in the model nor provides an estimate of the significance of each fitted coefficient.
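For concreteness, the linear ridge estimator is $\hat{\boldsymbol{\beta}}(\lambda) = (\mathbf{X}^{T}\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^{T}\mathbf{Y}$. The sketch below (our own illustration, with simulated collinear predictors standing in for SNPs in high LD) shows how the penalty shrinks the coefficient vector relative to OLS and stabilises the fit.

```python
# Sketch: the linear ridge regression estimator
# beta_hat(lambda) = (X'X + lambda*I)^{-1} X'Y
# applied to highly correlated predictors.
import numpy as np

rng = np.random.default_rng(2)
n, m = 100, 4
Z = rng.normal(size=(n, 1))
X = Z + 0.1 * rng.normal(size=(n, m))  # near-collinear columns ("high LD")
Y = X @ np.array([1.0, 1.0, 0.0, 0.0]) + rng.normal(size=n)

def ridge(X, Y, lam):
    """Ridge estimate for shrinkage parameter lam (lam = 0 gives OLS)."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ Y)

beta_ols = ridge(X, Y, 0.0)
beta_ridge = ridge(X, Y, 10.0)
```

The norm of the ridge solution is non-increasing in *λ*, so the penalised coefficients are always shrunk towards zero relative to OLS; note that all *m* coefficients remain in the model, which is why a separate test of significance is needed.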

A test of significance for coefficients estimated using ridge regression, based on an approximation of their distribution under the null hypothesis, was proposed by Halawa & El Bassiouni [16]. The test was originally developed and evaluated for data with continuous outcomes, with different methods used to compute the shrinkage parameter, *λ*, which controls the degree of shrinkage of the regression coefficients and hence their distribution under the null hypothesis. Malo, Libiger & Schork [9] used the same test in an evaluation of the applicability of ridge regression as a means of accommodating LD in association studies. They used the test in a comparison of the performance of ridge regression, multiple regression and single-SNP analysis when SNPs are in varying degrees of LD. They found that ridge regression identified different SNPs as associated with phenotype compared to single-SNP analysis or multiple regression. However, they did not consider the performance of the test itself.

In this paper, we evaluate the performance of a test of significance for ridge regression coefficients. Our test is based on the test proposed by Halawa & El Bassiouni [16]. We extend the test, making it applicable in the *m* > *n* scenario that is common in contemporary genetic data sets. We evaluate the performance of the test in simulation studies, using scenarios representative of realistic high-density genetic marker data, considering a range of data set dimensions and degrees of shrinkage.

Ridge regression has also been applied in the logistic regression framework [17]. We extend the test proposed by Halawa & El Bassiouni [16] to the logistic ridge regression model, and again evaluate the test in a range of simulation scenarios at different values of *λ*.

In both linear and logistic ridge regression, we compare the approximate test of significance to a permutation test. We view the permutation test as a benchmark as it produces an estimate of the null distribution of the parameter estimates. However, the permutation test is computationally intensive and becomes more so when data are high-dimensional. The test we propose makes it feasible to estimate significance with a much lower computational burden.
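The benchmark permutation test can be sketched as follows (an illustrative stand-alone example, not the paper's code): the phenotype vector is permuted to break any genotype-phenotype association, the ridge model is refitted to each permuted data set, and the observed coefficients are compared with the resulting null distribution.

```python
# Sketch: a permutation test for linear ridge regression coefficients.
import numpy as np

rng = np.random.default_rng(3)
n, m = 100, 5
X = rng.normal(size=(n, m))
Y = 1.5 * X[:, 0] + rng.normal(size=n)  # only predictor 0 is associated
lam = 1.0

def ridge(X, Y, lam):
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ Y)

obs = ridge(X, Y, lam)

# Refit under B random permutations of the phenotype vector.
n_perm = 500
null = np.empty((n_perm, m))
for b in range(n_perm):
    null[b] = ridge(X, rng.permutation(Y), lam)

# Two-sided permutation p-values, with the usual +1 correction so that
# the estimated p-value is never exactly zero.
pvals = ((np.abs(null) >= np.abs(obs)).sum(axis=0) + 1) / (n_perm + 1)
```

Each *p*-value here costs *B* refits of the full model, which illustrates why the permutation approach becomes computationally burdensome for high-dimensional data and why a cheaper approximate test is attractive.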

We introduce the *p*-value trace, a plot of the negative logarithm of the *p*-values of the ridge regression coefficients with increasing shrinkage parameter. This plot enables the visualisation of the relative change in significance of each coefficient, and facilitates the identification of predictors most affected by increased penalisation in terms of significance.
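The construction of such a trace can be sketched as follows. This is an illustrative example only: it uses a small permutation test as a stand-in for the approximate test, and the grid of *λ* values is arbitrary; in practice each column of the resulting matrix would be plotted against *λ*.

```python
# Sketch: assembling the values behind a p-value trace, i.e. -log10(p)
# for each ridge coefficient over a grid of shrinkage parameters.
import numpy as np

rng = np.random.default_rng(4)
n, m = 80, 4
X = rng.normal(size=(n, m))
Y = X[:, 0] + rng.normal(size=n)

def ridge(X, Y, lam):
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ Y)

def perm_pvals(X, Y, lam, n_perm=200):
    """Two-sided permutation p-values for the ridge coefficients at lam."""
    obs = np.abs(ridge(X, Y, lam))
    null = np.array([np.abs(ridge(X, rng.permutation(Y), lam))
                     for _ in range(n_perm)])
    return ((null >= obs).sum(axis=0) + 1) / (n_perm + 1)

lams = [0.1, 1.0, 10.0, 100.0]
trace = np.array([-np.log10(perm_pvals(X, Y, lam)) for lam in lams])
# trace[k, j] holds -log10(p) for coefficient j at shrinkage lams[k];
# plotting each column against lambda gives the p-value trace.
```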

We apply the approximate test of significance for logistic ridge regression coefficients to a lung cancer data set, demonstrating the utility of the test when correlation exists among the predictors.

This paper is organised as follows. We first describe the approximate test of significance and the permutation test to which it is compared. We then describe the simulation studies used in this paper. In the Results section we evaluate the performance of the proposed test in a range of simulation scenarios, and apply the test to a lung cancer case-control data set. In the Discussion we consider the results and potential applications of the test.