Here, in the first few sub-sections, we will describe ProbABEL software, giving only the main outline of the underlying theory and with special emphasis on implementation and the options allowing to access specific analyzes within ProbABEL. In two last sub-sections, starting with the "Fixed effects model theory", we will give more in-depth review of the theory used by the package.
ProbABEL was implemented using code written in the C and C++ languages. The package consists of three executable files, used to perform linear, logistic, and Cox regressions, and a helper Perl script which facilitates the analysis of multiple chromosomes.
The package implements standard regression analysis methodology outlined in the section "Fixed effects model theory" and specific approximation to the mixed linear model described in the section "Two-step score test approximation to the mixed model". The key statistical tests performed by ProbABEL concern testing of the SNP effects. Here, we will describe the tests performed by ProbABEL using an example of linear regression; testing using other types of regression follows similar logic.
In linear regression, the expectation of the trait is described as
where Y is the vector of phenotypic values, X
g
is the design matrix containing data about predictors of interest (these involving SNP data), and X
x
is the design matrix containing other (nuisance) covariates. β
g
and β
x
are the vectors of corresponding fixed effects. The vector of phenotypes Y and the covariates matrix X
x
are provided in the phenotype file. The genotypic data are read from the genotype (dose or probability) files and are analyzed one SNP at a time.
Our interest lies in testing the (components of) β
g
. ProbABEL provides the estimates of the components of the vector β
g
and corresponding standard errors, and, in most cases, the test of the general hypothesis concerning the involvment of the SNP, obtained by comparison of the estimated model to the null model formulated as βg,0= 0, where 0 is the vector of zeros.
Under the general genotypic model, X
g
is a matrix with the number of rows equal to the number of people under consideration and with two columns. Each row of the matrix contains the estimated probabilities that a person has genotype "AA" or "AB". Then, the vector of genotypic effects is described with two parameters: β
g
= (β
AA
, β
AB
). Thus formulated, the model allows for the estimation of a general genotypic two-degree of freedom model. Further, a number of sub-models can be formulated by setting restrictions on these parameters. The "dominant B allele" model is formalized as β
AB
= 0, "dominant A" (the same as "recessive B") as β
AA
= β
AB
, the additive model as 2 · β
AB
= β
AA
, and the over-dominant model as β
AA
= 0. Note that the additive model is equivalent to performing linear regression on the estimated dose of allele "A" defined as P
AB
+ 2 · P
AA
. The latter model is tested when the allelic dosage file is provided as the input for ProbABEL, while the full range of described models is tested if the estimated probability files (option "--ngpreds = 2") are supplied.
ProbABEL can also test for interaction between a specified covariate and the set of SNPs; for that alternative, the interaction covariate should be specified using the "--interaction N" option, where N corresponds to the number of the column of the design matrix X
x
, which contains that covariate. If this option is used, the expectation of the trait is defined as
where W is a diagonal matrix, whose diagonal elements are formed by substituting the interaction covariate to the matrix and β
gxe
is the vector of interaction regression coefficients.
Analysis of population-based data
If the study subjects can be assumed to be genetically "independent", in the sense that they come from the general outbred population without a marked degree of stratification and that cryptic relatedness is absent, the data can be effectively analyzed using standard linear fixed effects regression methodology, as described in section "Fixed effects model theory". The (small) effects of confounding can be corrected posterior to analysis using the genomic control [10] procedure. If a marked degree of stratification is present, such methods as structured association analysis and EIGENSTRAT [11] can be combined with the standard methods.
Using standard methods, the estimates of the parameters can be obtained using the standard formula 1 (see "Fixed effects model theory" below), which provides maximum likelihood estimates if (XTX)-1 exists. The latter condition is fulfilled for virtually all analyses; practically, exceptions may occur for SNPs with very low minor allele frequencies or poor quality imputations.
The standard errors are computed as square roots of the diagonal elements of the parameter estimates' variance-covariance matrix. This matrix is computed using one of three different methods: the standard method, with residual variance estimated under the alternative (formula 2, see "Fixed effects model theory" below) or null hypothesis concerning SNPs (option "--score"), or using a "sandwich" estimator (formula 5, see "Fixed effects model theory"), resulting in robust standard errors (option "--robust"). The value of the global likelihood ratio test statistic, testing the joint significance of all terms involving SNP, is computed using the formula 3 (see "Fixed effects model theory"). In this test, the null model is formulated as βg,0= 0, where 0 is the vector of zeros. If an interaction term is present, that is also set to zero under the null: βgxe,0= 0. The likelihoods involved are computed using the formula 4 (see "Fixed effects model theory") with the values of the parameters fixed at the point of the maximum likelihood estimate obtained with 1 (see "Fixed effects model theory").
Analysis of data on subjects with differential relationships
In the case of a study involving subjects with markedly differential relationships (family-based designs, studies of human genetically isolated populations, studies in outbred animal populations), a mixed model approach may be used, in which a random effect ("heritability") accounts for similarities between the phenotypes of study subjects [12]. However, the estimation of the full mixed model using either maximum likelihood or the restricted maximum likelihood approach is computationally demanding, if not unfeasible, within the framework of GWAS [13], and therefore a two-step mixed model-based approach [13–15] is utilized in ProbABEL.
In this approach, the mixed model containing all terms but those involving SNP is first estimated by maximizing the likelihood function provided by the expression 7 (see section "Two-step score test approximation to the mixed model" for details). These estimates are then used in the second step to compute estimates of the SNP effects (formula 8 of "Two-step score test approximation to the mixed model") and the variance-covariance matrix of these estimates (formula 10, see "Two-step score test approximation to the mixed model"). These values can be used to perform a score test for association. The second step of a mixed-model based score test for association is available in ProbABEL using option "--mmscore IVFile", where IVFile is the name of a file containing the inverse of the variance-covariance matrix (
of formulas 8 and 10, see "Two-step score test approximation to the mixed model") evaluated at the point of the maximum likelihood estimates obtained in step one. The phenotypes analyzed in the second step are residuals (as specified by the formula 9, see "Two-step score test approximation to the mixed model") obtained by subtracting the trait values expected under the mixed model-based estimates of the fixed effects from the original trait values.
Step one of the regression procedure can be performed using our GenABEL software [3]. This software performs genomic data based estimation of the kinship matrix as described in section "Estimation of genomic kinship matrix" using the ibs (...,weight="freq") function, and performs maximum likelihood estimation of the step-one mixed model using the polygenic() function. The resulting object contains the inverse variance-covariance matrix (object$InvSigma), which can be saved as a text file and used in ProbABEL analysis. The residuals to be used as trait values in step two of the analysis can be accessed through object$residualY.
Input and output
The input consists of a phenotypic data file and a set of files describing the imputed genotypic data. The phenotypic file provides data on the outcome of interest and any additional covariates to be included in the analysis. The genotypic data files, at present, utilize the MACH imputation software output format. Minimally, a file with estimated probability distributions ("mlprob") or allelic dosages ("mldose") and the "mlinfo" file containing information about allele coding and overall imputation quality should be provided. Optionally, a map file in HapMap format, containing chromosome and location information, may be supplied. Information contained in the latter two files is not used in analysis, but is forwarded directly to the output. If the mixed-model based score test for association in related individuals is to be computed, a file containing the inverse matrix of variances and covariances between the phenotypes of study individuals should be supplied as a part of the input. The output of the program consists of one line for each SNP tested, containing information about the SNP supplied as part of the input, as well as the results from analysis (estimates of the coefficients of regression, standard errors of the coefficients, and test statistic values).
Fixed effects model theory
Most of the fixed effects model theory outlined here is standard and can be found in textbooks, such as "Generalized, Linear, and Mixed Models" [16]. Specific references are provided when this is not the case.
Linear regression assuming normal distribution
Standard linear regression theory is used to estimate coefficients of regression and their standard errors. We assume linear model with expectation
and variance-covariance matrix
where Y is the vector of phenotypes of interest, X is design matrix, β is the vector of regression parameters, σ2 is variance and I is identity matrix.
The maximum likelihood estimates (MLEs) for the regression parameters is given by
and MLE of the residual variance is
where N is the number of observations and r
X
is rank of X (number of columns of the design matrix).
The variance-covariance matrix for the parameter estimates under alternative hypothesis can be computed as
For the j-the element
(j) of the vector of estimates the standard error under alternative hypothesis is given by the square root of the corresponding diagonal element of the above matrix, var
(jj), and the Wald test can be computed with
which asymptotically follows the χ2 distribution with one degree of freedom under the null hypothesis. When testing significance for more than one parameter simultaneously, several alternatives are available. Let us partition the vector of parameters into two components, β = (β
g
, β
x
), and our interest is testing the parameters contained in β
g
(SNP effects), while β
x
(e.g. effects of sex, age, etc.) are considered nuisance parameters. Let us define the vector of the parameters of interest which are fixed to certain values under the null hypothesis as βg,0(usually, βg,0= 0, vector of zeros).
The likelihood ratio test can be obtained with
which under the null hypothesis is asymptotically distributed as χ2with number of degrees of freedom equal to the number of parameters specified by β
g
. Assuming the normal distribution, the log-likelihood of a model specified by the vector of parameters β and residual variance σ2 can be computed as
Secondly, the Wald test can be used; for that the inverse variance-covariance matrix of
g
should be computed as
where
correspond to sub-matrices of the inverse of the variance-covariance matrix of
, involving either only covariances between the parameters of interest (g, g), only the nuisance parameters (x, x) or between the parameters of interest and nuisance parameters, (x, g), (g, x).
The Wald test statistics is then computed as
which asymptotically follows the χ2 distribution with the number of degrees of freedom equal to the number of parameters specified by β
g
. The Wald test generally is computationally easier than the LRT, because it avoids estimation of the model specified by the parameter's vector (βg,0,
x
).
Lastly, similar to the Wald test, the score test can be performed by use of
instead of var
.
Logistic regression
For logistic regression, the procedure to obtain parameters estimates, their variance-covariance matrix, and tests are similar to these outlined above with several modifications.
The expectation of the binary trait is defined as expected probability of the event as defined by the logistic function
The estimates of the parameters are obtained not in one step, as is the case of the linear model, but using iterative procedure (iteratively re-weighted least squares). This procedure is not described here for the sake of brevity.
The log-likelihood of the data is computed using binomial probability formula:
where log
e
π is a vector obtained by taking the natural logarithm of every value contained in the vector π.
Robust variance-covariance matrix of parameter estimates
For computations of robust variance-covariance matrix we use White's sandwich estimator [17, 18], which is equivalent to the "HC0" estimator described by Zeilers and Lumley in "sandwich" package for R.
For linear model, the variance-covariance matrix of parameter estimates is computed using formula
where R is a diagonal matrix containing squares of residuals of Y. The same formula may be used for "standard" analysis, in which case the elements of the R matrix are constant, namely mean residual sum of squares (the estimate of residual variance,
).
Similar to that, the robust matrix is computed for logistic regression with
where W is the diagonal matrix of "weights" used in logistic regression.
Cox proportional hazards model
The implementation of the Cox proportional hazard model used in ProbABEL is entirely based on the code of R library survival developed by Thomas Lumley (function coxfit2), and is therefore not described here.
Two-step score test approximation to the mixed model
The framework for analysis of data containing differential relationships follows the two-step logic developed in the works of Aulchenko et al. [13] and Chen and Abecasis [14]. General analysis model is a linear mixed model which defines the expectation of the trait as
identical to that defined for linear model. To account for possible correlations between the phenotypes of study subjects the variance-covariance matrix is defined to be proportional to the linear combination of the identity matrix I and the relationship matrix Φ:
where h2 is the heritability of the trait. The relationship matrix Φ is twice the matrix containing the coefficients of kinship between all pairs of individuals under consideration; its estimation is discussed in a separate section "Estimation of genomic kinship matrix".
Estimation of thus defined model is possible by numerical maximization of the likelihood function, however, the estimation of such model for large data sets is not computationally feasible for hundreds of thousands to millions of SNPs tested in the context of GWAS, as we have demonstrated previously [13].
Two-step score test for association
A two-step score test approach is therefore used to decrease the computational burden. Let us re-write the expectation of the trait by splitting the design matrix in two parts, the "base" part X
x
, which includes all terms not changing across all SNP models fit in GWAS (e.g. effects of sex, age, etc.), and the part including SNP information, X
g
:
Note that the latter design matrix may include not only the main SNP effect, but e.g. SNP by environment interaction terms.
At the first step, linear mixed model not including SNP effects
is fitted. The maximum likelihood estimates (MLEs) of the model parameters (regression coefficients for the fixed effects
x
, the residual variance
and the heritability
) can be obtained by numerical maximization of the likelihood function
where
is the inverse and
is the determinant of the variance-covariance matrix.
At the second step, the estimates of the fixed effects of the terms involving SNP are obtained with
where
is the variance-covariance matrix at the point of the MLE estimates of
and
and
is the vector of residuals obtained from the base regression model. Under the null model, the inverse variance-covariance matrix of the parameter's estimates is defined as
Thus the score test for joint significance of the terms involving SNP can be obtained with
where βg,0are the values of parameters fixed under the null model. This test statistics under the null hypothesis asymptotically follows the χ2 distribution with the number of degrees of freedom equal to the number of parameters tested. The significance of an individual j-the elements of the vector
g
can be tested with
where
is square of the j-th element of the vector of estimates
g
, and
corresponds to the j-th diagonal element of
. This statistics asymptotically follows
.
Estimation of genomic kinship matrix
The relationship matrix Φ used in estimation of the linear mixed model is twice the matrix containing the coefficients of kinship between all pairs of individuals under consideration. This coefficient is defined as the probability that two gametes randomly sampled from each member of the pair are identical-by-descent (IBD), that is they are copies of exactly the same ancestral allele. The expectation of kinship can be estimated from pedigree data using standard methods, for example the kinship for two outbred sibs is 1/4, for grandchild-grandparent is 1/8, etc. However, in many situations, pedigree information may be absent, incomplete, or not reliable. Moreover, the estimates obtained using pedigree data reflect the expectation of kinship, while the true realization of kinship may vary around this expectation. In presence of genomic data it may therefore be desirable to estimate the kinship coefficient from these, and not from pedigree. It can be demonstrated that unbiased and positive semi-definite estimator of the kinship matrix [19] can be obtained by computing the kinship coefficients between individuals i and j with
where L is the number of loci, p
l
is the allelic frequency at l-th locus and g
l, j
is the genotype of j-th person at the l-th locus, coded as 0, 1/2, and 1, corresponding to the homozygous, heterozygous, and other type of homozygous genotype [11, 15, 19]. The frequency is computed for the allele which, when homozygous, corresponds to the genotype coded as "1'.