Identifying main effects and epistatic interactions from large-scale SNP data via adaptive group Lasso

Background: Single nucleotide polymorphism (SNP) based association studies aim at identifying SNPs associated with phenotypes, for example, complex diseases. The associated SNPs may influence the disease risk individually (main effects) or behave jointly (epistatic interactions). For the analysis of high-throughput data, the main difficulty is that the number of SNPs far exceeds the number of samples. This difficulty is amplified when identifying interactions. Results: In this paper, we propose an Adaptive Group Lasso (AGL) model for large-scale association studies. Our model enables us to analyze SNPs and their interactions simultaneously. We achieve this by introducing a sparsity constraint in our model, based on the fact that only a small fraction of SNPs is disease-associated. To reduce the number of false positive findings, we develop an adaptive reweighting scheme to enhance sparsity. In addition, our method treats SNPs and their interactions as factors and identifies them in a grouped manner. Thus, it is flexible enough to analyze various disease models, especially for interaction detection. However, because of the intensive computation required when millions of interaction terms need to be searched during model fitting, our method has to be combined with filtering methods when applied to genome-wide data for detecting interactions. Conclusion: Using a wide range of simulated datasets and a real dataset from WTCCC, we demonstrate the advantages of our method.


Lasso-type methods
In this section, we shall give a brief introduction to Lasso-type methods.
In the usual linear regression setup, we have a continuous response y ∈ R^N, an N × p design matrix X and a parameter vector β ∈ R^{p+1}. The Lasso estimator [7] is defined as:

$$\hat{\beta} = \arg\min_{\beta}\ \frac{1}{2}\Big\|y - \beta_0\mathbf{1} - \sum_{j=1}^{p} x_j\beta_j\Big\|_2^2 + \gamma\sum_{j=1}^{p}|\beta_j|, \qquad (1)$$

where γ is a regularization parameter, x_j is an N × 1 vector corresponding to the j-th column of the design matrix X, and $\|v\|_2^2 = \sum_{i=1}^{n} v_i^2$ for any vector v ∈ R^n. The sparsity penalty $\sum_{j=1}^{p}|\beta_j|$ encourages many β_j's to be zero.
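To make the soft-thresholding mechanism behind the ℓ1 penalty concrete, here is a minimal coordinate descent sketch for the Lasso. This is not the code used in the paper; the function names and the standardization convention are our own.

```python
import numpy as np

def soft_threshold(z, t):
    """Soft-thresholding operator, the building block of Lasso coordinate descent."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, gamma, n_iter=100):
    """Minimize 1/(2N) ||y - X beta||^2 + gamma * ||beta||_1 by coordinate descent.
    Columns of X are assumed standardized so that (1/N) x_j^T x_j = 1;
    the intercept is omitted for simplicity."""
    N, p = X.shape
    beta = np.zeros(p)
    r = y - X @ beta  # residual
    for _ in range(n_iter):
        for j in range(p):
            r = r + X[:, j] * beta[j]           # remove j-th contribution
            z = X[:, j] @ r / N                 # partial correlation
            beta[j] = soft_threshold(z, gamma)  # coordinate-wise update
            r = r - X[:, j] * beta[j]           # restore residual
    return beta
```

For γ at or above max_j |x_j^T y|/N the solution is identically zero, which is the same boundary behavior exploited later when constructing the regularization path.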
When categorical predictors (i.e., factors represented by dummy variables) are present in linear regression, the Lasso solution is not satisfactory since it only selects dummy variables individually. To address this issue, Yuan and Lin [10] proposed to treat the dummy variables as a group, and then imposed a sparsity constraint at the group level.
For convenience, we shall use the following notation to describe the model with grouped variables: Suppose we have a response variable y ∈ R^N and an N × p design matrix X collecting N samples with p variables. These p variables are partitioned into J disjoint groups, with the j-th group consisting of p_j variables. Clearly, we have $\sum_j p_j = p$. We shall use X_j, a submatrix of size N × p_j, to denote the columns of X corresponding to the j-th group. Similarly, X_{ij}, a submatrix of size 1 × p_j, corresponds to the i-th sample and the j-th group. We also use β_j to denote the coefficient vector corresponding to the j-th group.
For the regression problem of J groups with the j-th group consisting of p_j variables, the Group Lasso estimator [10] is defined as:

$$\hat{\beta} = \arg\min_{\beta}\ \frac{1}{2}\Big\|y - \beta_0\mathbf{1} - \sum_{j=1}^{J} X_j\beta_j\Big\|_2^2 + \gamma\sum_{j=1}^{J}\sqrt{p_j}\,\|\beta_j\|_2, \qquad (2)$$

where $\beta_j = [\beta_{j,1}, \beta_{j,2}, \cdots, \beta_{j,p_j}]^T$ for j = 1, ···, J, and β_0 is the intercept in the linear model.
The term $\|\beta_j\|_2 = \sqrt{\beta_{j,1}^2 + \beta_{j,2}^2 + \cdots + \beta_{j,p_j}^2}$ imposes a constraint that selects a group of variables rather than a single variable. Notice that the regularization term $\sum_{j=1}^{J}\sqrt{p_j}\,\|\beta_j\|_2$ involves the 2-norm $\|\beta_j\|_2$ of β_j rather than the squared 2-norm $\|\beta_j\|_2^2$. The "Lasso" in the name "Group Lasso" refers to the sum of the absolute values of the 2-norms, i.e.,

$$\sum_{j=1}^{J}\big|\,\|\beta_j\|_2\big|. \qquad (3)$$

In the Group Lasso (GL) model, the group structure is assumed to be known, and this structure is explicitly made use of in the regularization. The Elastic net introduced in [12] is another type of regularization method that models grouping effects. The Elastic net estimator is defined as:

$$\hat{\beta} = \arg\min_{\beta}\ \|y - X\beta\|_2^2 + \gamma_1\sum_{j=1}^{p}|\beta_j| + \gamma_2\sum_{j=1}^{p}\beta_j^2. \qquad (4)$$

Notice that the regularization term $\sum_{j=1}^{p}\beta_j^2$ is the squared 2-norm $\|\beta\|_2^2$ of β. In the Elastic net model (4), no group structure is explicitly imposed. For two variables x_i and x_j with correlation ρ, however, their estimated coefficients β_i and β_j tend to be shrunken toward the same value as γ_2 and ρ increase. This is known as the grouping effect encouraged by the Elastic net.
In this paper, we aim to identify main effects and interactions by analyzing SNP data. For main effects, the Lasso and the Elastic net can be used with a presumed model structure, e.g., an additive model. The Group Lasso can be used without a presumed model structure. This issue is discussed carefully in the main text of this paper.
Thus, a two-locus model can be represented using 9 dummy variables coding the 9 genotype combinations. Further, the group structure of these 9 dummy variables arises naturally from the two-locus model. Hence, the Group Lasso model can serve as a basic model for identifying interactions. Based on this consideration, we propose our Adaptive Group Lasso model. The Lasso and the Elastic net could also be applied to identify interactions. One way is to impose a model structure (e.g., an additive model), as is done for identifying main effects. Another way is to use 9 dummy variables coding a two-locus model. Notice, however, that no correlation exists among the 9 dummy variables, i.e., ρ = 0. Thus, these dummy variables cannot be grouped by the Elastic net. In this sense, both the Lasso and the Elastic net can only do selection at the variable level rather than at the group level.
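As an illustration of this grouped dummy coding, here is a sketch under our own conventions (the function name is ours; genotypes are assumed coded 0/1/2 as copies of the minor allele):

```python
import numpy as np

def two_locus_dummies(snp_a, snp_b):
    """Code a two-locus interaction as a group of 9 dummy variables,
    one per genotype combination of the two SNPs.
    Genotypes are assumed coded 0, 1, 2 (copies of the minor allele)."""
    snp_a, snp_b = np.asarray(snp_a), np.asarray(snp_b)
    D = np.zeros((snp_a.shape[0], 9))
    D[np.arange(snp_a.shape[0]), snp_a * 3 + snp_b] = 1.0  # combo index 0..8
    return D
```

Each sample activates exactly one of the 9 dummies, so distinct columns within a group are orthogonal; this is the property exploited when standardizing the block matrices in the algorithm section.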
2 Theoretical Justification of Adaptive Group Lasso

Connection with the Majorization-Minimization algorithm
Our iteratively reweighted algorithm is a special case of Majorization-Minimization (MM) algorithms [4]. To establish the connection, consider the problem

$$\min_{\beta}\ -\ell(\beta) + \gamma\sum_{j=1}^{J}\sqrt{p_j}\,\log\big(\|\beta_j\|_2 + \varepsilon\big), \qquad (6)$$

where ℓ(β) is the log-likelihood of logistic regression:

$$\ell(\beta) = \sum_{i=1}^{N}\Big[y_i X_i\beta - \log\big(1 + \exp(X_i\beta)\big)\Big].$$

The problem (6) can be rewritten as

$$\min_{\alpha,\beta}\ \alpha + \gamma\sum_{j=1}^{J}\sqrt{p_j}\,\log\big(\|\beta_j\|_2 + \varepsilon\big) \quad \text{subject to } -\ell(\beta) \le \alpha. \qquad (8)$$

This is a problem that minimizes a concave function on a convex set. Instead of minimizing the concave function directly, we can minimize its tangent at a local point v_0, since a concave function is upper bounded by its tangent:

$$f(v) \le f(v_0) + \nabla f(v_0)^T (v - v_0).$$

Therefore, applying this bound to the log terms at the current estimate $\hat\beta$ and dropping constants, the problem (8) can be further written as

$$\min_{\alpha,\beta}\ \alpha + \gamma\sum_{j=1}^{J}\frac{\sqrt{p_j}\,\|\beta_j\|_2}{\|\hat{\beta}_j\|_2 + \varepsilon} \quad \text{subject to } -\ell(\beta) \le \alpha. \qquad (11)$$

Finally, the problem (11) can be rewritten as

$$\min_{\beta}\ -\ell(\beta) + \sum_{j=1}^{J}\gamma_j\|\beta_j\|_2, \quad \text{where } \gamma_j = \frac{\gamma\sqrt{p_j}}{\|\hat{\beta}_j\|_2 + \varepsilon}. \qquad (12)$$

This is the form we propose in Algorithm 1 in the paper.
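The resulting reweighting step is simple to state in code. The sketch below uses our own naming; ε plays the role of the small constant that keeps the tangent weights finite, and the exact weight form is the tangent majorization described above rather than the paper's verbatim Algorithm 1.

```python
import numpy as np

def mm_weights(beta_groups, gamma, group_sizes, eps=1e-6):
    """One MM reweighting step: majorizing the concave log penalty by its
    tangent at the current estimate yields a weighted Group Lasso with
    per-group weight gamma_j = gamma * sqrt(p_j) / (||beta_j||_2 + eps)."""
    return [gamma * np.sqrt(pj) / (np.linalg.norm(b) + eps)
            for b, pj in zip(beta_groups, group_sizes)]
```

Groups with large current norms receive small weights and are shrunken less, which is exactly the adaptive behavior motivating the concave penalty.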

Properties
The theoretical reason for minimizing the negative log-likelihood with a concave penalty (e.g., with the log function, as in (6)) is given in [1]. The basic idea is that the estimates of important variables should be shrunken less than those of unimportant ones. Based on this idea, Zou proposed the Adaptive Lasso [11] and the one-step estimator [13]. Wang et al. [9] considered the Adaptive Group Lasso in the following form:

$$\min_{\beta}\ -\ell(\beta) + \sum_{j=1}^{J}\gamma_j\|\beta_j\|_2, \qquad (13)$$

where each group has its own weight γ_j. They show that the Adaptive Group Lasso can do group selection consistently under some mild conditions. However, they do not show how to choose γ_j adaptively. In this paper, we adaptively choose γ_j by using the MM algorithm. The convergence of our algorithm can be proved following the idea in [13].

Details of the algorithm
We first solve the Group Lasso problem (14):

$$\min_{\beta}\ \frac{1}{2}\Big\|y - \beta_0\mathbf{1} - \sum_{j=1}^{J} X_j\beta_j\Big\|_2^2 + \sum_{j=1}^{J}\gamma_j\sqrt{p_j}\,\|\beta_j\|_2. \qquad (14)$$

It serves as the basis for solving the problem (15):

$$\min_{\beta}\ -\ell(\beta) + \sum_{j=1}^{J}\gamma_j\sqrt{p_j}\,\|\beta_j\|_2, \qquad (15)$$

where ℓ(β) is the log-likelihood of logistic regression:

$$\ell(\beta) = \sum_{i=1}^{N}\Big[y_i X_i\beta - \log\big(1 + \exp(X_i\beta)\big)\Big].$$

For convenience, we standardize the block matrix X_j such that $\frac{1}{N}X_j^T X_j = I_{p_j}$, where $I_{p_j}$ is a p_j × p_j identity matrix. After model fitting, the estimated coefficients are transformed back to the original scale. Notice that the standardization can be done efficiently: the inner product of any two dummy variables within the same group is always zero, so $\frac{1}{N}X_j^T X_j$ is diagonal, and simple scaling converts it to an identity matrix. General orthogonalization techniques such as the QR decomposition are not necessary.

Now let us consider a coordinate descent algorithm to solve (14). Suppose we have the estimates $\hat\beta_0$ and $\hat\beta_l$ for l ≠ j, and we would like to optimize (14) with respect to β_j. Computing the gradient at $\beta = \hat\beta$ gives the optimality conditions

$$-\frac{1}{N}X_j^T r_j + \gamma_j\sqrt{p_j}\,\frac{\beta_j}{\|\beta_j\|_2} = 0 \quad \text{if } \beta_j \neq 0, \qquad (17)$$

$$\Big\|\frac{1}{N}X_j^T r_j\Big\|_2 \le \gamma_j\sqrt{p_j} \quad \text{if } \beta_j = 0, \qquad (18)$$

where $r_j = y - \hat\beta_0\mathbf{1} - \sum_{l\neq j} X_l\hat\beta_l$ is the partial residual with the j-th group's contribution removed. Expressions (17) and (18) are the Karush-Kuhn-Tucker conditions given in [10]. Writing $S_j = \frac{1}{N}X_j^T r_j$, they can be combined into the coordinate-wise update

$$\hat\beta_j = \Big(1 - \frac{\gamma_j\sqrt{p_j}}{\|S_j\|_2}\Big)_+ S_j, \qquad (23)$$

where $(x)_+ = \max(x, 0)$. Therefore, iteratively updating β_j by (23) for j = 1, ···, J gives the solution of (14).

Now let us consider the logistic Group Lasso problem (15). We form a quadratic approximation to the log-likelihood at the current estimate $\hat\beta$:

$$\ell(\beta) \approx \ell(\hat\beta) + (\beta - \hat\beta)^T\nabla\ell(\hat\beta) + \frac{1}{2}(\beta - \hat\beta)^T H (\beta - \hat\beta),$$

where $\nabla\ell(\hat\beta)$ and H are the gradient and the Hessian matrix of ℓ(·) evaluated at $\hat\beta$:

$$\nabla\ell(\hat\beta) = X^T(y - \hat p), \qquad H = -X^T U X,$$

where U is a diagonal matrix with diagonal elements $u_i = \hat p(X_i)(1 - \hat p(X_i))$ and $\hat p(X_i) = 1/(1 + \exp(-X_i\hat\beta))$. Rearranging these expressions, we obtain

$$-\ell(\beta) \approx \frac{1}{2}\sum_{i=1}^{N} u_i\big(z_i - X_i\beta\big)^2 + C(\hat\beta),$$

where $z_i = X_i\hat\beta + (y_i - \hat p(X_i))/u_i$ is the working response and $C(\hat\beta)$ is a constant term only involving $\hat\beta$. Then, the problem (15) is converted to a least squares problem with a Group Lasso constraint:

$$\min_{\beta}\ \frac{1}{2}\sum_{i=1}^{N} u_i(z_i - X_i\beta)^2 + \sum_{j=1}^{J}\gamma_j\sqrt{p_j}\,\|\beta_j\|_2. \qquad (33)$$

Further making use of the upper bound on the Hessian terms u_i given in [3], we set $u_i = 0.25$ in the approximation, which saves the effort of recomputing the Hessian matrix; problem (33) then becomes

$$\min_{\beta}\ \frac{1}{8}\|z - X\beta\|_2^2 + \sum_{j=1}^{J}\gamma_j\sqrt{p_j}\,\|\beta_j\|_2. \qquad (34)$$

Problem (34) is exactly in the form (14) with equal observation weights, and thus can be solved by the coordinate descent updating formula

$$\hat\beta_j = \Big(1 - \frac{\gamma_j\sqrt{p_j}}{\|S_j\|_2}\Big)_+ S_j, \qquad S_j = \frac{1}{N}X_j^T r_j, \qquad (35)$$

where r_j is now the partial residual of the working response, with i-th element $z_i - \hat\beta_0 - \sum_{l\neq j} X_{il}\hat\beta_l$.
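The group-wise soft-thresholding update can be sketched as follows (a sketch under our own naming, assuming, as above, that X_j is standardized so that (1/N) X_j^T X_j = I):

```python
import numpy as np

def group_update(Xj, r_partial, gamma_j, N):
    """One block coordinate descent update: group-wise soft-thresholding of
    S_j = (1/N) X_j^T r_j, where r_j is the partial residual with group j's
    contribution removed. Returns the new coefficient vector for group j."""
    pj = Xj.shape[1]
    Sj = Xj.T @ r_partial / N
    norm = np.linalg.norm(Sj)
    if norm <= gamma_j * np.sqrt(pj):   # KKT condition: group stays at zero
        return np.zeros(pj)
    return (1.0 - gamma_j * np.sqrt(pj) / norm) * Sj   # shrunken update
```

The whole-group zero test is what makes the method select or discard an interaction as a unit rather than dummy by dummy.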

Implementation issues
The parameter γ is typically unknown. Here we use cross-validation to determine its value. As cross-validation needs a sequence of γ values, we construct the sequence as follows. We set the maximum value to

$$\gamma_{\max} = \max_{j}\ \frac{1}{\sqrt{p_j}}\Big\|\frac{1}{N}X_j^T y\Big\|_2,$$

the smallest value for which all groups are estimated as zero, and the minimum value to $\gamma_{\min} = \varepsilon\,\gamma_{\max}$. Then, we construct a sequence of γ values decreasing from γ_max to γ_min uniformly on the log scale: γ = [γ_max = γ_1 > γ_2 > ··· > γ_K = γ_min]. To solve $\hat\beta(\gamma_k)$, k = 1, ..., K, we exploit the warm-start technique: $\hat\beta(\gamma_k)$ is used as the initial point for computing $\hat\beta(\gamma_{k+1})$. In doing so, we are able to do cross-validation over the γ values efficiently. In our experiments, we typically set K = 100 and ε = 0.001 when N > J; when N < J, we set ε = 0.05.
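The path construction and warm starts can be sketched as follows. The relation γ_min = ε·γ_max and the function names are our own conventions, and `solver` is a hypothetical fitting routine standing in for the coordinate descent solver.

```python
import numpy as np

def gamma_path(gamma_max, eps, K=100):
    """K regularization values decreasing from gamma_max to
    gamma_min = eps * gamma_max, uniformly spaced on the log scale."""
    return np.logspace(np.log10(gamma_max), np.log10(eps * gamma_max), num=K)

def solve_path(gamma_max, eps, solver, K=100):
    """Warm starts: the solution at gamma_k initializes the solve at gamma_{k+1}.
    `solver(gamma, beta_init)` is a hypothetical fitting routine."""
    betas, beta = [], None
    for g in gamma_path(gamma_max, eps, K):
        beta = solver(g, beta)
        betas.append(beta)
    return betas
```

Solving from the largest γ downward keeps each solve close to the previous solution, which is what makes cross-validation over the whole grid affordable.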
We also know from expression (35) that many β_j's remain zero during the whole optimization process. Naively cycling over all groups in each iteration is therefore inefficient. We make use of this property as in [2], which leads to considerable speedup:
• Step 1: We complete a cycle through all groups and identify the active set (the groups with nonzero β_j).
• Step 2: We iterate only on the active set until convergence.
• Step 3: We cycle through all groups again. If this does not change the active set, the optimization process has converged; otherwise, go to Step 2.
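The three steps above can be sketched as a generic active-set loop. The callbacks `update_group` and `group_norm` are hypothetical stand-ins for the block updates derived in the previous section.

```python
def active_set_solve(J, update_group, group_norm, tol=1e-8, max_iter=100):
    """Active-set strategy for block coordinate descent.
    `update_group(j)` performs one update of group j and returns the size of
    the change; `group_norm(j)` returns ||beta_j||_2."""
    # Step 1: complete a cycle through all groups; identify the active set.
    for j in range(J):
        update_group(j)
    active = {j for j in range(J) if group_norm(j) > 0}
    while True:
        # Step 2: iterate only on the active set until convergence.
        for _ in range(max_iter):
            if sum(update_group(j) for j in active) < tol:
                break
        # Step 3: cycle through all groups; stop if the active set is unchanged.
        for j in range(J):
            update_group(j)
        new_active = {j for j in range(J) if group_norm(j) > 0}
        if new_active == active:
            return active
        active = new_active
```

Because the final full cycle re-checks every group, a group wrongly dropped in Step 2 can re-enter before the algorithm declares convergence.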
4 More Results

Comparison with MDR
We choose MDR [6] for comparison because MDR is a very popular method. Our approach is similar to MDR in the sense that both methods enumerate all possible interactions during the model fitting process. The difference is that our approach analyzes all interactions simultaneously, while MDR searches for the best model in a sequential manner. For the comparison with MDR, we use six benchmark epistasis models [5] (details are in the supplementary document). Various sources of noise, such as genotyping error (GE), missing data (MS), phenocopy (PC), and genetic heterogeneity (GH), are added to the simulated data. The reader is referred to [5] for details of these six models and the data simulation procedure.
The MDR algorithm is one of the most popular tools for gene-gene interaction detection. It has very good power for identifying associated SNPs in the presence of various kinds of noise, except that its power decreases significantly in the presence of genetic heterogeneity. Our method is similar to MDR in the sense that both methods enumerate all possible interactions. The difference is that MDR searches all interactions in a sequential manner and picks the best one, while our method analyzes all interactions simultaneously. The comparison results are shown in Table 1. The advantage of our method is that it significantly increases the power when genetic heterogeneity is present. Concretely, for Model 2-1, Model 2-3, Model 2-4, Model 2-5 and Model 2-6, the power of MDR drops from greater than 80% to only 5%. For Model 2-2, its power drops from 100% to 41%. In contrast, the power of our method drops from around 100% to 50% for Model 2-3, Model 2-4, Model 2-5 and Model 2-6, and remains the same for Model 2-1 and Model 2-2. This clearly shows that our method is more robust in the presence of genetic heterogeneity.
MDR searches over all possible combinations of SNPs. Thus, it is able to detect high-order interactions even when the main effects and low-order effects are not significant. However, it is difficult for MDR to characterize additive structures between associated SNPs. Genetic heterogeneity is one such case, and MDR performs poorly on it.
In contrast, our method does better when genetic heterogeneity is present. The reason is that our model is able to approximate the genetic heterogeneity model by making use of its additive structure. Consider a genetic heterogeneity model in which one subgroup of cases is driven by the interaction of (SNP k_1, SNP k_2),

$$\log\frac{P(y=1\mid X)}{P(y=0\mid X)} = \sum_{g}\alpha_g\, I\big((\mathrm{SNP}_{k_1}, \mathrm{SNP}_{k_2}) = g\big), \qquad (36)$$

and another subgroup is driven by the interaction of (SNP k_3, SNP k_4),

$$\log\frac{P(y=1\mid X)}{P(y=0\mid X)} = \sum_{g}\alpha_g\, I\big((\mathrm{SNP}_{k_3}, \mathrm{SNP}_{k_4}) = g\big), \qquad (37)$$

where the α_g depend on the particular disease model, g ranges over the two-locus genotypes, and I(·) is the indicator function (1 if its argument holds and 0 otherwise). Suppose half of the disease samples come from model (36) and the other half come from model (37). Our model approximates the genetic heterogeneity model in the following additive form:

$$\log\frac{P(y=1\mid X)}{P(y=0\mid X)} = \beta_0 + \sum_{g}\beta^{g}_{k_1,k_2}\, I\big((\mathrm{SNP}_{k_1}, \mathrm{SNP}_{k_2}) = g\big) + \sum_{g}\beta^{g}_{k_3,k_4}\, I\big((\mathrm{SNP}_{k_3}, \mathrm{SNP}_{k_4}) = g\big), \qquad (38)$$

where the coefficients $\beta^{g}_{k_1,k_2}$ and $\beta^{g}_{k_3,k_4}$ can be estimated by our proposed method. Noise SNPs are unlikely to enter the final model due to the sparsity penalty.
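A minimal sketch of how such a heterogeneous case group can be simulated (our own illustration, not the paper's simulation code; forcing a fixed risk genotype is a crude stand-in for sampling from a real penetrance table):

```python
import numpy as np

def simulate_heterogeneous_cases(n_cases, n_snps, maf=0.2, seed=0):
    """Simulate a case group under genetic heterogeneity: half the cases are
    driven by the interaction (SNP 0, SNP 1), the other half by (SNP 2, SNP 3).
    Genotypes are drawn under Hardy-Weinberg equilibrium with the given MAF,
    then each case is assigned a risk genotype at its causal pair."""
    rng = np.random.default_rng(seed)
    # 0/1/2 genotypes: number of minor alleles out of two Bernoulli(maf) draws
    G = rng.binomial(2, maf, size=(n_cases, n_snps))
    half = n_cases // 2
    G[:half, 0], G[:half, 1] = 1, 1   # first half: risk genotype at pair 1
    G[half:, 2], G[half:, 3] = 1, 1   # second half: risk genotype at pair 2
    return G
```

In such data neither causal pair explains all cases, which is exactly the situation the additive approximation (two interaction groups entering the model together) is meant to capture.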
We show a typical coefficient pattern in Fig. 1 when genetic heterogeneity is present. The interaction term (SNP k_1, SNP k_2) enters the model first, followed by (SNP k_3, SNP k_4). The interaction terms borrow strength from each other as they enter the model. Other noise terms enter the model gradually as γ decreases. Eventually, the signals of (SNP k_1, SNP k_2) and (SNP k_3, SNP k_4) become hard to distinguish from the noise terms. We also show the coefficients estimated at γ*, which is determined by cross-validation. One can see that the signals of (SNP k_1, SNP k_2) and (SNP k_3, SNP k_4) are much stronger than the noise terms (the middle panel of Fig. 1). Reweighted estimation further weakens the noise terms (the right panel of Fig. 1).

The influence of main effects
In the paper, we show that BEAM performs poorly when main effects are absent. Here we conduct a simulation study showing that the performance of BEAM depends on the main effects. Our simulation is based on the XOR model given in Table 2 with α = 1 and θ = 2. The minor allele frequency (MAF) varies from 0.1 to 0.5 such that the main effect of the XOR model decreases until no main effect is present, as shown in Fig. 2. We simulate 100 datasets. Each dataset contains 400 samples and 1000 SNPs. The results of BEAM and AGL are shown in Fig. 3:
• The two methods have comparable power for detecting main effects.
• BEAM has low power to detect interactions when MAF = 0.3, 0.4, 0.5. The reason is that the two associated SNPs with weak main effects are difficult to sample. When MAF = 0.2, the two associated SNPs have noticeable main effects and are more likely to be detected, so the power of BEAM increases dramatically. When MAF = 0.1, it is easier for BEAM to detect the two associated SNPs, but BEAM prefers to recognize them as main effects rather than as interactions.
• The performance of AGL is robust across different MAFs.
BEAM (B) loses its power when detecting multiple interaction pairs, while our AGL keeps its power (XOR model; see Table 2).

Detecting multiple interactions
In order to have a fair comparison regarding the power of detecting multiple interactions, we simulate interacting SNPs with noticeable main effects, since BEAM has low power when the main effects are weak. We design our experiment based on the XOR model shown in Table 2. Fig. 2 shows that the main effect of the XOR model tends to zero as the MAF increases from 0.1 to 0.5. We choose the MAF to be 0.2 such that the interacting SNPs have noticeable main effects. Under this setting, it is easier for BEAM to sample those associated SNPs. We simulate 1000 samples and 500 SNPs. Six of the 500 SNPs form three groups of interacting SNPs. Each group has the same characteristics as described by the XOR model. Fig. 4 shows the result. BEAM loses its power when detecting more than one interacting group. Concretely, the power of BEAM is about 97% when identifying one interaction, but it quickly drops to around 50% when identifying two interactions and further drops to 10% when identifying all three interactions. Our method maintains high power when identifying all three interactions.

WTCCC RA data results
We provide the results for the WTCCC Rheumatoid Arthritis (RA) dataset. These results are obtained under the main-effect model in a chromosome-wise manner, and are given in Table 3 ∼ Table 10.

Models used for comparison with MDR
We provide models used for comparison with MDR in Table 11.

Models used in comparison with BEAM
The epistatic models used in Section 3.1.3 are given in Table 12, Table 13, Table 14 and Table 15. All of these models can be found in [8].

Table 11: Two-locus penetrance functions of six two-locus models that exhibit epistatic interactions in the absence of main effects [5]. Here p and q denote allele frequencies of A and B, respectively.