Large-scale multiple testing in genome-wide association studies via region-specific hidden Markov models

Background: Identifying genetic variants associated with complex human diseases is a major challenge in genome-wide association studies (GWAS). Single nucleotide polymorphisms (SNPs) arising from the genetic background are often dependent. Two existing methods, the local index of significance (LIS) and the pooled local index of significance (PLIS), were proposed to model SNP dependence, and both assume that the whole chromosome follows a single hidden Markov model (HMM). However, SNP data are often collected from separate, heterogeneous regions of a single chromosome, which suggests that different chromosomal regions should follow different HMMs. In this research, we developed a data-driven penalized criterion combined with a dynamic programming algorithm to find change points that divide the whole chromosome into more homogeneous regions. Furthermore, we extended PLIS to analyze the dependent tests obtained from multiple chromosomes with different regions for GWAS.

Results: The simulation results show that our new criterion improves the performance of the model selection procedure and that our region-specific PLIS (RSPLIS) method is better than PLIS at detecting disease-associated SNPs when there are multiple change points along a chromosome. We applied our method to the Daly study; compared with PLIS, RSPLIS detected disease-associated SNPs more accurately.

Conclusions: The genomic rankings based on our method differ from the rankings based on PLIS. Specifically, RSPLIS ranked genetic variants with weak effect sizes more efficiently and with greater power.

the locus $q_i$ to the locus $q_j$, and let $\ln P(Z^{(c)}_{[q_i:q_j]} \mid \Psi_c, m_c, w_c)$ be the maximum log-likelihood of the data on $[q_i : q_j]$ of the full observation sequence $Z^{(c)}$, where $q_i, q_j \in w_c$, $q_0 = 0 < q_i < q_j < q_{(|w_c|+1)} = L_c$, and $0 < i < j < |w_c| + 1$.
Then we define the dimension $D_{(m_c, w_c)}$ as $(|w_c| + 1) D_{m_c}$, where $D_{m_c}$ is the number of parameters, which depends only on $m_c$. In our model, $D_{m_c} = 3m_c + 2$ for chromosome $c$. Thus, given $m_c = i$ and $w_c$, the objective function of step 2 in Algorithm 1 becomes:

Then, given $m_c = i$ and $K_c \le K_{\max}$, applying dynamic programming, we obtain the following recursion:

Given $\lambda_c$ and $m_c = i$, this recursion yields a dynamic programming algorithm that finds the optimal change point set $\hat{w}_{c,t,i}$ in step 2 of Algorithm 1.
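The dimension formula and the idea of a penalized dynamic programming search can be illustrated with a minimal Python sketch. It is not the paper's algorithm: the segment cost below is a stand-in (the negative Gaussian log-likelihood with unit variance, up to a constant) rather than the HMM maximum log-likelihood, and the names `dimension`, `best_change_points`, `lam`, `k_max`, and `min_len` are hypothetical. Only the dimension $(|w_c| + 1)(3m_c + 2)$ is taken from the text.

```python
def dimension(m, num_change_points):
    """Model dimension (|w_c| + 1) * D_m with D_m = 3m + 2, as stated in the text."""
    return (num_change_points + 1) * (3 * m + 2)


def best_change_points(z, m, lam, k_max, min_len=1):
    """Penalized change point search by dynamic programming (illustrative only).

    Assumption: the cost of segment z[i:j] is 0.5 * (within-segment sum of
    squares), a stand-in for the negative maximum log-likelihood of an HMM
    fitted to that segment.
    """
    n = len(z)
    prefix, prefix_sq = [0.0], [0.0]
    for v in z:
        prefix.append(prefix[-1] + v)
        prefix_sq.append(prefix_sq[-1] + v * v)

    def cost(i, j):
        length = j - i
        s = prefix[j] - prefix[i]
        ss = prefix_sq[j] - prefix_sq[i]
        return 0.5 * (ss - s * s / length)

    inf = float("inf")
    # best[k][j]: minimal total cost of splitting z[0:j] with k change points
    best = [[inf] * (n + 1) for _ in range(k_max + 1)]
    back = [[0] * (n + 1) for _ in range(k_max + 1)]
    for j in range(min_len, n + 1):
        best[0][j] = cost(0, j)
    for k in range(1, k_max + 1):
        for j in range((k + 1) * min_len, n + 1):
            for i in range(k * min_len, j - min_len + 1):
                cand = best[k - 1][i] + cost(i, j)
                if cand < best[k][j]:
                    best[k][j] = cand
                    back[k][j] = i

    # Pick the number of change points that minimizes cost + penalty.
    k_best = min(range(k_max + 1),
                 key=lambda k: best[k][n] + lam * dimension(m, k))

    # Backtrack to recover the change point positions.
    change_points = []
    j = n
    for k in range(k_best, 0, -1):
        j = back[k][j]
        change_points.append(j)
    return change_points[::-1]
```

For example, `best_change_points([0.0] * 20 + [5.0] * 20, m=2, lam=1.0, k_max=3)` places a single change point at position 20, because adding further change points no longer reduces the cost but increases the penalty.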
The derivation of the RSPLIS procedure and proof of Theorem 1

The derivation of the RSPLIS procedure

The derivation involves three steps: (i) connecting the multiple testing problem to a weighted classification problem; (ii) deriving an oracle procedure for FDR control; and (iii) developing a data-driven procedure that mimics the oracle procedure.
Let $\beta$ be the relative cost of a false positive to a false negative. Consider a weighted classification problem with loss function

where $\sum_c L_c$ is the total number of SNPs from all chromosomes. Under mild conditions, the multiple testing problem is equivalent to a weighted classification problem. Specifically, let $\mathcal{U}_\alpha$ be the collection of all $\alpha$-level FDR procedures of the form $\delta = I(T < C\mathbf{1})$. Suppose that the classification risk with the above loss function is minimized by $\delta_\beta\{T, C(\beta)\}$, so that $T$ is optimal in the weighted classification problem. If $T \in \mathcal{T}$, then $T$ is also optimal in the multiple testing problem, in the sense that for each FDR level $\alpha$ there exists a unique $\beta(\alpha)$, and hence $C\{\beta(\alpha)\} = C(\alpha)$, such that $\delta_{\beta(\alpha)}\{T, C(\alpha)\}$ controls the FDR at level $\alpha$ with the smallest FNR level among all testing rules in $\mathcal{U}_\alpha$.
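For orientation, a weighted classification loss of the kind described here is commonly written as follows. This is a sketch of the standard form, not a recovery of the paper's own display: the symbols $\theta^{(c)}_{rl}$ (true state, 1 for a disease-associated SNP), $\delta^{(c)}_{rl}$ (decision, 1 for rejection), and $N$, as well as the normalization and index ranges, are assumptions.

```latex
L_\beta(\theta, \delta)
  = \frac{1}{N} \sum_{c=1}^{C} \sum_{r=1}^{R_c} \sum_{l=1}^{L_{cr}}
    \left\{ \beta \,\bigl(1 - \theta^{(c)}_{rl}\bigr)\, \delta^{(c)}_{rl}
          + \theta^{(c)}_{rl}\, \bigl(1 - \delta^{(c)}_{rl}\bigr) \right\},
\qquad N = \sum_{c=1}^{C} L_c .
```

Under this assumed form, rejecting a hypothesis costs $\beta\, P(\theta^{(c)}_{rl} = 0 \mid Z)$ in posterior expectation and not rejecting costs $P(\theta^{(c)}_{rl} = 1 \mid Z)$, so the risk is minimized by rejecting exactly when the first quantity is smaller than the second.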
The optimal classification rule minimizes the classification risk under this loss. Because the statistic defining that rule is strictly increasing in $\mathrm{LIS}^{(c)}_{rl}$, the optimal testing procedure is of the form $\delta = \{ I(\mathrm{LIS}^{(c)}_{rl} < C) : c = 1, \ldots, C;\ r = 1, \ldots, R_c;\ l = 1, \ldots, L_{cr} \}$. Now the question is how to determine the optimal cutoff $C_{\mathrm{opt}}$ for a given FDR level $\alpha$. Note that for a given threshold $C$, the FDR level of RSPLIS is

From the above expression we can see that the group labels $c$ and $r$ are no longer needed and hence are dropped. Suppose the total number of rejections from all groups is $RN$; then, according to the law of large numbers,

It is straightforward to see that we should choose the largest $RN$ such that

Thus we have derived the RSPLIS procedure.
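The pooled thresholding step can be sketched in Python. The sketch assumes the rule has the same form as the pooled step-up rule of PLIS: pool the LIS values from all chromosomes and regions, sort them in increasing order, and reject the largest number $RN$ of hypotheses whose running average LIS does not exceed $\alpha$. The function name `rsplis_reject` and the dictionary layout are hypothetical, and the LIS values are taken as given rather than computed from fitted HMMs.

```python
def rsplis_reject(lis_by_region, alpha):
    """Pooled step-up rule on LIS values (illustrative sketch).

    lis_by_region maps a (chromosome, region) label to a list of LIS values.
    Returns the rejected hypotheses as ((chromosome, region), index) pairs,
    ordered by increasing LIS.
    """
    # Pool all LIS values; group labels are kept only to report rejections.
    pooled = [(lis, key, idx)
              for key, values in lis_by_region.items()
              for idx, lis in enumerate(values)]
    pooled.sort(key=lambda item: item[0])

    running_sum = 0.0
    n_reject = 0
    for i, (lis, _, _) in enumerate(pooled, start=1):
        running_sum += lis
        # Keep the largest i whose running average stays at or below alpha.
        if running_sum / i <= alpha:
            n_reject = i
    return [(key, idx) for _, key, idx in pooled[:n_reject]]
```

Because the values are sorted in increasing order, the running average is nondecreasing, so the last index that satisfies the inequality is also the largest one.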

Proof of Theorem 1
(i) Validity. Let $RN$ be the number of rejections by the RSPLIS procedure. Because this is a pooled analysis, we drop the group labels $c$ and $r$.
The result follows by noting that, for all realizations of $Z$, our RSPLIS procedure guarantees that

(ii) Asymptotic optimality. The asymptotic optimality can be shown without essential difficulty by generalizing the proof of Theorem 6 in Sun and Cai (2009), which treats a single Markov chain. We refer to Sun and Cai (2009) for further technical details.
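For part (i), the usual conditioning argument can be sketched as follows. It rests on assumptions not spelled out in the text above: $\theta_{(i)} = 1$ denotes a disease-associated SNP, $\mathrm{LIS}_{(i)} = P(\theta_{(i)} = 0 \mid Z)$ is the $i$-th smallest pooled LIS value, and $RN$ is chosen as a function of $Z$ so that the average of the $RN$ smallest LIS values is at most $\alpha$.

```latex
\begin{aligned}
\mathrm{FDR}
  &= E\left[ \frac{1}{RN \vee 1} \sum_{i=1}^{RN} \bigl(1 - \theta_{(i)}\bigr) \right]
   = E\left[ \frac{1}{RN \vee 1} \sum_{i=1}^{RN} E\bigl(1 - \theta_{(i)} \mid Z\bigr) \right] \\
  &= E\left[ \frac{1}{RN \vee 1} \sum_{i=1}^{RN} \mathrm{LIS}_{(i)} \right]
   \le \alpha .
\end{aligned}
```

The second equality holds because $RN$ and the ordering are determined by $Z$, and the final inequality holds because the inner average is at most $\alpha$ for every realization of $Z$.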