First, Sun and Cai [14] developed a compound decision-theoretic framework for testing HMM-dependent hypotheses and presented an optimal testing procedure that can be used to analyze a single chromosome for SNP data. Second, Wei et al. [15] proposed a PLIS approach for multiple-chromosome analysis. They showed that under some regularity conditions, the PLIS procedure is valid and asymptotically optimal in the sense that it can control the global chromosome-wise FDR at the nominal level α and has the smallest false non-discovery rate (FNR) among all valid FDR procedures by combining the testing results from different chromosomes. In this section, we extend the chromosome-specific PLIS to the analysis of different chromosomes with multiple regions. In what following, we formulate our statistical model and elaborate the theoretical foundations of our method.
Region-specific multi-HMM with change points and where the number of components known
In this subsection, we assume that we have known change points as well as the number of components in the normal mixture distribution. Let w
c
and m
c
denote the change point set and the number of components for chromosome c. Suppose the case-control genotype data are available from the L
cr
SNPs in the r-th region of chromosome c, c=1,…,C, r=1,…,R
c
. We let θ rl(c)=1 indicate that SNP l from region r of chromosome c is disease-associated and θ rl(c)=0 otherwise. For each SNP, we first obtain a p-value by conducting a χ2-test to assess the association between the allele frequencies and the disease status, then we convert the p-value into a z-value Z rl(c) using the transformations proposed in Wei et al. [15] for further analysis. For chromosome c, let θ(c)={θ rl(c);l=1,…,L
cr
,r=1,…,R
c
} and Z(c)={Z rl(c);l=1,…,L
cr
,r=1,…,R
c
}. In the following, we treat θ(c) as the hidden variables and Z(c) as the observed variables to consider HMMs using some assumptions.
First, we assume that the observed data are conditionally independent given the hidden states for the same region, and that different regions of the same chromosome are independent. Then we have
Furthermore, let , where denotes the observation distribution for each SNP in region r of the chromosome c. For a non-associated SNP, we assume that the z-value distribution is a standard normal for all regions of chromosome c, and for a disease-associated SNP, the z-value distribution is a normal mixture, whereby , , and we assume that the number of components in the normal mixture is identical for all regions of chromosome c. The normal mixture model can approximate a large collection of distributions and has been widely used elsewhere [19-21].
Second, we assume that the hidden states and are independent for the different regions, r and t. For the r-th region of chromosome c, we assume that is distributed as a stationary Markov chain with a transition probability . Let us denote Λ
cr
=(a rij(c)) the transition matrix and the stationary distribution, where and .
Let ϕ
cr
={(μ ri(c),σ ri(c),ξ
i
);i=1,…,m
c
}, then we denote Ψ
cr
=(Λ
cr
,π
cr
,ϕ
cr
) the collection of HMM parameters for r-th region of chromosome c. When w
c
and m
c
are known, the maximum likelihood estimate of the HMM parameters can be obtained using the expectation-maximization (EM) algorithm [14, 22].
Adaptive criterion-based partitioning (ACP) method for finding change points and the number of components
However, in practice, change points and the number of components in the normal mixture distribution are often unknown. In this subsection, for each chromosome, we will give an ACP method to conduct a model selection procedure for simultaneously finding w
c
and m
c
.
Candidate change point set
To effectively reduce the huge space of competing change points and save computation time, our ACP method needs a candidate change point set in advance. Here, we use a haplotype block partition method [23] to obtain the haplotype-block boundary points for each chromosome, which can be collected as the candidate change point set. Because the minimum length value of block L
min
should be pre-specified in their haplotype block partition method, here, we let L
min
be 300 for all our analysis.
Adaptive criterion-based partitioning procedure
Simultaneously inferring m
c
and w
c
can be regarded as a model selection problem. To select a desired model, the commonly used methods are established base on the criterion of minimizing the penalized negative maximum likelihood (e.g. BIC). However, many other existing criterions including BIC, assume that the observations are independent, which is not true in HMM. As a result, the effective sample sizes may be small owing to strong dependence among the observations, and the existing criterions may suffer from a failure of consistency. A data-driven penalized criterion was proposed in the Gaussian and least-squares regression model selection for independent observations [17, 24, 25]. Especially, [25] used this adaptive criterion for variable selection and clustering in Gaussian mixtures model and showed that this adaptive criterion outperforms other criterions (e.g. BIC) for small sample sizes. Following their work, we propose a data-driven penalized criterion for dependent observations in HMM.
Let denote a change point set for chromosome c, |w
c
| be the number of the change points in w
c
, and Ψ
c
={Ψ
cr
;r=1,…,R
c
}, where is the candidate change point set for chromosome c. Then, we consider a penalized maximum likelihood criterion with the following form
(1)
where is the maximum likelihood estimator of the parameters Ψ
c
in HMMs for chromosome c using an EM algorithm, and the penalty function is designed to avoid overfit problems. In this penalty function, λ
c
>0 is a tuning parameter to be chosen depending on sample size and is the number of parameters in the model. Furthermore, in this paper, we have , where is the number of parameters that only depend on m
c
. If we let , this penalty function becomes the penalty function of BIC for HMM.
Given a value of λ
c
in the penalized criterion , we can find and to minimize of equation 1 by running Algorithm 1.
Algorithm 1
In step 1, we need to give pre-specified values of K
max
and m
max
in advance, where K
max
denotes the maximum value of the number of change points for each chromosome, and m
max
is the maximum value of the number of component. As we know, the number of true change points is usually far less than the number of the candidate change points in practical applications, so we can give a smaller value for K
max
to save computation time. For m
max
, Wei et al. [15] suggested that values of between four and six are usually chosen.
In step 2, following the methods of [26] and [27], we provide an optimal partitioning search method for change points to estimate given λ
c
and m
c
, which is, in essence, a dynamic program (DP) algorithm. The detailed procedure about the optimal partitioning search method is shown in Additional file 1.
However, in practice, λ
c
is unknown and needs to be calibrated and estimated from the data themselves. Slope heuristics [24] as well as its generalization, dimension jump method [17], are practical and effective calibration algorithms to estimate the optimal penalty . Here, we propose a sliding window-based dimension jump method to estimate λc,o p t, where the sliding window is used to avoid losing cases involving several successive jumps. When the width of the sliding window is 1, our proposed method becomes the dimension jump method of Wei et al. [17]. The following algorithm describes the detailed procedure for estimating λc,o p t.
Algorithm 2
At the end of Algorithm 2, we can obtain the estimation of the λc,o p t. Having , we can then run Algorithm 1 to obtain and as well as the desired optimal model by minimizing of Equation (1). At the same time, we can get , the estimates of model parameters Ψ
c
based on the optimal model, where c=1,…,C.
Pooled FDR control procedure for different chromosomes with multiple regions
After each chromosome is divided into different regions by change points, it is desirable that that the global region- wide FDR can also be controlled by combining the test results from multiple regions of different chromosomes. In the following, we extend the chromosome-specific PLIS to the RSPLIS and operate the new procedure in three steps:
Step 1. For chromosome c (c=1,…,C), we search the change points to divide the whole chromosome into multiple regions using the ACP method. For each region r, we can get by using the EM algorithm from which we can calculate the plug-in LIS statistic for all regions of each chromosome by using the forward-backward algorithm [28].
Step 2.Combine and rank the plug-in LIS statistics from different regions of multiple chromosomes. Denote by LIS(1),…,LIS(L) the ordered values and H(1),…,H(L) the corresponding hypotheses, where
Step 3.Reject all H(i), i=1,…,l, where
We define FNR as the expected proportion of falsely accepted hypotheses. Under a compound decision-theoretic framework, the following theorem can verify that our RSPLIS is valid and asymptotically optimal. We provide the detailed proof of the theorem in Additional file 1.
Theorem 1
Consider the multi-region HMMs defined in section 'Region-specific multi-HMM with change points and where the number of components known’. Let LIS(1),…,LIS(L) be the ranked LIS values from all the regions of all chromosomes. Then, the RSPLIS procedure controls the global FDR at level α. In addition, the global FNR level of RSPLIS is β∗+o(1), where β∗ is the smallest FNR level among all valid FDR procedures at level α.