Replicability analysis in the framework of multiple testing
In order to state the problem explicitly, we first briefly describe the framework for replicability analysis across two studies in GWAS. Suppose there are m SNPs to be investigated in each study. For the ith study (i=1,2), let \(\left \{H_{i,j}\right \}^{m}_{j=1}\) be the underlying states of the hypotheses, where Hi,j=1 indicates that the jth SNP is associated with the phenotype of interest and Hi,j=0 otherwise. For the jth SNP, we are interested in testing the following null hypothesis
$$\mathcal{H}^{0j}_{NR}: \left(H_{1,j},H_{2,j}\right)\in \left\{(0,0),(1,0),(0,1) \right\}, $$
and we call \(\mathcal {H}^{0j}_{NR}\) the no replicability null hypothesis, which states that the SNP is associated with the phenotype in at most one study. The goal of replicability analysis in GWAS is to discover as many SNPs as possible that are associated with the phenotype in both studies [14]. In this paper, we handle this problem in the framework of multiple testing under dependence, since disease-associated SNPs tend to be clustered and dependent. Specifically, we aim to develop a multiple testing procedure that discovers as many SNPs with replicated associations (i.e. (H1,j,H2,j)=(1,1)) as possible, while controlling the FDR at the pre-specified level. To this end, we define the FDR as follows:
$$\text{FDR}=E\left[\frac{\sum^{m}_{j=1}I_{\left(\left(H_{1,j},H_{2,j}\right)\in \{(0,0),(1,0),(0,1)\}\right)}\delta_{j}}{\sum^{m}_{j=1}\delta_{j}}\right], $$
where δj=1 indicates that the jth SNP is claimed to be associated with the phenotype in both studies and δj=0 otherwise. Correspondingly, the marginal false discovery rate (mFDR) is defined as:
$$\text{mFDR}=\frac{E\left[\sum^{m}_{j=1}I_{\left(\left(H_{1,j},H_{2,j}\right)\in \{(0,0),(1,0),(0,1)\}\right)}\delta_{j}\right]}{E\left[\sum^{m}_{j=1}\delta_{j}\right]}. $$
Since the mFDR is asymptotically equivalent to the FDR in the sense that \(\text {mFDR}=\text {FDR}+O\left (1/\sqrt {m}\right)\) under some mild conditions [32], hereafter, we mainly focus on developing a multiple testing procedure that can control the mFDR at the pre-specified level for replicability analysis.
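For a single data set, the ratio inside the FDR expectation (the false discovery proportion) can be computed directly from the true states and the decisions; a minimal Python sketch, with a function name of our own choosing:

```python
import numpy as np

def false_discovery_proportion(H1, H2, delta):
    """False discovery proportion of replicability calls: a discovery
    (delta_j = 1) is false unless (H1_j, H2_j) = (1, 1)."""
    H1, H2, delta = (np.asarray(a) for a in (H1, H2, delta))
    n_disc = delta.sum()
    if n_disc == 0:
        return 0.0  # no discoveries, so no false discoveries
    n_false = np.sum(delta * (1 - H1 * H2))
    return n_false / n_disc

# Toy truth and decisions: the call at j=1 is false (associated in study 1 only).
fdp = false_discovery_proportion([1, 1, 0, 1], [1, 0, 1, 1], [1, 1, 0, 1])
```

Averaging this quantity over repeated simulations estimates the FDR; the mFDR replaces the ratio of the averages for the average of the ratio.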
The Cartesian hidden Markov model
Let zi,j be the observed z-value of the jth SNP in the ith association study, which can be obtained by an appropriate transformation. Specifically, zi,j=Φ−1(1−pi,j), where Φ−1 is the inverse of the standard normal cumulative distribution function and pi,j is the p-value of the jth SNP in the ith association study, for i=1,2, and j=1,…,m.
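This transformation can be sketched in a few lines (the function name is illustrative; one-sided p-values are assumed, as in the formula above):

```python
import numpy as np
from scipy.stats import norm

def pvalues_to_zvalues(p):
    """z_{i,j} = Phi^{-1}(1 - p_{i,j}): small p-values map to large z-values."""
    return norm.ppf(1.0 - np.asarray(p, dtype=float))

z = pvalues_to_zvalues([0.5, 0.05, 0.001])
```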
The Markov chain, which is an effective tool for modelling the clustered and locally dependent structure among disease-associated SNPs, has been widely used in the literature [21, 22]. We assume that \(\left \{\left (H_{1,j},H_{2,j}\right)\right \}^{m}_{j=1}\) is a four-state stationary, irreducible and aperiodic Markov chain with the transition probability
$$ A_{uv}=P\left(\left(H_{1,j+1},H_{2,j+1}\right)=v|\left(H_{1,j},H_{2,j}\right)=u \right), $$
(1)
where u,v∈{(0,0),(1,0),(0,1),(1,1)}. We further assume that the observed z-values \(\left \{\left (z_{1,j},z_{2,j}\right)\right \}^{m}_{j=1}\) are conditionally independent given the hypotheses states \(\left \{\left (H_{1,j},H_{2,j}\right)\right \}^{m}_{j=1}\), namely,
$$ {{\begin{aligned} P\left(\left\{\left(z_{1,j},z_{2,j}\right)\right\}^{m}_{j=1} | \left\{\left(H_{1,j},H_{2,j}\right)\right\}^{m}_{j=1} \right) =\prod\limits^{m}_{j=1}P\left(z_{1,j}|H_{1,j}\right)\prod\limits^{m}_{j=1}P\left(z_{2,j}|H_{2,j}\right). \end{aligned}}} $$
(2)
The Markov chain \(\left \{\left (H_{1,j},H_{2,j}\right)\right \}^{m}_{j=1}\) with the dependence model (2) is called a Cartesian hidden Markov model (CHMM) [33]. The structure of the CHMM can be intuitively understood from the graphical model shown in Fig. 3.
Following [20–22], we suppose that the corresponding random variable Zi,j follows the two-component mixture model:
$$ Z_{i,j}|H_{i,j}\sim \left(1-H_{i,j}\right)f_{i0}+H_{i,j}f_{i1}, $$
(3)
where fi0 and fi1 are the conditional probability densities of Zi,j given Hi,j=0 and Hi,j=1, respectively. In practice, we usually assume that f10 and f20 are the densities of the standard normal distribution N(0,1), and f11 and f21 are the densities of the normal distributions \(N\left (\mu _{1},\sigma ^{2}_{1}\right)\) and \(N\left (\mu _{2},\sigma ^{2}_{2}\right)\), respectively.
Let π=(π00,π10,π01,π11) be the initial distribution of the four-state Markov chain, where πst=P((H1,1,H2,1)=(s,t)), for s,t=0,1. For convenience, let \(\vartheta =(\pi,\mathcal {A},\mathcal {F})\) denote the parameters of the CHMM, where \(\mathcal {A}=\left \{A_{uv}\right \}_{4\times 4}\) with u,v∈{(0,0),(1,0),(0,1),(1,1)} and \(\mathcal {F}=\left (f_{10},f_{11},f_{20},f_{21}\right)\).
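Under the assumptions above, data from the CHMM can be simulated by first drawing the hidden chain from π and \(\mathcal{A}\) and then drawing the z-values from mixture (3) given the states. A minimal sketch with hypothetical parameter values:

```python
import numpy as np

def sample_chmm(m, pi, A, mu, sigma, seed=0):
    """Sample m positions from the CHMM: the hidden states follow the
    four-state Markov chain, and z-values follow mixture (3) given the states.
    States are ordered (0,0), (1,0), (0,1), (1,1)."""
    rng = np.random.default_rng(seed)
    h = np.empty(m, dtype=int)
    h[0] = rng.choice(4, p=pi)
    for j in range(1, m):
        h[j] = rng.choice(4, p=A[h[j - 1]])       # row h[j-1] of the transition matrix
    H = np.array([(0, 0), (1, 0), (0, 1), (1, 1)])[h]   # (H1, H2) per position
    # Null z-values are N(0, 1); non-null z-values are N(mu_i, sigma_i^2).
    z1 = rng.normal(np.where(H[:, 0] == 1, mu[0], 0.0),
                    np.where(H[:, 0] == 1, sigma[0], 1.0))
    z2 = rng.normal(np.where(H[:, 1] == 1, mu[1], 0.0),
                    np.where(H[:, 1] == 1, sigma[1], 1.0))
    return H, z1, z2

# Hypothetical parameters: uniform initial and transition probabilities.
H, z1, z2 = sample_chmm(500, pi=np.full(4, 0.25), A=np.full((4, 4), 0.25),
                        mu=(3.0, 3.0), sigma=(1.0, 1.0))
```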
The repLIS procedure for replicability analysis
In this section, we develop the multiple testing procedure for replicability analysis by studying the connection between the multiple testing and weighted classification problems. Consider the loss function of the weighted classification problem with respect to replicability analysis as
$$ {\begin{aligned} L_{\lambda} \left(\left\{H_{1,j}\right\}^{m}_{j=1},\left\{H_{2,j}\right\}^{m}_{j=1},\left\{\delta_{j}\right\}^{m}_{j=1} \right) & =\frac{1}{m}\sum\limits^{m}_{j=1}\left\{\lambda\left[\left(1-H_{1,j}\right)\left(1-H_{2,j}\right)\right.\right.\\ & \quad\left.+ H_{1,j}\left(1-H_{2,j}\right)+\left(1-H_{1,j}\right)H_{2,j}\right]\delta_{j} \\ &\quad+ \left. H_{1,j}H_{2,j}(1-\delta_{j})\right\}, \end{aligned}} $$
where λ is the relative cost of a false positive to a false negative, and δj is as defined in the previous section; we call (δ1,…,δm)∈{0,1}m the classification rule for replicability analysis. By some simple derivations, the optimal classification rule, which minimizes the expected loss, is obtained as
$$ \delta_{j}\left(\Lambda_{j},1/\lambda\right)=I_{\left(\Lambda_{j}<1/\lambda\right)}, ~~ \text{for} ~ j=1, \dots, m $$
(4)
where
$$\Lambda_{j}=\frac{P\left(\mathcal{H}^{0j}_{NR} \text{~is ~true~} \big |\{z_{1,i}\}^{m}_{i=1},\{z_{2,i}\}^{m}_{i=1} \right)}{1-P\left(\mathcal{H}^{0j}_{NR} \text{~is ~true~} \big | \{z_{1,i}\}^{m}_{i=1},\{z_{2,i}\}^{m}_{i=1} \right)} $$
is called the optimal classification statistic in the weighted classification problem, and I(·) is an indicator function.
Following the work of [34], it is not difficult to show that the optimal classification statistic is also optimal for replicability analysis in the sense that the multiple testing procedure based on the optimal classification statistics with a suitable cutoff can control the mFDR at the pre-specified level α and has the smallest mFNR among all α-level multiple testing procedures. Since Λj is increasing with \(P\left (\mathcal {H}^{0j}_{NR} \text {~is ~true~} |\left \{z_{1,i}\right \}^{m}_{i=1},\left \{z_{2,i}\right \}^{m}_{i=1}\right)\), we can also define the optimal multiple testing statistic for replicability analysis as
$$ {{\begin{aligned} \text{repLIS}_{j}=P \left(\mathcal{H}^{0j}_{NR} \text{~is ~true~} \big|\left\{z_{1,i}\right\}^{m}_{i=1},\left\{z_{2,i}\right\}^{m}_{i=1} \right), ~~ \text{for}~ j=1, \dots, m. \end{aligned}}} $$
(5)
Denote by repLIS(1), repLIS(2),…,repLIS(m) the ordered repLIS values and \(\mathcal {H}^{0(1)}_{NR}\), \(\mathcal {H}^{0(2)}_{NR}, \dots, \mathcal {H}^{0(m)}_{NR}\) the corresponding no replicability null hypotheses. The repLIS procedure for replicability analysis is:
$$ {{\begin{aligned} \text{{let}~} l=\max \left\{t: \frac{1}{t}\sum\limits^{t}_{j=1} \text{repLIS}_{(j)} \leq\alpha \right\}; ~\text{then~ reject~all~} \mathcal{H}^{0(j)}_{NR}, j=1, \dots, l. \end{aligned}}} $$
(6)
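Procedure (6) is a step-up rule on the ordered repLIS values: reject the l hypotheses with the smallest repLIS values, where l is the largest t whose running mean does not exceed α. It can be sketched as follows (the function name is ours):

```python
import numpy as np

def replis_procedure(replis, alpha):
    """Apply procedure (6): reject the l hypotheses with the smallest repLIS
    values, where l = max{t : mean of the t smallest values <= alpha}.
    Returns a boolean rejection indicator aligned with the input order."""
    replis = np.asarray(replis, dtype=float)
    order = np.argsort(replis)                 # indices sorted by repLIS value
    running_mean = np.cumsum(replis[order]) / np.arange(1, len(replis) + 1)
    passed = np.nonzero(running_mean <= alpha)[0]
    reject = np.zeros(len(replis), dtype=bool)
    if passed.size > 0:
        l = passed[-1] + 1                     # number of rejections
        reject[order[:l]] = True
    return reject

rej = replis_procedure([0.01, 0.5, 0.02, 0.9], alpha=0.05)
```

The running mean of the t smallest repLIS values estimates the FDR incurred by rejecting those t hypotheses, which is why the rule stops at the largest t keeping it below α.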
We note that, to focus on the main ideas, we restrict attention to repLIS for two GWAS studies. Extending repLIS to three or more GWAS studies is formally straightforward, but requires additional computation.
The following theorem shows that the repLIS procedure is asymptotically optimal. The proof of the theorem is outlined in Additional file 2.
Theorem 1
Consider the Cartesian hidden Markov model (1)-(2) and define the testing statistics \(\text {repLIS}_{j}=P\left (\left (H_{1,j},H_{2,j}\right)\in \{(0,0),(1,0),(0,1)\} |\{z_{1,i}\}^{m}_{i=1},\left \{z_{2,i}\right \}^{m}_{i=1}\right)\) for j=1,…,m. Let repLIS(1), repLIS(2),…,repLIS(m) be the ordered repLIS values and \(\mathcal {H}^{0(1)}_{NR}\), \(\mathcal {H}^{0(2)}_{NR}, \dots, \mathcal {H}^{0(m)}_{NR}\) be the corresponding no replicability null hypotheses. Then the repLIS procedure (6) controls the FDR at α. Moreover, the FNR yielded by the repLIS procedure is β∗+o(1), where β∗ is the smallest FNR level among all α-level FDR multiple testing procedures.
The forward-backward algorithm for computing repLIS
When the parameters of the CHMM are known, the repLIS statistics can be calculated by the forward-backward algorithm. Specifically, the repLIS statistic for the jth SNP can be expressed as:
$$\text{repLIS}_{j}=1-\frac{\alpha_{j}(1,1)\beta_{j}(1,1)}{\sum^{1}_{p=0}\sum^{1}_{q=0}\alpha_{j}(p,q)\beta_{j}(p,q)}, $$
where the forward variable \(\alpha _{j}(p,q)=P\left(\left(H_{1,j},H_{2,j}\right)=(p,q),\left\{z_{1,i}\right\}^{j}_{i=1},\left\{z_{2,i}\right\}^{j}_{i=1}\right)\) and the backward variable \(\beta _{j}(p,q)=P\left (\left \{z_{1,i}\right \}^{m}_{i=j+1},\left \{z_{2,i}\right \}^{m}_{i=j+1}|\left (H_{1,j},H_{2,j}\right)=(p,q)\right)\) can be calculated by using the following recursive formulas:
$$\alpha_{j+1}(p,q)=\sum^{1}_{s=0}\sum^{1}_{t=0}\alpha_{j}(s,t)f_{1p}\left(z_{1,j+1}\right)f_{2q}\left(z_{2,j+1}\right)A_{(s,t)(p,q)}, $$
$$\beta_{j}(p,q)=\sum^{1}_{s=0}\sum^{1}_{t=0}\beta_{j+1}(s,t)f_{1s}\left(z_{1,j+1}\right)f_{2t}\left(z_{2,j+1}\right)A_{(p,q)(s,t)}. $$
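The recursions above can be sketched as follows, with rescaling at each step to avoid numerical underflow (a standard implementation detail not spelled out in the text). States are ordered (0,0), (1,0), (0,1), (1,1), and all parameter values in the example are hypothetical:

```python
import numpy as np
from scipy.stats import norm

def replis_forward_backward(z1, z2, pi, A, mu, sigma):
    """Compute repLIS_j = 1 - P((H1j, H2j) = (1,1) | all z-values) for the CHMM.
    pi: initial distribution (4,); A: transition matrix (4, 4);
    mu, sigma: non-null normal parameters (mu_1, mu_2), (sigma_1, sigma_2)."""
    m = len(z1)
    # Per-study emission densities: row 0 is the null N(0,1), row 1 the non-null.
    f1 = np.stack([norm.pdf(z1), norm.pdf(z1, loc=mu[0], scale=sigma[0])])
    f2 = np.stack([norm.pdf(z2), norm.pdf(z2, loc=mu[1], scale=sigma[1])])
    states = np.array([(0, 0), (1, 0), (0, 1), (1, 1)])
    emit = f1[states[:, 0]] * f2[states[:, 1]]      # 4 x m joint emission densities
    alpha = np.zeros((4, m))
    beta = np.ones((4, m))
    c = np.zeros(m)                                 # per-position scaling factors
    alpha[:, 0] = pi * emit[:, 0]
    c[0] = alpha[:, 0].sum()
    alpha[:, 0] /= c[0]
    for j in range(1, m):                           # forward recursion
        alpha[:, j] = (A.T @ alpha[:, j - 1]) * emit[:, j]
        c[j] = alpha[:, j].sum()
        alpha[:, j] /= c[j]
    for j in range(m - 2, -1, -1):                  # backward recursion
        beta[:, j] = A @ (beta[:, j + 1] * emit[:, j + 1]) / c[j + 1]
    post = alpha * beta
    post /= post.sum(axis=0, keepdims=True)         # posterior state probabilities
    return 1.0 - post[3]                            # repLIS = 1 - P((1,1) | Z)

# Toy example: uniform initial/transition probabilities, non-null means mu = 3.
r = replis_forward_backward(
    np.array([0.0, 3.0]), np.array([0.0, 3.0]),
    pi=np.full(4, 0.25), A=np.full((4, 4), 0.25),
    mu=(3.0, 3.0), sigma=(1.0, 1.0))
```

Here the second SNP has large z-values in both studies, so its repLIS value is much smaller than that of the first.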
The EM algorithm for estimating the parameters of CHMM
In practice, the parameters 𝜗 of the CHMM are usually unknown. We therefore use the plug-in statistics \(\widehat {\text {repLIS}}\), obtained by substituting the maximum likelihood estimates for the true parameters, in the replicability analysis. In this section, we provide details of the EM algorithm for estimating the parameters of the CHMM. For simplicity, let \(\sum \limits _{H_{1,*};H_{2,*}}=\sum \limits _{H_{1,1},H_{1,2},\dots,H_{1,m}} \sum \limits _{H_{2,1},H_{2,2},\dots,H_{2,m}}\), \(\mathcal {Z}=\left (\left \{z_{1,j}\right \}^{m}_{j=1},\left \{z_{2,j}\right \}^{m}_{j=1}\right)\) and \(\mathcal {H}=\left (\left \{H_{1,j}\right \}^{m}_{j=1},\left \{H_{2,j}\right \}^{m}_{j=1}\right)\).
The full likelihood can be expressed as:
$$\begin{array}{@{}rcl@{}} L(\vartheta;\mathcal{Z},\mathcal{H})&\,=\,&P_{\vartheta}\left(\left\{z_{1,j}\right\}^{m}_{j=1},\!\left\{z_{2,j}\right\}^{m}_{j=1},\!\left\{H_{1,j}\right\}^{m}_{j=1},\!\left\{H_{2,j}\right\}^{m}_{j=1}\right)\\ &=&P_{\vartheta}\left(H_{1,1},H_{2,1}\right)\prod^{m}_{j=1}f_{1H_{1,j}}\left(z_{1,j}\right)\prod^{m}_{j=1}f_{2H_{2,j}}\left(z_{2,j}\right)\\ &&\times \prod^{m-1}_{j=1}A_{\left(H_{1,j},H_{2,j}\right)\left(H_{1,j+1},H_{2,j+1}\right)}. \end{array} $$
We first initialize the parameters \(\vartheta ^{(0)}= \left (\pi ^{(0)},\mathcal {A}^{(0)},\mathcal {F}^{(0)}\right)\). In the E-step of the tth iteration, we calculate the following Q(𝜗,𝜗(t)) function:
$$\begin{array}{@{}rcl@{}} Q\left(\vartheta,\vartheta^{(t)}\right)&=&\sum\limits_{H_{1,*};H_{2,*}}\log P_{\vartheta}(\mathcal{Z},\mathcal{H})P_{\vartheta^{(t)}}(\mathcal{Z},\mathcal{H})\\ &=&\sum\limits_{H_{1,*};H_{2,*}}\log P_{\vartheta}\left(H_{1,1},H_{2,1}\right)P_{\vartheta^{(t)}}(\mathcal{Z},\mathcal{H})\\ &&+\sum\limits_{H_{1,*};H_{2,*}}\left[\sum^{m}_{j=1}\log \left(f_{1H_{1,j}}\left(z_{1,j}\right)f_{2H_{2,j}}\left(z_{2,j}\right)\right)\right]P_{\vartheta^{(t)}}(\mathcal{Z},\mathcal{H})\\ &&+\sum\limits_{H_{1,*};H_{2,*}}\left[\sum^{m-1}_{j=1}\log A_{\left(H_{1,j},H_{2,j}\right)\left(H_{1,j+1},H_{2,j+1}\right)} \right]P_{\vartheta^{(t)}}(\mathcal{Z},\mathcal{H}). \end{array} $$
In the M-step of the tth iteration, maximizing Q(𝜗,𝜗(t)) yields
$$\vartheta^{(t+1)}=\arg\max\limits_{\vartheta}Q\left(\vartheta,\vartheta^{(t)}\right). $$
Specifically, using the Lagrange multiplier method yields
$$\begin{array}{@{}rcl@{}} \pi^{(t+1)}_{u} &=& P_{\vartheta^{(t)}}\left(\left(H_{1,1},H_{2,1}\right)=u|\mathcal{Z}\right), \\ A^{(t+1)}_{uv} &\,=\,& \frac{\sum^{m-1}_{j=1}P_{\vartheta^{(t)}}\left(\left(H_{1,j},H_{2,j}\right)\,=\,u,\left(H_{1,j+1},H_{2,j+1}\right)\,=\,v|\mathcal{Z}\right)} {\sum^{m-1}_{j=1}P_{\vartheta^{(t)}}\left(\left(H_{1,j},H_{2,j}\right)=u|\mathcal{Z}\right)}, \\ \mu^{(t+1)}_{i} &=& \frac{\sum^{m}_{j=1}z_{i,j}P_{\vartheta^{(t)}}\left(H_{i,j}=1|\mathcal{Z}\right)} {\sum^{m}_{j=1}P_{\vartheta^{(t)}}\left(H_{i,j}=1|\mathcal{Z}\right)}, \\ \sigma^{2(t+1)}_{i} &= &\frac{\sum^{m}_{j=1}\left(z_{i,j}-\mu^{(t+1)}_{i}\right)^{2}P_{\vartheta^{(t)}}\left(H_{i,j}=1|\mathcal{Z}\right)} {\sum^{m}_{j=1}P_{\vartheta^{(t)}}\left(H_{i,j}=1|\mathcal{Z}\right)}, \end{array} $$
for i=1,2 and u,v∈{(0,0),(1,0),(0,1),(1,1)}.
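Assuming the E-step has already produced the posterior marginals gamma and pairwise posteriors xi (obtainable from the forward-backward variables), the M-step updates above can be sketched as:

```python
import numpy as np

def m_step(gamma, xi, z1, z2):
    """One EM M-step for the CHMM, with states ordered (0,0), (1,0), (0,1), (1,1).
    gamma[j, u] = P(state u at position j | Z)       (shape m x 4),
    xi[j, u, v] = P(state u at j, state v at j+1 | Z) (shape (m-1) x 4 x 4)."""
    pi_new = gamma[0]                                         # initial distribution
    A_new = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]  # transition matrix
    # Posterior probability that H_{i,j} = 1 in each study:
    w1 = gamma[:, 1] + gamma[:, 3]   # states (1,0) and (1,1)
    w2 = gamma[:, 2] + gamma[:, 3]   # states (0,1) and (1,1)
    mu = np.array([w1 @ z1 / w1.sum(), w2 @ z2 / w2.sum()])
    var = np.array([w1 @ (z1 - mu[0]) ** 2 / w1.sum(),
                    w2 @ (z2 - mu[1]) ** 2 / w2.sum()])
    return pi_new, A_new, mu, var

# Toy posteriors: uniform over the 4 states at m = 2 positions.
gamma = np.full((2, 4), 0.25)
xi = np.full((1, 4, 4), 1.0 / 16)
pi_new, A_new, mu, var = m_step(gamma, xi, np.array([1.0, 3.0]), np.array([2.0, 4.0]))
```

Iterating the E-step and this M-step to convergence gives the maximum likelihood estimates \(\widehat{\vartheta}\) used by the plug-in \(\widehat{\text{repLIS}}\) statistics.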