 Research
 Open Access
 Published:
Pathwaybased approach using hierarchical components of rare variants to analyze multiple phenotypes
BMC Bioinformatics volume 19, Article number: 79 (2018)
Abstract
Background
As one possible solution to the “missing heritability” problem, many methods have been proposed that apply pathwaybased analyses, using rare variants that are detected by next generation sequencing technology. However, while a number of methods for pathwaybased rarevariant analysis of multiple phenotypes have been proposed, no method considers a unified model that incorporate multiple pathways.
Results
Simulation studies successfully demonstrated advantages of multivariate analysis, compared to univariate analysis, and comparison studies showed the proposed approach to outperform existing methods. Moreover, real data analysis of six type 2 diabetesrelated traits, using largescale whole exome sequencing data, identified significant pathways that were not found by univariate analysis. Furthermore, strong relationships between the identified pathways, and their associated metabolic disorder risk factors, were found via literature search, and one of the identified pathway, was successfully replicated by an analysis with an independent dataset.
Conclusions
Herein, we present a powerful, pathwaybased approach to investigate associations between multiple pathways and multiple phenotypes. By reflecting the natural hierarchy of biological behavior, and considering correlation between pathways and phenotypes, the proposed method is capable of analyzing multiple phenotypes and multiple pathways simultaneously.
Background
In the past decade, genomewide association studies (GWAS) have played a key role in identifying genetic associations between Single Nucleotide Variants (SNVs) and many complex biological pathologies, including type 2 diabetes (T2D), heart disease, and schizophrenia [1,2,3]. However, largescale genetic analyses continue to suffer from incomplete association, of single nucleotide variants (SNVs), with distinct phenotypes (“missing heritability”), and difficulties of biological interpretation [4].
Among many proposed solutions to solve the missing heritability problem, many researchers have focused on “rare variants”. Methods for rare variants analysis arose from extending individual variantlevel approaches to those at the genelevel [5, 6], and extending those at the gene level, to multiple phenotypes [7,8,9].
As the number of publicly available biological resources is increasing, recent methods for analyzing rare variants utilize pathway knowledge as a priori information. Since most biological behaviors manifest from a complex interaction of biological pathways [10, 11], analyzing pathway information for identifying rare variants has several advantages. In contrast to variantlevel analysis, the number of statistical tests is substantially smaller in pathway analysis, resulting in less strict multiple testing corrections. Moreover, since pathways explain curated biological behaviors with multiple genes, it is easier to interpret statistically significant pathways than variant or genelevel analyses. In this respect, many pathwaybased approaches have been proposed especially using the pathway databases, which resulted in improvement of the interpretation of discoveries [12, 13].
Another effort to enhance the power of rare variants is to develop multivariate analysis methods. In general, many complex diseases arise from multiply correlated traits. For example, according to American Diabetes Association guidelines, diabetic status is diagnosed based on four traits: fasting glucose, two hours after plasma glucose, random plasma glucose, and HbA1c [14]. In that regard, simultaneous analysis of those correlated traits offer two substantial advantages over univariate analysis. First, multivariate analysis can elevate statistical power to identify additional causal biomarkers, which are not discovered by single phenotype analysis. Second, by analyzing multiple traits at once, the required number of statistical tests can be reduced, compared to those of univariate analysis. Those advantages have been well documented in past studies of largescale sequencing datasets [15, 16].
There have now been many applications of multivariate analysis to largescale datasets. In particular, for variant and genelevel analysis, many multivariate methods, for common and rare variants, have been proposed [8, 15]. Despite those efforts, only a number of pathwaybased multivariate analyses have been deemed feasible. Recently, three multivariate approaches, for regionlevel analyses, were proposed: MARV, aSPU, and MURAT. MARV [17] uses a statistical approach, reverse regression, to investigate associations between genetic regions and multiple phenotypes, by treating phenotypes as independent variables, hence enabling rapid multivariate analysis of largescale datasets. On the other hand, aSPU [18], extends an original concept, dataadaptive sum of powered score test, to multivariate analysis, using summary statistics from single SNVs. For multivariate extension of powerful genebased tests, MURAT (Multivariate Rarevariant Association Test) extended the original SKAT (sequence kernel association test) method to multiple phenotypes [19]. However, it might not be adequate to apply SKATbased methods to pathwaybased analysis, as we have previously demonstrated [20]. Moreover, none of the above methods are available for multivariate pathwaybased association tests for rare variants with multiple pathways. Since the established pathway databases have substantial overlap among their pathways, they may ignore significant correlations between pathways, leading to misleading biological interpretations [21, 22].
In this report, we introduce a new method, “PHARAOHmulti” (Pathwaybased approach using HierArchical component of collapsed RAre variants Of Highthroughput sequencing data), for analyzing multiple phenotypes. Previously, we proposed a componentbased hierarchical model for analysis of multiple pathways with a single model [20]. Here, while keeping the advantages of our previous approach, we extend it to enable analysis of multiple traits using hierarchical components of genetic variants. In addition, the proposed model can identify associations between multiple phenotypes and multiple pathways, with a single model, in the presence of subsequent genes within pathways, as a hierarchy.
Methods
Exome sequencing dataset for discovery study
To demonstrate the validity of the proposed method for examining largescale datasets with multiple phenotypes, in real (biological) data analysis, we analyzed wholeexome sequencing (WES) data from a Korean population study. In brief, the dataset consists of next generation sequencing of 1087 individuals’ genomes, using the Illumina HiSeq2000 platform (Illumina, Inc., San Diego, CA), selected by the Korean Association REsource (KARE) study [23], as a part of the T2DGENES consortium. For pathwaygene mapping, we retrieved pathway information from MSigDB [24], and mapped the genes to 217, 186 and 674 pathways extracted from the Biocarta, KEGG [25] and Reactome [26], respectively.
Exome chip dataset for replication study
For replication of the identified pathways from the discover study, an independent cohort from Koreans, the Health Examinee shared control study (HEXA), was used. HEXA is a part of the KoGES population based cohort, initiated in 2001 [27]. In total, genotypes of 3445 individuals were acquired using the HumanExome BeadChip v1.1 (Illumina, Inc., San Diego, CA). With same quality control criteria, 24,474 rare variants were used in the analysis.
PHARAOHmulti method
Our ultimate goal was to find an association between Q phenotypes and K pathways, each of whose number of genes was T_{1}, …, T_{ K }, under the presence of distinct parameters for ridge penalization. The proposed method is based on Generalized Structural Component Analysis (GSCA) [28], and an exemplary structure of the model is shown in Fig. 1.
Let Y = [y_{11}…y_{1Q}; …; y_{N1}…y_{ NQ }] be the matrix of phenotypes for N samples, where y_{ iq } is the observation of the i^{th} sample on the q^{th} phenotype, and let X be the matrix of genelevel collapsed variables generated by summing rare variants according to their gene variantgene mapping. Let g_{ ij } ∈ {0, 1, 2} be the number of minor alleles for the j^{th} genetic variant of the i^{th} sample. Regarding the elements of X, x_{ ikt } is a genelevel summary of rare variants which is defined as weighted sum of the i^{th} sample’s rare variants in the t^{th} gene of the k^{th} pathway, denoted by \( {x}_{ikt}={\Sigma}_{j\in {M}_{kt}}{\omega}_j{g}_{ij} \), where M_{ kt } is an index set that defines which rare variants are mapped onto the t^{th} gene in the k^{th} pathway. Several weighting parameters, ω_{ j }, can be used, as previously described in [20]. By imposing two penalty parameters λ_{ G } and λ_{ P } on the genespathway and pathwaysphenotype, we sought to address potential multicollinearity problems, in both genes and pathways, in the proposed method. Such problems may adversely affect the estimation of weights and coefficients. The proposed model assumes that the phenotype, y_{ iq }, arises from the multivariate normal distribution with mean μ and covariance Σ (i = 1,…,Q and j = 1,…,N). Then we define the proposed PHARAOHmulti model as.
Here, \( {f}_{ik}={\sum}_{t=1}^{T_k}{x}_{ik t}{w}_{tk} \) and F_{ i } indicate the i^{th} observation’s score of the k^{th} pathway, and its matrix form across Q phenotypes, respectively. Moreover, \( {\overset{\sim }{\beta}}_q=\left[{\beta}_{0q}\ {\beta}_{1q}\cdots {\beta}_{Kq}\right] \) is a vector of coefficients for the q^{th} phenotype, and \( {\overset{\sim }{\epsilon}}_i=\left[{\epsilon}_{i1}\cdots {\epsilon}_{iQ}\right] \) is a vector of residuals for the i^{th} sample.
Parameter estimation
The proposed model seeks to associate pathways and phenotypes. The effect of the k^{th} pathway, on multiple phenotypes, can be determined by testing all coefficients of the pathways simultaneously (H_{0} : β_{k1} = … = β_{ kQ } = 0).
Moreover, by its nature, the proposed method can further assess three more associations: (1) the effect of a gene on multiple phenotypes conditioned by a given pathway; (2) the effect of a gene on a phenotype conditioned by the pathway; and (3) the effect of a pathway on a phenotype. Detailed characteristics of the proposed model (PHARAOHmulti), including relationships and coefficients, are shown in Table 1.
Let B is a matrix of \( {\overset{\sim }{\beta}}_1,\cdots, {\overset{\sim }{\beta}}_Q \). From the above model, we seek to maximize the penalized loglikelihood function, to estimate the parameters w_{ tk } and β_{ kq }, subject to the conventional scaling constraint \( \sum \limits_{i=1}^N{f}_{ik}^2=N \) [29]. The penalized loglikelihood function is expressed to
where λ_{ G } and λ_{ P } are the penalty parameters for each specific gene and pathway, respectively, and ‖w_{ tk }‖_{2} and ‖β_{ kq }‖_{2} are the ridge penalty function.
We previously introduced an iteratively reweighted least square (IRLS) method to minimize an univariate version of (2) under the presence of ridge penalties [20], which is similar to the alternating regularized leastsquares algorithm [30]. Here we extend the previous algorithm to multivariate analysis. Let R_{ i } be a “columntrimmed” matrix of GSCA [30], defined by F_{ i } ⊗ I_{ K }, where ⊗ is Kronecker product, and I_{ K } is K × K identity matrix. Maximization of (2) in respect of B and W is equivalent to minimizing the following leastsquare functions:
These leastsquare functions are subject to diag(R^{′}R) = NI_{ NQ }, where Φ_{ i } is a columntrimmed matrix of B^{′} ⊗ X_{ i } [30], and vec(·) is a vectorization operator. Then, it can be easily shown that the covariance matrix Σ is not related to the above equations since the PHARAOHmulti model uses multivariate linear model. An estimation of Σ can be done after convergence of B and W, by minimizing the first derivate of (2) with respect to Σ, as:
Similarly, B and W can be updated by equating (3) and (4) to zero. This then gives the estimating equation of B and W as:
Taken together, the overall procedure of the proposed algorithm is as follows:

1.
Let t = 1.

2.
Assign random initial values to W, which are then represented by W_{(0)}.

3.
Calculate F_{(t)}, using W_{(t − 1)}.

4.
Update B_{(t)}, using F_{(t)}.

5.
Update W_{(t)}, using F_{(t)} and B_{(t)}.

6.
Repeat until the sum of the differences W_{(t)} − W_{(t − 1)} + B_{(t)} − B_{(t − 1)} converges the threshold.
Finally, we determine the values of λ_{ G } and λ_{ P }, before applying the parameter estimation procedure. To that end, we can implement kfold crossvalidation (CV) to determine the values of λ_{ G } and λ_{ P }. First, we construct a twodimensional grid of different λ_{ G } and λ_{ P } values. Then we compute the deviance of each model with the given λ_{ G } and λ_{ P }, for all CV fold values. Finally, λ_{ G } and λ_{ P } are selected by their average deviance, which is minimized.
Significance testing
To assess the significance of genes or pathways, resampling methods can be used to test the statistical significance of the estimated effects of all pathways on the phenotype. In the proposed method, we utilize a permutation test to obtain pvalues. By permuting the given phenotype, our method first generates null distributions for both pathways and gene coefficients. By computing the quantile of estimated pathway and gene coefficients, from the nonpermuted dataset in each empirical null distribution, we can obtain an empirical pvalue for any specific pathway and gene.
The testing of joint effects, between multiple phenotypes, is crucial. As shown in Table 1, PHARAOHmulti provides the individual effects of a pathway on each phenotype through β_{ k1 }, ..., β_{ kQ }. The global effect of a pathway, on all phenotypes, can be evaluated by jointly testing β_{ k1 }, ..., β_{ kQ }. Here, we introduce two different schema for determining a joint pvalue for the k^{th} pathway, from multiple phenotypes.
Our first approach was to combine the individual pvalues (referred as “P_K”). Since there are considerations among the estimated coefficients β_{ k1 }, ..., β_{ kQ }, these correlations should be accounted for combining multiple pvalues. Let the pvalues from the k^{th} pathway be denoted by P_{ k1 }, …, P_{ kQ }. The simplest way to combine those pvalues is to use Fisher’s method, which is denoted by \( {\Psi}_k=2{\sum}_{i=1}^Q\log {P}_{ik} \) under the independence assumption. Then, the statistic, Ψ_{ k }, follows the χ^{2} distribution, with the degrees of freedom, 2Q, under the null hypothesis. An extended version of Fisher’s method, Brown’s method, can combine dependent pvalues using a rescaled χ^{2}distribution and covariance of pvalues [31]. However, an analytical computation of the covariance is not feasible for largescale datasets, due to their computational complexity. A solution for this problem [32] introduced an approximation using a thirdorder polynomial for the covariance, denoted by \( \operatorname{cov}\left(2\log {P}_i,2\log {P}_j\right)\approx 3.263{\rho}_{ij}+0.71{\rho}_{ij}^2+0.027{\rho}_{ij}^3 \). To that end, Kost’s approach has been shown to be one of the best working methods for combining pvalues [33]. Here, we adopt Kost’s method by substituting ρ to the empirical correlation of estimated coefficients, β_{ k1 }, ..., β_{ kQ }, and derive the statistic for joint effect between the k^{th} pathway and multiple phenotypes, as follows:
where c_{ k }, d_{ k } and \( {\Phi}_{2{d}_k} \) are the scale parameter, the rescaled degree of freedom, and the cumulative distribution function of χ^{2}, with the degree of freedom 2d_{ k } for the k^{th} pathway, respectively [32].
Our second approach was to construct a single statistic that combines all Q coefficients (referred as “P_M”). Here, we define a Waldtype statistic, T, as below, and utilize T for the following permutation testing scheme:
Then, the estimated covariance \( \mathit{\operatorname{cov}}\left(\widehat{{\overset{\sim }{\beta}}_k}\right) \) can be directly estimated using (6) with equation \( \mathit{\operatorname{cov}}\left(\widehat{\overset{\sim }{\beta }}\right)={\left({F}^{\prime }F+{\lambda}_PI\right)}^{1}{F}^{\prime }F{\left({F}^{\prime }F+{\lambda}_PI\right)}^{1}\otimes \widehat{\Sigma} \) [34], or can be altered by calculating sample covariance of \( {\overset{\sim }{\beta}}_k \), from permutations.
Multiple testing correction
Since the number of pathways is far less than those of genes or genetic variants, the “multiple testing problem” remains. While Bonferroni correction can be a straightforward approach for adjusting for multiple testing, it may impose an adjustment that is too stringent, especially for correlated results [35]. To overcome this issue, we applied two types of multiple testing corrections.
First, PHARAOHmulti corrects pvalues using the Westfall & Young permutation procedure [36], which can be easily adopted, since PHARAOHmulti already uses a permutation scheme. Let T_{(0)} be a vector of the statistics calculated using observed, unpermuted phenotypes, and let T_{(j)} be those from the j^{th} permutation. First, we rank the values of T_{(0)} in ascending order, and let the rank of the k^{th} pathway, and the k^{th} index, be r_{ k } and r_{(k)}, respectfully. Then, for each permutation j = 0, 1, ⋯, J, let \( {T}_{(j)}^{\prime } \) be \( {T}_{(j){r}_{(1)}},\cdots, {T}_{(j){r}_{(K)}} \), to define \( {T}_{(j)}^M \) as a cumulative maximum of \( {T}_{(j)}^{\prime } \). Let I_{j, k} be an indicator function that resolves to 1.0, if \( {T}_{(0){r}_k}^{\prime }<{T}_{(j){r}_k}^M \), or 0.0, if that condition does not hold. The adjusted pvalue for the k^{th} pathway, by the Westfall & Young procedure, is then defined as:
Second, PHARAOHmulti provides False Discovery Rate (FDR) adjustment, by calculating qvalues [37]. Here, we first obtain K as the number of permutation pvalues, and from those, we can derive qvalues, using the BenjaminiHochberg stepup procedure.
Simulation study
To evaluate the performance of the proposed method, we conducted simulation studies, under various scenarios. For generating rare variants, we first produced a pool of genetic variants, using SimRare [38], a rare variant simulator with wellestablished genetic assumptions. A pool was then generated, with default settings and gene lengths of 1Kbp. Next, we generated a simulation dataset of 10 pathways, with 1000 samples, for each replicate. All simulation scenarios were evaluated, using 1000 replicates. Based on the genotypes, the simulated phenotypes were generated by the following model, with an assumption that only the first pathway is causal to the phenotypes:
This is then subject to diag(F^{′}F) = NI_{ K }, where H_{1} is the number of causal genes in the first pathway, and M_{1t} is the number of rare variants in the t^{th} gene of the first pathway (i.e., causal pathway).
In the above model (eq. 11), γ_{1tj} denotes the effect of the j^{th} genetic variant, of the t^{th} gene, set to log_{10}MAF_{ tj }, such that ϵ_{ iq } denotes the residual and follows MVN(0, Σ). In our simulation, the settings q = 1,2, and H_{1} = 1, 2, 5, 10, were used. For each replicate, all rare variants were collapsed into genes.
Results
For the simulation and the analysis of the real dataset, a workstation system with two Intel Xeon E5–2620 CPUs, with a combined RAM of 128GiB, were used. Note that the aSPU and MARV analyses were performed using the default settings, except that aSPU was performed without “genetic variant pruning” capability, as we observed that aSPU raises “unrecoverable error” with that capability. For our proposed method and aSPU, the number of permutations was 5000, to prevent possible lower bound limitation.
Simulation study
For PHARAOHmulti, we selected the tuning parameters λ_{ G } and λ_{ P }, based on threefold CV for each replicate, using twodimensional grids of λ_{ G } and λ_{ P }, with six different starting points of ridge parameters, ranging from 10^{1} to 10^{6} (on a logarithmic base 10 scale). In the simulation, we considered the following conditions: number of effective genes in the causal pathway (H_{1}), genelevel effect (w_{ t1 }), pathwaylevel effect (β_{1q}), and residual correlation (ρ). In our simulation, H_{1}, w_{ t1 }, β, and ρ were assumed to be 1, 2, and 5; 0.1 and 0.2; 0.1, 0.15, and 0.2; and 0, 0.25, 0.5, respectively, with evaluation of their exhaustive combinations. Other parameters, Q, K and T_{ K }, were fixed to 2, 10, and 10, respectively.
We first compared the type 1 error rates of the proposed method vs. the traditional methods. Here, type 1 error rate was computed as a proportion of pvalues for the pathways with no effect, and was less than the significance level, across 1000 replicates of permuted phenotypes. As shown in Fig. 2, we evaluated the type 1 errors using two significance levels, 0.01 and 0.05. As a result, type 1 errors were controlled well in the traditional methods, but PHARAOHmulti showed a moderately deflated type 1 error rate (P_M), while the inflated rate is P_K, when ρ = 0. In contrast, the quantilequantile (QQ) plots in Fig. 3 show no inflation or deflation pattern, in all the methods, except for P_K, with no correlation between phenotypes.
It was also worthwhile to assess the gain of power in the multivariate analysis, as compared to univariate analysis. In this respect, our simulation study was conducted to compare the power gain from multivariate methods, and between multivariate and univariate analyses.
First, we checked whether PHARAOHmulti with multiple phenotypes boosts power compared to PHARAOH with a single phenotype, under the same scenarios of the power simulation. As a result, we observed that the power of PHAROHmulti was at least 2.52 times larger than PHARAOH, and this difference becomes even larger, as w and β increase (data not shown).
Second, we assessed the statistical power of PHARAOHmulti and the compared methods, defined as the proportion of the adjusted qvalue of the simulated causal pathway (the first pathway) being less than the significance threshold, e.g., 0.05. Despite the proposed method supporting the WestfallYoung permutation procedure, it was not considered in the simulation study, due to the absence of corresponding adjustments in the compared methods.
Figures 4 and 5 show comparison results of statistical power simulation from 1000 replications. Each row in the grid of plots represents the same settings of w and β, with different numbers of causal genes in the causal pathway, and each column represents the same number of causal genes, with different effect settings.
In most scenario comparisons, the two proposed statistics obtained by PHARAOHmulti (P_M) and pvalue aggregation (P_K) showed greater power than the other two approaches, aSPU and MARV. However, this did not hold when 50% of the genes were causal for a specific pathway, with effect sizes of w = 0.1 and β = 0.1. In order to investigate whether or not there are significant differences among powers, we performed paired ttests between a pair of methods. In Fig. 4 for the case of w = 0.1, the pvalues were 3 × 10^{− 7} for comparing powers of P_M and aSPU, and 4.2 × 10^{− 6} for comparing those of P_M and MARV. In Fig. 5 for the case of w = 0.2, the same pairwise comparison for the pvalues were 7.3 × 10^{− 7} and 4.2 × 10^{− 6}, respectively. In overall scenarios, powers of P_M were larger up to 18%p compared to aSPU in H_{1} = 5, w = 0.2 and β = 0.2, and were larger up to 83%p compared to MARV. Generally, P_K exhibited smaller power than P_M, and showed comparable or slightly smaller power, than aSPU.
Here, we observed three interesting patterns in the results. First, the proposed P_M and aSPU methods showed lower power, when the proportion of causal genes increase, compared to MARV. Second, the power rapidly increased, as β increased, as shown in Fig. 4. Third, the contribution of w to the power was relatively moderate, compared to β, as shown in corresponding scenarios of Figs. 4 and 5. The reason for the occurrence of those two patterns is that the model(s) generate phenotypes for power simulation, and eq. (11) requires the constraint of the socalled “latent variable,” in GSCA (see Methods). While both PHARAOHmulti and aSPU construct hierarchies of genes and pathways, MARV essentially treats a pathway as a large set of SNVs, since the motivation of MARV is for region vs. pathwaybased tests. The simulation setting and its overall effect on phenotypes is summarized, first at the genelevel, and then by the expression of a linear combination of those genes. In this respect, the results of PHARAOHmulti and aSPU were more plausible than those of MARV, because those two methods more properly reflected the simulation settings.
To confirm the above hypothesis, we performed an additional comparison using the same dataset, except that the phenotypes were generated without the constraint. As shown in Fig. 6, PHARAOHmulti (“P_M”) showed larger power than in the previous simulation, while the power of MARV increased as the number of causal genes increased. However, in contrast to the previous results, the powers of PHARAOHmulti and aSPU also increased.
Finally, we investigated whether or not the statistical power changes by M_{ kt }. For simplicity, we split 1000 simulation datasets into two groups: the first group where the number of variants is small and the second where it is large. Then, we compared the power of each method between two groups using ttest. As a result, the pvalues were 0.097 for aSPU, 0.684 for MARV, and 0.825 for PHARAOHmulti. Thus, we concluded that M_{ kt } is unlikely to affect the simulation result regardless of the methods.
Real data discovery from wholeexome sequencing dataset
To evaluate the practical performance of PHARAOHmulti, we conducted a discovery study using a largescale sequencing dataset. Many studies suggest that the major underlying risk factors for metabolic disorders include high density lipoprotein (HDL), blood pressure (SBP, DBP), waist circumference (WAISTC), fasting glucose (FAST_GLU), and triglycerides (TG). In this regard, we conducted a multivariate analysis of metabolismrelated traits, using a largescale sequencing dataset, obtained from the Type 2 Diabetes Genetic Exploration by Nextgeneration sequencing in multiEthnic Samples (T2DGENES) Consortium, comparing our proposed (PHARAOHmulti) and other common methods. In detail, we analyzed a dataset consisting of 1086 samples selected from the Korean Association REsource (KARE) study [23].
After removal of samples with any missing observations of the aforementioned six phenotypes, we included 1085 samples for analysis. The quality controls with genotype call rates were < 95%, or for HardyWeinberg Equilibrium (HWE) test P < 10^{− 5}, the minor allele frequency was < 5% and the minor allele count was > 2, resulted in 198,761 variants. The final dataset was then mapped to genes, using the human genome19 (hg19) reference genome coordinates, with 10Kbp flanking regions. The gene range of hg19 reference, was extracted from RefSeq track of UCSC Table Browser, as of October 2014. Finally, the genelevel collapsed variable was generated using Workbench for Integrated Superfast Association study with Related Data (WISARD), with betatransformation weighting, as suggested in [5], with the number of genes being 4388.
Next, we compared our multivariate and univariate analysis results, using PHARAOHmulti and PHARAOH. As shown in Figs. 7 and 8, QQ plots of the results showed no substantial inflation or deflation pattern for either the multivariate or univariate results. However, with regard to pathway discovery, the results did show significant differences.
These comparisons clearly support the one advantage of multivariate analysis that we discussed above: elevation of statistical power. As with multivariate analysis, we calculated qvalues for each univariate result. Interestingly, no univariate analysis identified significant pathways, except for SBP with KEGG, which identified three pathways (drug metabolism cytochrome P450, glutathione metabolism and progesteronemediated oocyte maturation). As shown in Table 2, only one pathway, glutathione metabolism, was identified in the univariate analysis, and the qvalues of univariate analyses for pathways identified by multivariate analysis, were not significant.
Second, we compared the result of multivariate analyses, using PHARAOHmulti, aSPU and MARV. As shown in Fig. 8, PHARAOHmulti exhibited generally acceptable pvalue trends, despite the result from KEGG being modestly deflated, due to the optimization of lambda. The QQ plots of MARV look similar to PHARAOHmulti. In contrast, aSPU showed unacceptably inflated patterns of QQ plots, regardless of the pathway databases, which were not used in the simulation study. This could possibly be due to substantial overlap of existing pathway databases.
As shown in Table 2, the multivariate analysis successfully identified eight pathways from three pathway databases, with BenjaminiHochberg qvalue < 0.1. Interestingly, PHARAOHmulti identified glutathionerelated pathways in both KEGG and Reactome pathway databases, which supports the result of PHRAOHmulti. As shown in Fig. 8, the quantilequantile plots of aSPU for the real dataset are highly inflated (i.e., their pvalues are very small). As a result, 57.7% (Reactome), 29.5% (Biocarta) and 71.5% (KEGG) of the tested pathways by aSPU were statistically significant (qvalue < 0.1). Unfortunately, these pathways are highly false positives. In this respect, we included the results of significant pathways identified by either PHARAOHmulti or MARV.
The identified pathways suggested evident relationships with metabolic syndrome. Since the peroxisome pathway elucidates peroxisome biogenesis, which contributes to fatty acid oxidation and biosynthesis of ether lipids, many studies have discussed interrelationship between peroxisomes and metabolic processes [39, 40]. Likewise, identification of the GABA pathway can also be explained by the relationship between GABA and peroxidation, and putative relationship of obesity [41, 42]. Moreover, identification of glutathione metabolism, and its conjugation, explain that PHARAOHmulti successfully captured a key process of metabolic disorders [43]. Finally, another report suggested a putative role of adhesion molecules in metabolic diseases, as explained by “cell2cell” pathway [44]. For the two pathways identified by MARV, we found that Caspase pathway has been known to be related to metabolic stress or perturbation [45], but no evidence for D4GDI pathway was found.
Replication study using independent exome chip dataset
We conducted a replication study using exome chip dataset from an independent cohort, using the identified pathways in the discovery study. Despite the insufficiency of detected variants in the exome chip dataset, as a result, we successfully replicated two pathways with pvalue < 0.1, the peroxisome pathway in KEGG (p = 0.059) and cell2cell pathway in Biocarta (p = 0.093). As shown in the literature search, the two pathways we replicated have strong relationships with metabolic disorders.
Discussion
Compared to univariate approaches, which analyze each phenotype individually, our real data analysis successfully demonstrated that the multivariate approach could identify pathways commonly associated with specific traits. It is important to construct a systematic analysis that considers the correlation between complex diseases and their underlying biological traits. In addition, our results from two wellestablished pathway databases were strongly supported by many existing publications, thus demonstrating the advantage of our proposed approach.
Compared to existing multivariate analysis methods, PHARAOHmulti features several advantages. Firstly, by constructing a hierarchical structure of genespathwaysphenotypes, four types of associations (genesingle phenotype, genemultiple phenotypes, pathwaysingle phenotype, and pathwaymultiple phenotypes) can be estimated simultaneously. Compared to our proposed method, existing methods of multivariate analysis were limited to genelevel analysis, and hence, the combinatorial effect of multiple genes, via biological pathways, was impossible to estimate. In addition, the proposed method considers the correlation between genes, pathways, and phenotypes, by imposing penalty parameters on the estimation procedure.
Secondly, PHARAOHmulti provides multiple options for correcting for the multiple testing issue. Although Bonferroni correction is simple, and powerfully controls type 1 error, it is a wellknown fact that the Bonferroni correction often results in controls that are too stringent, when the tests are correlated. Under such conditions, application of the WestfallYoung permutation procedure can be an appropriate alternative, since its asymptotic optimality under dependence is known [35]. In this respect, the proposed method has the advantage of identifying causal pathways, by considering correlation among pathways.
For analysis times of both simulation and real datasets, MARV was the fastest among all the methods, while PHARAOHmulti ran slightly faster than aSPU. For example, in the analysis of simulation dataset of 100 genes with 1000 samples, the running times of MARV, PHARAOHmulti, and SPU were 13, 67 and 235 s, respectively. The trends of execution time were consistent regardless of simulation parameters or datasets. However, PHARAOHmulti can be further accelerated with multithreading which is not supported by MARV and aSPU. With multithreading of 8 threads, the analysis time of PHARAOHmulti was reduced to 12 s.
At this point, there are a number of subjects we can consider for future research. Our current analysis is limited only to Korean population. In our future study, we apply our method to the whole data of 13,000 WES dataset of T2DGENES consortium [46] which contains our KARE samples. It would be a challenging work to identify novel pathways across multiple populations. For the methodological aspect, our approach uses genelevel collapsing of multiple rare variants. Although the collapsing method has the advantage that the analysis of very rare variants is possible, it cancels out the effects of variants with opposite direction (e.g, gene upregulation vs. downregulation). Despite such limitations, our method showed great potential in identifying causal genetic structure in the real data analysis. However, further research, on a more sophisticated approach that can consider the effect direction of variants, is needed. Moreover, we plan to improve our proposed multivariate analysis by applying Generalized Estimating Equations (GEE) or Linear Mixed Model (LMM). Our method can be extended to prediction models, rather than association tests, using other types of penalization, such as LASSO or SCAD [47]. Lastly, our method can also be extended to pathway interaction analysis that has been commonly performed in gene expression data analysis [48].
Conclusion
In this study, we proposed a novel statistical approach for multivariate pathwaybased analysis of rare variants, from largescale sequencing datasets. Analyses of multiple phenotypes have been successful in analyzing various complex diseases, including type2 diabetes (T2D) or hypertension. In general, curated guidelines suggest diagnosing T2D according to traits observed in the individual. Consequently, incorporating multiple correlated traits, to be investigated for association with specific diseases, via multivariate analysis, elevates the statistical power. In this respect, our simulation study reflects the relationship between diseases and their related traits. Throughout the simulation study, PHARAOHmulti outperformed existing multivariate methods. In addition, our proposed method successfully demonstrated several advantages of multivariate analysis, including significantly improving the detection power of causal pathways, as compared to univariate analysis, while also retaining detection power for the individual phenotype. Moreover, we successfully demonstrated that the proposed method is capable of identifying plausible pathways in the real dataset, by identifying eight pathways in the discovery study, and replicating two pathways in the replication study. We firmly believe that the proposed method will assist researchers in understanding the genetic structures that underlie many complex diseases.
References
 1.
Schizophrenia Working Group of the Psychiatric Genomics C. Biological insights from 108 schizophreniaassociated genetic loci. Nature. 2014;511(7510):421–7.
 2.
Lettre G, Palmer CD, Young T, Ejebe KG, Allayee H, Benjamin EJ, Bennett F, Bowden DW, Chakravarti A, Dreisbach A, et al. Genomewide association study of coronary heart disease and its risk factors in 8,090 African Americans: the NHLBI CARe project. PLoS Genet. 2011;7(2):e1001300.
 3.
McCarthy MI, Zeggini E. Genomewide association studies in type 2 diabetes. Curr Diab Rep. 2009;9(2):164–71.
 4.
Manolio TA, Collins FS, Cox NJ, Goldstein DB, Hindorff LA, Hunter DJ, McCarthy MI, Ramos EM, Cardon LR, Chakravarti A, et al. Finding the missing heritability of complex diseases. Nature. 2009;461(7265):747–53.
 5.
Wu MC, Lee S, Cai T, Li Y, Boehnke M, Lin X. Rarevariant association testing for sequencing data with the sequence kernel association test. Am J Hum Genet. 2011;89(1):82–93.
 6.
Choi S, Lee S, Cichon S, Nothen MM, Lange C, Park T, Won S. FARVAT: a familybased rare variant association test. Bioinformatics. 2014;30(22):3197–205.
 7.
Wang L, Lee S, Gim J, Qiao D, Cho M, Elston RC, Silverman EK, Won S. Familybased rare variant association analysis: a fast and efficient method of multivariate phenotype association analysis. Genet Epidemiol. 2016;40(6):502–11.
 8.
Zhou X, Stephens M. Efficient multivariate linear mixed model algorithms for genomewide association studies. Nat Methods. 2014;11(4):407–9.
 9.
Lee S, Won S, Kim YJ, Kim Y, Consortium TDG, Kim BJ, Park T. Rare variant association test with multiple phenotypes. Genet Epidemiol. 2016;
 10.
Costanzo M, Baryshnikova A, Bellay J, Kim Y, Spear ED, Sevier CS, Ding H, Koh JL, Toufighi K, Mostafavi S, et al. The genetic landscape of a cell. Science. 2010;327(5964):425–31.
 11.
Hirschhorn JN. Genomewide association studiesilluminating biologic pathways. N Engl J Med. 2009;360(17):1699–701.
 12.
Lee JH, Zhao XM, Yoon I, Lee JY, Kwon NH, Wang YY, Lee KM, Lee MJ, Kim J, Moon HG, et al. Integrative analysis of mutational and transcriptional profiles reveals driver mutations of metastatic breast cancers. Cell Discov. 2016;2:16025.
 13.
Zhao XM, Liu KQ, Zhu G, He F, Duval B, Richer JM, Huang DS, Jiang CJ, Hao JK, Chen L. Identifying cancerrelated microRNAs based on gene expression data. Bioinformatics. 2015;31(8):1226–34.
 14.
American Diabetes A. Diagnosis and classification of diabetes mellitus. Diabetes Care. 2014;37(Suppl 1):S81–90.
 15.
O'Reilly PF, Hoggart CJ, Pomyen Y, Calboli FC, Elliott P, Jarvelin MR, Coin LJ. MultiPhen: joint model of multiple phenotypes can increase discovery in GWAS. PLoS One. 2012;7(5):e34861.
 16.
Yang Q, Wang Y. Methods for analyzing multivariate phenotypes in genetic association studies. J Probab Stat. 2012;2012:652569.
 17.
Kaakinen M, Magi R, Fischer K, Heikkinen J, Jarvelin MR, Morris AP, Prokopenko I. A rarevariant test for highdimensional data. Eur J Hum Genet. 2017;
 18.
Kwak IY, Pan W. Adaptive gene and pathwaytrait association testing with GWAS summary statistics. Bioinformatics. 2016;32(8):1178–84.
 19.
Sun J, Oualkacha K, Forgetta V, Zheng HF, Brent Richards J, Ciampi A, Greenwood CM, Consortium UK. A method for analyzing multiple continuous phenotypes in rare variant association studies allowing for flexible correlations in variant effects. Eur J Hum Genet. 2016;24(9):1344–51.
 20.
Lee S, Choi S, Kim YJ, Kim BJ, Consortium TdG, Hwang H, Park T: Pathwaybased approach using hierarchical components of collapsed rare variants. Bioinformatics 2016, 32(17):i586i594.
 21.
Alexa A, Rahnenfuhrer J, Lengauer T. Improved scoring of functional groups from gene expression data by decorrelating GO graph structure. Bioinformatics. 2006;22(13):1600–7.
 22.
Skarman A, Shariati M, Jans L, Jiang L, Sorensen P. A Bayesian variable selection procedure to rank overlapping gene sets. BMC bioinformatics. 2012;13:73.
 23.
Cho YS, Go MJ, Kim YJ, Heo JY, Oh JH, Ban HJ, Yoon D, Lee MH, Kim DJ, Park M, et al. A largescale genomewide association study of Asian populations uncovers genetic factors influencing eight quantitative traits. Nat Genet. 2009;41(5):527–34.
 24.
Liberzon A, Subramanian A, Pinchback R, Thorvaldsdottir H, Tamayo P, Mesirov JP. Molecular signatures database (MSigDB) 3.0. Bioinformatics. 2011;27(12):1739–40.
 25.
Kanehisa M, Goto S, Sato Y, Furumichi M, Tanabe M. KEGG for integration and interpretation of largescale molecular data sets. Nucleic Acids Res. 2012;40(Database issue):D109–14.
 26.
Fabregat A, Sidiropoulos K, Garapati P, Gillespie M, Hausmann K, Haw R, Jassal B, Jupe S, Korninger F, McKay S, et al. The Reactome pathway knowledgebase. Nucleic Acids Res. 2016;44(D1):D481–7.
 27.
Kim YJ, Go MJ, Hu C, Hong CB, Kim YK, Lee JY, Hwang JY, Oh JH, Kim DJ, Kim NH, et al. Largescale genomewide association studies in east Asians identify new genetic loci influencing metabolic traits. Nat Genet. 2011;43(10):990–5.
 28.
Hwang H, Takane Y. Generalized structured component analysis. Psychometrika. 2004;69(1):81–99.
 29.
Takane Y, Hwang H. An extended redundancy analysis and its applications to two practical examples. Computational Statistics & Data Analysis. 2005;49(3):785–808.
 30.
Hwang H. Regularized generalized structured component analysis. Psychometrika. 2009;74(3):517–30.
 31.
Brown MB. 400: a method for combining nonindependent, onesided tests of significance. Biometrics. 1975;31(4):987–92.
 32.
Kost JT, McDermott MP. Combining dependent Pvalues. Stat Probabil Lett. 2002;60(2):183–90.
 33.
Alves G, Yu YK. Accuracy evaluation of the unified Pvalue from combining correlated Pvalues. PLoS One. 2014;9(3):e91225.
 34.
Hoerl AE, Kennard RW. Ridge Regression  Biased Estimation for Nonorthogonal Problems. Technometrics. 1970;12(1):55&.
 35.
Meinshausen N, Maathuis MH, Bühlmann P. Asymptotic optimality of the Westfall–young permutation procedure for multiple testing under dependence. Ann Stat. 2011;39(6):3369–91.
 36.
Westfall PH, Young SS. Resamplingbased multiple testing : examples and methods for Pvalue adjustment. New York: Wiley. 1993;
 37.
Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc Ser B Methodol. 1995;57(1):289–300.
 38.
Li B, Wang G, Leal SM. SimRare: a program to generate and analyze sequencebased data for association studies of quantitative and qualitative traits. Bioinformatics. 2012;28(20):2703–4.
 39.
Azhar S. Peroxisome proliferatoractivated receptors, metabolic syndrome and cardiovascular disease. Futur Cardiol. 2010;6(5):657–91.
 40.
Hall D, Poussin C, Velagapudi VR, Empsen C, Joffraud M, Beckmann JS, Geerts AE, Ravussin Y, Ibberson M, Oresic M, et al. Peroxisomal and microsomal lipid pathways associated with resistance to hepatic steatosis and reduced proinflammatory state. J Biol Chem. 2010;285(40):31011–23.
 41.
Deng Y, Xu L, Zeng X, Li Z, Qin B, He N. New perspective of GABA as an inhibitor of formation of advanced lipoxidation endproducts: it's interaction with malondiadehyde. J Biomed Nanotechnol. 2010;6(4):318–24.
 42.
Ma YH, Hu JH, Zhou XG, Zeng RW, Mei ZT, Fei J, Guo LH. Transgenic mice overexpressing gammaaminobutyric acid transporter subtype I develop obesity. Cell Res. 2000;10(4):303–10.
 43.
Wu G, Fang YZ, Yang S, Lupton JR, Turner ND. Glutathione metabolism and its implications for health. J Nutr. 2004;134(3):489–92.
 44.
Wagner O, Jilma B. Putative role of adhesion molecules in metabolic disorders. Horm Metab Res. 1997;29(12):627–30.
 45.
McIlwain DR, Berger T, Mak TW. Caspase functions in cell death and disease. Cold Spring Harb Perspect Biol. 2015;7(4)
 46.
Fuchsberger C, Flannick J, Teslovich TM, Mahajan A, Agarwala V, Gaulton KJ, Ma C, Fontanillas P, Moutsianas L, McCarthy DJ, et al. The genetic architecture of type 2 diabetes. Nature. 2016;536(7614):41–7.
 47.
Fan J, Li R. Variable selection via nonconcave penalized likelihood and its Oracle properties. J Am Stat Assoc. 2001;96(456):1348–60.
 48.
Liu KQ, Liu ZP, Hao JK, Chen L, Zhao XM. Identifying dysregulated pathways in cancers from pathway interaction networks. BMC Bioinformatics. 2012;13:126.
Acknowledgements
Not applicable.
Funding
This research was supported by grants of the Korea Health Technology R&D Project through the Korea Health Industry Development Institute (KHIDI), funded by the Ministry of Health & Welfare, Republic of Korea (HI15C2165, HI16C2037), supported by the BioSynergy Research Project (2013M3A9C4078158) of the Ministry of Science, ICT and Future Planning through the National Research Foundation. The publication of this article was sponsored by the BioSynergy Research Project.
Availability of data and materials
An implementation of PHARAOHmulti can be downloaded from the website (http://statgen.snu.ac.kr/software/pharaohmulti). The KARE exome sequencing dataset is a part of T2DGENES consortium, and is available upon approval of T2DGENES project committee. The HEXA exome chip dataset is a part of KoGES, and is available upon approval of the genome center in Korea National Institute of Health.
About this supplement
This article has been published as part of BMC Bioinformatics Volume 19 Supplement 4, 2018: Selected articles from the 16th Asia Pacific Bioinformatics Conference (APBC 2018): bioinformatics. The full contents of the supplement are available online at https://bmcbioinformatics.biomedcentral.com/articles/supplements/volume18supplement17.
Author information
Affiliations
Contributions
SL performed all analyses and developed the software implementation. SL and TP conducted the entire study, developed the methodology, and wrote the manuscript. YK and SC helped with the performing of analyses. HH helped developing the methodology. All of the authors have read and approved of the final manuscript.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
KARE and HEXA dataset are a part of Korean Genome Epidemiology Study (KoGES). In this study, the exome sequencing data from 1037 samples of KARE study and the exome chip data from 3445 samples of HEXA study were used. KARE dataset was used under the partnership of T2DGENES. All participants of KARE and HEXA studies provided written informed consent. The study using KARE and HEXA samples was approved by two independent institutional review boards at Seoul National University.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Lee, S., Kim, Y., Choi, S. et al. Pathwaybased approach using hierarchical components of rare variants to analyze multiple phenotypes. BMC Bioinformatics 19, 79 (2018). https://doi.org/10.1186/s1285901820669
Published:
DOI: https://doi.org/10.1186/s1285901820669
Keywords
 Pathwaybased analysis
 Nextgeneration sequencing data
 Multivariate analysis
 Generalized structured component analysis
 Hierarchical analysis