Identifying novel associations in GWAS by hierarchical Bayesian latent variable detection of differentially misclassified phenotypes

Background Heterogeneity in the definition and measurement of complex diseases in Genome-Wide Association Studies (GWAS) may lead to misdiagnoses and misclassification errors that can significantly impact discovery of disease loci. While well appreciated, almost all analyses of GWAS data consider reported disease phenotype values as is without accounting for potential misclassification. Results Here, we introduce Phenotype Latent variable Extraction of disease misdiagnosis (PheLEx), a GWAS analysis framework that learns and corrects misclassified phenotypes using structured genotype associations within a dataset. PheLEx consists of a hierarchical Bayesian latent variable model, where inference of differential misclassification is accomplished using filtered genotypes while implementing a full mixed model to account for population structure and genetic relatedness in study populations. Through simulations, we show that the PheLEx framework dramatically improves recovery of the correct disease state when considering realistic allele effect sizes compared to existing methodologies designed for Bayesian recovery of disease phenotypes. We also demonstrate the potential of PheLEx for extracting new potential loci from existing GWAS data by analyzing bipolar disorder and epilepsy phenotypes available from the UK Biobank. From the PheLEx analysis of these data, we identified new candidate disease loci not previously reported for these datasets that have value for supplemental hypothesis generation. Conclusion PheLEx shows promise in reanalyzing GWAS datasets to provide supplemental candidate loci that are ignored by traditional GWAS analysis methodologies.


Text S3. Effect of differential thresholding on performance of PheLEx and Rekaya in simulations
In contrast to earlier assumptions of using all genotypes in the dataset as input to misclassification extraction methods [1][2][3], filtering of genotypes processed as input (training SNPs) was proposed in this study. The p-value cut-off used to filter training SNPs was p < 10 -6.3 for performance results stated above and in the results section of the main paper. Once the input was processed by PheLEx and Rekaya respectively, a misclassification probability threshold t was used as cut-off on the average misclassification probabilities (for cases in the phenotype) returned as output. For performance results shown above (and results in the main paper), this threshold t was set to be the 99 th percentile of misclassification probabilities, where any sample (case in phenotype) having misclassification probability greater than t was marked as misclassified and switched from case to control in the PheLEx corrected phenotypes produced. Effect of using differential threshold values was examined on (i) the p-value cut-off (p < 10 -6.3 , p < 10 -5 , and p < 10 -4 ) used to filter training SNPs, and (ii) the misclassification probability threshold t (99 th percentile, 95 th percentile, 90 th percentile, 85 th percentile, 80 th percentile, and 75 th percentile) used to determine misclassified samples by misclassification extraction methods.
Impact of misclassification probability threshold t on GWAS performance was also investigated (Fig. S6). Using differential threshold t values (99 th percentile, 95 th percentile, 90 th percentile, 85 th percentile, 80 th percentile, and 75 th percentile) on average misclassification probabilities produced by PheLEx (with p < 10 -6.3 p-value cut-off on training SNPs), PheLEx corrected phenotypes were produced at each threshold t value across simulations with increasing misclassification rates. Association analyses with the resulting corrected phenotypes at different threshold t values were used to compute p-values and used to quantify AUC ROC and AUC PR values in detecting diseaseassociated SNPs. AUC ROC and AUC PR values in detecting disease-associated SNPs were computed across different threshold t values for simulated true phenotypes (no misclassification), misclassified phenotypes, and PheLEx corrected phenotypes for comparison. Across threshold t values, with increasing misclassification AUC ROC and AUC PR values in detecting diseaseassociated SNPs decreased for all phenotypes. At higher misclassification rates, PheLEx corrected phenotypes consistently improved on the AUC ROC and AUC PR values for misclassified phenotypes across threshold t values (85 th percentile to 99 th percentile), however this improvement was not observed at less stringent threshold t values (75 th percentile and 80 th percentile). Gradual decrease in AUC ROC and AUC PR values in detecting disease-associated SNPs with decreasing t values can be attributed to the lower precision (Precision = fraction of samples correctly identified as misclassified from total samples identified as misclassified using PheLEx at threshold t) in correcting phenotypes at lower values of t that result in larger number of true cases switched to controls (Fig. S7), thus lowering the improvement that is observed at more stringent threshold t values (that have higher precision of correctly identifying misclassified samples). These results highlight the importance of using stringent threshold on computed misclassification probabilities to determine misclassified samples.

Text S4. Algorithm details for Rekaya, PheLEx-mh, and PheLEx-m
The major existing misclassification framework designed for the analysis of GWAS data is described in three papers [49,73,77]. Based on the authorship of these papers, we have designated this method as "Rekaya". Along with comparing PheLEx to the algorithm Rekaya (as described in these publications) we also implemented two additional versions of PheLEx to provide a fair comparison when separately assessing the impact of the two major differences between PheLEx and Rekaya: (i) PheLEx includes an Adaptive Metropolis-Hastings step in the MCMC algorithm not present in Rekaya Gibbs sampler and (ii) PheLEx includes a full mixed model that can account for genetic relatedness/population structure. We have named these two variants of PheLEx as follows: (i) PheLEx-mh (PheLEx -/minus Metropolis-Hastings): This method removed Adaptive Metropolis-Hastings (mh) step from PheLEx and incorporated the full Gibbs sampler from existing method by Rekaya et al [1] resulting in a full Gibbs sampler with a mixed model (ii) PheLEx-mm (PheLEx -/minus mixed model): This method excluded mixed model from the method PheLEx, resulting in an Adaptive Metropolis-Hastings within Gibbs sampler that doesn't account for mixed effects due to genetic relatedness/population structure.
In the following subsections, we provide details of the implementation of the MCMC algorithms described for the Rekaya framework [49, 73, 77] and the MCMCs for PheLEx-mh and PheLExmm.

Rekaya algorithm
The following steps were used from the Bayesian threshold method as described by Rekaya [1]. This method is referred to as Rekaya in simulation results. Distribution is truncated for : > 0 : = 1and : <= 0 iii.

PheLEx-mh algorithm
Comparison between PheLEx and PheLEx-mh (PheLEx -/minus Metropolis-Hastings) was used to determine improvement due to use of Adaptive Metropolis-Hastings within Gibbs instead of Gibbs Sampling, and between PheLEx-mh and Rekaya was used to determine improvement when accounting for genetic relatedness. PheLEx-mh was therefore designed to include an underlying mixed model but was implemented without the Adaptive Metropolis-Hastings step within the Gibbs Sampling algorithm, an approach which also provides a fair comparison to existing methods proposed by Rekaya et al [1][2][3]. Mixed effects due to genetic relatedness/population structure were inferred using conditional probability distributions as described in previous literature [7,8]. The following steps were used:

PheLEx-mm Algorithm
This algorithm was designed to exclude an underlying mixed model from PheLEx, where comparison between PheLEx-mm (PheLEx -/minus mixed model) and Rekaya [1][2][3] was used to determine improvement due to Adaptive Metropolis-Hastings algorithm over Gibbs Sampling without considering effects due to accounting for genetic relatedness/population structure.
Given observed phenotype Y for n individuals and genotypes matrix X for j SNPs where • € is a genotypes vector for SNP j

Repeat from (2) for 99,999 iterations
Variance for jumping distribution of effect sizes is adjusted across iterations to maintain acceptance ratio for MCMC chains around 0.2 using method defined previously [9]. Acceptance rate of 0.2 was used as recommended by Gelman et al [10]. The algorithm was run on each dataset for 100,000 iterations with burn-in of 20,000 iterations. At each iteration, estimates for each parameter are used to calculate misclassification probability for each sample which is then used to identify misclassified samples. Prior probabilities on parameters are defined as:

Text S5. Additional details for analysis of UK Biobank bipolar disorder and epilepsy phenotypes using PheLEx
To ensure reliability of results for UK Biobank phenotypes, a more conservative approach was employed for analyses of these phenotypes using PheLEx. Association analysis was performed for each original disease phenotype and a p-value cut-off used on the resulting unadjusted p-values to filter training SNPs. Using differential p-value cut-offs on training SNPs, number of resulting potential training SNPs were calculated (Fig. S8). As simulation results suggested a conservative threshold, p-value cut-off of 10 -5 was selected to prune training SNPs used for both bipolar disorder and epilepsy phenotypes. Using training SNPs per phenotype and phenotype data, PheLEx was used to analyze the original disease phenotypes and produce misclassification probabilities. Ten sets of misclassification probabilities were produced by analyzing the data using PheLEx with ten independent chains. All samples identified as misclassified using misclassification probability threshold t (values: 99 th percentile, 95 th percentile, and 90 th percentile) were listed. Any sample marked as misclassified in at least g = 2, 3, 4, 5, and 6 out of ten sets of misclassification probabilities (produced by PheLEx) were switched from cases to controls for that phenotype. Table  S1 lists results from both bipolar disorder and epilepsy phenotypes with threshold t values, g values (number of misclassification probability sets where sample was consistently marked as misclassified), and the resulting number of cases that were identified as misclassified and switched to controls for that phenotype. Association analysis was performed for each corrected UK Biobank phenotype produced using this methodology and PheLEx discoveries were identified (Table S1). PheLEx discoveries were identified using the same criteria as defined in the main text: PheLEx discoveries were (i) differentially significant SNPs in corrected vs. original UK Biobank phenotype using adjusted p-values < 0.1 as cut-off and (ii) not in LD with training SNPs (r 2 < k, k ~ 1e -2 ) used for that phenotype. As simulation results showed increased accuracy with stringent thresholds (t and g), the most stringent threshold values that resulted in identification of PheLEx discoveries were selected for each phenotype. For PheLEx corrected bipolar disorder phenotype, parameters selected were misclassification probability threshold t = 95 th percentile and g = 4 misclassification probability sets. For PheLEx corrected epilepsy phenotype, parameters selected were misclassification probability threshold t = 95 th percentile and g = 2 misclassification probability sets. Although additional loci were identified for less stringent threshold t and g values, a conservative approach was applied and is recommended for future analyses using PheLEx. Results for PheLEx discoveries at selected parameter values are discussed in the main text. Figure S1. Performance analysis of simulations without genetic relatedness/population structure. (A-E) Mean Precision (y-axis) over Recall (x-axis) over 100 simulated datasets without genetic relatedness/population structure is shown across increasing misclassification (5%, 10%, 20%, 30%, and 40%). Misclassification extraction methods: (i) PheLEx-mm (dark blue) represents PheLEx without mixed model, (ii) Rekaya with PheLEx input (dark gray) represents methodology implemented by Rekaya et al.        Table S1. Effect of differential thresholding of misclassification probabilities on analyses of PheLEx corrected UK Biobank phenotypes