Estimating colocalization probability from limited summary statistics

Background Colocalization is a statistical method used in genetics to determine whether the same variant is causal for multiple phenotypes, for example, complex traits and gene expression. It provides stronger mechanistic evidence than shared significance, which can be produced through separate causal variants in linkage disequilibrium. Current colocalization methods require full summary statistics for both traits, limiting their use with the majority of reported GWAS associations (e.g. GWAS Catalog). We propose a new approximation to the popular coloc method that can be applied when limited summary statistics are available. Our method (POint EstiMation of Colocalization, POEMColoc) imputes missing summary statistics for one or both traits using LD structure in a reference panel, and performs colocalization using the imputed summary statistics. Results We evaluate the performance of POEMColoc using real (UK Biobank phenotypes and GTEx eQTL) and simulated datasets. We show good correlation between posterior probabilities of colocalization computed from imputed and observed datasets and similar accuracy in simulation. We evaluate scenarios that might reduce performance and show that multiple independent causal variants in a region and imputation from a limited subset of typed variants have a larger effect while mismatched ancestry in the reference panel has a modest effect. Further, we find that POEMColoc is a better approximation of coloc when the imputed association statistics are from a well powered study (e.g., relatively larger sample size or effect size). Applying POEMColoc to estimate colocalization of GWAS Catalog entries and GTEx eQTL, we find evidence for colocalization of 150,000 trait-gene-tissue triplets. Conclusions We find that colocalization analysis performed with full summary statistics can be closely approximated when only the summary statistics of the top SNP are available for one or both traits. When applied to the full GWAS Catalog and GTEx eQTL, we find that colocalized trait-gene pairs are enriched in tissues relevant to disease etiology and for matches to approved drug mechanisms. POEMColoc R package is available at https://github.com/AbbVie-ComputationalGenomics/POEMColoc. Supplementary Information The online version contains supplementary material available at 10.1186/s12859-021-04170-z.

Assume a region with p SNPs. Assume a single causal SNP and let c be its index. Let λ be the expected value of the Z statistic at the causal SNP. The log likelihood under the approximate normal distribution is Absorbing constant terms with respect to the parameters The simplification is due to r c being a row of Σ. This is maximized at λ = Z c and c = argmaxZ i . The maximizing choice of c may not be unique, and the model may not be identifiable if two markers have r = 1.
Using these plugin estimates for c and λ we can impute Z 2,mis with its conditional expectation given Z obs .

S2 Tissue enrichment computations
Let C ijk be 1 if trait i is found to be colocalized with an eQTL of gene j in tissue k and 0 otherwise. We compute an enrichment odds ratio for tissue I and trait K and associated p-value from Fisher's exact test using the following 2 × 2 table.
Conditioning on colocalization when performing the enrichment analysis serves to mitigate the effect of confounding variables such as different colocalization probabilities across genes, traits and tissues. For example, there is a strong correlation between the tissue sample size and number of colocalizations detected ( Figure S8). To motivate this computation consider the following model p(C i,j,k = 1) = µα i β j γ k δ ik η ij τ jk If δ ik = 1 for all i, k and τ j,k = 1 (no interactions between colocalization in tissue and colocalization in trait, and no interactions between colocalization in gene and colocalization in trait) for all j, k, expected cell counts are µα I γ K j β j η Ij µα I j β j η Ij k =K γ k µγ K i =I j β j α i η ij µ i =I j β j α i η ij k =K γ k leading asymptotically to an odds ratio of S3 Follow up on UK Biobank analysis using GTCA-COJO The most common difference between POEMColoc and coloc applied to the UK Biobank was that a minority of datasets with high estimated colocalization probability using coloc applied to full summary statistics had low estimated colocalization probability using POEMColoc. We defined a "false negative" colocalization to be one in which coloc estimated the posterior probability of colocalization greater than 0.9, but POEMColoc estimated the posterior probability of colocalization be less than 0.1 and subjected them to further investigation. When looking at datasets "false negative" colocalization using POEM-Coloc, many appeared to have two separate peaks in the full dataset, but only one peak in the imputed dataset ( Figure 3C). One of the peaks was colocalized with the eQTL, but it was not the highest. To address this, we ran GCTA-COJO [1] on the full summary statistics from UK Biobank to detect multiple causal SNPs within a 2 Mb window surrounding each top SNP from the UK Biobank using a p-value cutoff of 5 × 10 −8 and colinearity parameter 0.8. A subsample of 5,000 individuals from the UK Biobank was used as a reference panel. When multiple SNPs were detected, we ran a conditional analysis for the top SNP conditional on all other SNPs. In some cases, the top SNP was not selected in the conditional analysis. If a SNP with LD at least 0.9 to it was, we used this in its place, otherwise the dataset was excluded from the conditional colocalization analysis. We recomputed colocalization probabilities using the conditional summary statistics as input to POEMColoc in place of the raw summary statistics. As a point of comparison, we also looked at how the same analysis changes the results for true positive or true negative datasets (defined by both POEMColoc and coloc having colocalization probabilities greater than 0.9 and less than 0.1 respectively).

S4 Precision-recall curve confidence intervals
Area under the precision-recall curve (AUCPR) was computed using the precrec R package [2]. Bootstrap confidence intervals were generated using boot R [3] for individual methods and pairwise differences in AUCPR. Following recommendations [4], we used a stratified bootstrap. Table S1: Performance of methods in predicting coloc at cutoff 0.9 in the UK Biobank and predicting causal causal variants in simulation. Area under precision recall curve (AUCPR) and 95% bootstrap confidence interval. All confidence intervals for pairwise differences in AUCPR exclude zero (from bootstrap resampling of pairwise differences), except for the difference between PIC-COLO and POEMColoc-2 on simulated datasets. % missing gives the percent of datasets that were not analyzed due to missing SNP information. . Prior probability of a SNP associated with a single trait is set to p 1 = p 2 = 10 −4 . The default value from the coloc R package used elsewhere in this work is p 12 = 10 −5 , but p 12 = 10 −6 is another common choice [5] [6]. Labels show squared correlation R 2 between POEMColoc and coloc estimates of P r(H 4 ). Analysis uses a subset of 507 phenotypes.       Here, we are only considering the 101 pairs where colocalization was detected using the full data for which all three methods could be run. Note that true positives and false negatives will in general be less than this because frequently one or more methods will yield ambiguous results with colocalization probability between 0.1 and 0.9.  Table S5: Top tissue trait enrichment -log10 p-values detected by POEMColoc-1 and eQTL p-value approach. POEMColoc-1 uses a colocalization cutoff of 0.9. eQTL p-value cutoff of ∼ 10 −12 was chosen such that the total number of gene-trait links between the two methods was the same. Because of the arbitrary choice of thresholds, using the same number of top-ranked links ensures we are using thresholds of similar stringency. All enrichments having p-value 0.05 or lower using either method, and at least 3 genes with significant eQTL or colocalizations for the trait in the tissue are shown.     Figure S18: Proportion of gene-trait pairs eQTL p-value of the GWAS top SNP and colocalization posterior probability P(H 4 ) meeting cutoffs that involve coding genes (A) and approved drug targets conditional on being coding (B). xMHC genes were excluded from both the numerator and the denominator of B because they are unavailable in approved drug target dataset.