An optimal variant to gene distance window derived from an empirical definition of cis and trans protein QTLs

Background A genome-wide association study (GWAS) correlates variation in the genotype with variation in the phenotype across a cohort, but the causal gene mediating that impact is often unclear. When the phenotype is protein abundance, a reasonable hypothesis is that the gene encoding that protein is the causal gene. However, as variants impacting protein levels can occur thousands or even millions of base pairs from the gene encoding the protein, it is unclear at what distance this simple hypothesis breaks down. Results By making the simple assumption that cis-pQTLs should be distance dependent while trans-pQTLs are distance independent, we arrive at a simple and empirical distance cutoff separating cis- and trans-pQTLs. Analyzing a recent large-scale pQTL study (Pietzner in Science 374:eabj1541, 2021) we arrive at an estimated distance cutoff of 944 kilobasepairs (95% confidence interval: 767–1,161) separating the cis and trans regimes. Conclusions We demonstrate that this simple model can be applied to other molecular GWAS traits. Since much of biology is built on molecular traits like protein, transcript and metabolite abundance, we posit that the mathematical models for cis and trans distance distributions derived here will also apply to more complex phenotypes and traits. Supplementary Information The online version contains supplementary material available at 10.1186/s12859-022-04706-x.

are somehow influencing the gene encoding the protein resulting in differences in levels of the protein. Such a DNA variant is termed a protein quantitative trait locus, or pQTL.
Many GWAS of protein abundance have now been conducted generating tens of thousands of published pQTLs. The genomic position of a pQTL for a specific protein is often located near the cognate gene. As a practical matter, a pQTL near a cognate gene is called a "cis-pQTL" with the assumption that that pQTL is acting through the cognate gene. When a pQTL is located far from the cognate gene (or on a different chromosome), this is called a "trans-pQTL" with the assumption that that pQTL is acting through an intermediate gene, for example a transcription factor near the trans-pQTL which then drives expression of the cognate gene. Although the mechanisms driving "cis-pQTLs" and "trans-pQTLs" are completely different, there is nothing in the statistical associations themselves that can easily distinguish these. Consequently different studies have applied different distance cutoffs to segregate cis-pQTLs and intrachromomsomal trans-pQTLs, with typical distances being 500,000 bp or 1,000,000 bp (e.g., [1][2][3][4]).
In order to arrive at an empirical definition of cis-and (intrachromosomal) trans-pQTLs we took a closer look at the distribution of variants to cognate gene from a recent large scale pQTL study [4]. We start with the simple assumption that for cis-pQTLs there should be some (non-random) distance dependence between the variant and the transcription start site (TSS) of the cognate gene, while for trans-pQTLs, there should be a random distribution of distances. From these assumptions, we build a simple model that fits the distribution of intrachromosomal variant-TSS distances across the full range of observed distances.
The second peak (longer distances) follows the expected distribution for two points selected at random from the human genome within the same chromosome, which is to say there is no explicit distance dependence on the variant-TSS distances aside from the physical requirement that the two positions occur on the same chromosome. This distance-independent distribution requires no free parameters (see Methods).
Although the first peak superficially resembles a Gaussian or normal distribution, we found empirically that we could obtain a better fit with fewer parameters using the Weibull distribution (see Methods). The Weibull distribution thus provides a specific model for the distance dependence on the distribution of variant-TSS distances.
The full model requires a third parameter which is the proportion of the observations falling in to the first peak or the second peak. It should be noted that when plotted using the untransformed variant-TSS distance the distribution resembles a rapidly decreasing exponential decay, with a maximum density close to 0 (Fig. 2).
The full distribution is well fit by our combined model. The three parameters (and the standard errors) for the full model are κ (Weibull shape parameter) = 6.78 (0.211), λ (Weibull scale parameter) = 4.48 (0.024), and Weibull fraction = 0.799 (Table 1). The Histogram of log base 10 of the distance from lead SNP for GWAS of protein abundance to the transcription start site (TSS) of the cognate gene for that protein, for 2051 unique proteins from the study of Pietzner, et al. Four bins are used for each log unit. Solid red line represents the best fit Weibull distribution curve fit to all data points below 10 5.75 . Solid blue line represents best fit random distribution curve fit to all pQTLs with a distance beyond above 10 6 base pairs. Dashed purple line represents best combined model starting from the parameters estimated for the initial Weibull curve and adding a Weibull fraction parameter to add the Weibull curve and the trans model curve Fig. 2 Histogram of distance from lead SNP for GWAS of protein abundance to the transcription start site (TSS) of the cognate gene for that protein, for 1604 unique proteins where the distance is less than 500 kb (bin size = 10 kb), with the curve fit to our global model which includes a Weibull curve and our trans model. The Weibull model dominates a t distances less than 1,000,000 base pairs one-sample Kolmogorov-Smirnov test for this fit yields a p-value of p = 0.36, consistent with this simple model being a reasonable explanation for the observed distribution. By contrast, using instead a two parameter (mean and sd) Gaussian to fit the left-most peak in a similar three parameter model yielded an analogous Kolmogorov-Smirnov test p-value of p = 0.005, indicating a significantly poorer fit. Using these parameters from the best-fit Weibull-based model, we calculated where the regimes cross-over, that is, the distance at which the Weibull and the random model are equally likely. For this data set, with ~ 80% Weibull fraction, that cross-over occurs at a distance of 944 kilobasepairs (kbp) with a 95% confidence interval of [767-1,161] kbp. If we assume a model where the two regimes are a priori equally likely, that cross over would occur at 653 kbp, with a 95% confidence interval of [556-767] kbp.
A separate large-scale pQTL study using the same platform but applied to a different population was recently published [3]. Ferkingstad  Another genetic trait where the cognate gene is a reasonable hypothesis for the causal gene is mRNA abundance, in which case the genetic variant represents an expression quantitative trait locus (eQTL). The eqtlgen [5] study represents a very large, well powered eQTL GWAS. To reduce the computation involved this study only calculated association statistics for variants within 1 Mb of the midpoint of the cognate gene, limiting our ability to use this study to analyze long-range intrachromosomal eQTLs. Despite this truncation, the distances under 1 Mb have a reasonable fit to the Weibull distribution, with shape and scale parameters similar to those observed in the two pQTL studies (κ = 6.19 (0.315), λ = 4.47 (0.043)).
When the genetic trait is metabolite abundance (metabolite quantitative trait locus, or metabolite QTL) the known biochemistry can point to a likely causal gene [6]. Using all available metabolite QTLs in the GWAS catalog paired with a large set of curated metabolite interacting proteins [7] we identified 250 intrachromosomal variant-gene pairs, of which 53 have a variant-TSS distance of less than 1 Mb. These variant-TSS distance can also be fit with our combined model, with Weibull shape and scale parameters of κ = 4.262(1.881) and λ = 5.544 (0.708), respectively. For this analysis the Weibull fraction is only 35%, probably indicating that the limited number of curated metabolite interacting proteins is missing many of the true causal genes.

Discussion
The GWAS catalog now contains over 300,000 genetic associations, but for the majority of these the underlying causal gene, the gene mediating the phenotypic impact of the genetic variation, is unknown [8]. While the genes close to the genetic variation often represent plausible candidate genes, a precise definition of "close" has been difficult to define.
Traditionally, genetic variants influencing abundance of proteins or transcripts have been described as "cis-acting" or "trans-acting" with the understanding that "cis-acting" variants exert their influence directly on the cognate gene while "trans-acting" variants influence some other gene which then as a downstream consequence influences the transcript or protein being measured. In the absence of a simple method of determining the exact mechanism of any particular variant, researchers have typically relied on distance cutoffs to separate cis and trans variants, with the distance ranging from 250,000 bp [9] up to 1 million bps [10]. One prior attempt to empirically define the divide between cis and trans effects found that the percent of cis-eQTLs that could be supported by allele-specific expression fell with increasing variant-TSS distance and could not be distinguished from that expected by error at a distance of between 0.85 and 1.3 million base pairs [11].
In this work we relied on a very well powered pQTL study which allowed us to identify two populations of variant-gene distances; one population where the distribution of distances is a function of the distance of the gene from the variant, and a second population where the distances are dictated by the mathematics of picking two points at random. The first population follows a Weibull distribution and is substantially contained within the interval from 0 to 1 Mb. For the second distribution, because most chromosomes are over 100 Mb long, two randomly selected intrachromosomal points are almost always (99%) more than 1 Mb apart. Thus, these two populations are well-separated and can be interpreted as the mathematical representations of the biological processes of cis-and transQTLs.
Previous analyses of molecular QTLs have similarly noted a rapid drop-off of observed associations with increasing variant-TSS distance. For example, Roby Joehanes et al. used a multi-exponential decay with a median variant-TSS distance of 27 kb to model a large set of eQTLs measured in whole blood samples from over 5000 participants in the Framingham Heart Study [12]. There is however no theoretical model to rationalize an exponential decay for this distribution.
As noted by Lieberman-Aiden et al. [13], the distribution of promoter-enhancer Hi-C distances can be modeled using a power-law with an exponent of approximately -1 [14,15]. Plotted against variant-TSS distance, a power-law with p(distance) proportional to 1/distance looks similar to an exponential decay, with p(distance) proportional to e −distance . However, a simple 1/distance power-law distance dependence would not generate the curve obtained in Fig. 1, since a power-law would place an equal number of observations in each bin, since the bins increase in width with increasing distance at the same rate that p(distance) is decreasing.
The Weibull distribution used here was first described as a family of curves [16] which has found applicability to describe the distribution of particle sizes following fragmentation or fractionation [17]. Brown and Wohletz provide a mechanistic derivation for the Weibull distribution which follows from repeated fragmentation of a larger structure, with each step resulting in a fractal fragmentation pattern (thus following a power law). Smaller fragments escape further fragmentation, resulting in a rapid drop-off of larger particle sizes.
An entirely hypothetical conjecture would be that the pattern we observe in these data results from a similar superposition of multiple processes. The Activity-by-Contact model of enhancer-promoter regulation suggests that the activity of a particular enhancer-promoter pair is increased by the strength (activity) of the enhancer and decreased by the distance between the enhancer and promoter [18,19]. Since any given promoter can be influenced by multiple enhancers, the strongest genetic associations are more likely to come from closer enhancers. The dense packing of the chromosome provides the equivalent of the single fractionation event, imposing a fractal distance geometry on the genome. The fact that there are far more enhancers than promoters in the genome provides the equivalent of multiple fractionation events, potentially explaining the fit to the Weibull distribution for molecular QTLs in the range of 0 to 1 million base pairs (Fig. 3).
Trans-eQTLs and trans-pQTLs are generally understood to be acting on a gene proximal to the variant which then influences the molecular trait of interest. The cis fraction in our combined model is then likely to reflect the extent to which we have correctly selected the set of true causal genes for a given study. Further, the model suggests that in general about 99.9% of GWAS variants should be explainable through a gene with a TSS within 1 megabase of the lead variant. Thus, if a large fraction of the variant-TSS distances fall into the long-range, distance-independent regime our model suggests it is worth taking another look at the set of proposed or potential causal genes.
Assuming that most GWAS variants are likely impacting biology through their influence on molecular traits such as transcript, protein or metabolite abundance we expect that the cis-and trans-models and distributions observed here will apply to other, more complex or polygenic traits. It should be noted however that the exact mechanism linking the GWAS variant to the causal gene is not addressed in this model. It has been observed that a large fraction of pQTLs and metabolite QTLs are linked to missense variants, and that may skew the exact distributions somewhat when looking at other phenotypes or disease traits.
An important consideration in the dissection of individual loci is the observation that paralogous genes often exist near one another [20], meaning many genes with similar functions may exist in cis to the lead SNP. We avoided this scenario in this analysis by, for example, only including metabolite QTLs for which HMDB listed a single biochemically-related gene on the entire chromosome (see Methods). In practice, researchers According to the ABC model [18] of gene activation and models of chromatin compaction [14,15], the chance that a particular enhancer (E1-E4) is in contact with the promoter of a particular gene ("Gene") is proportional to distance −γ (that is, distance to the power -γ, where γ has a value of about 1) from the enhancer to the promoter. In a scenario where all enhancers are equally active, a particular gene will be most strongly influenced by the closest enhancer (E2 in this figure). A Weibull model, as observed empirically in this analysis, can result from such a "superposition" of power-law distributions [17] should look carefully at whether there are multiple plausible causal genes, such as from paralogs, which exist within the 944 kb distance cutoff recommended here.
An additional caveat is that this study focused on only the single strongest association per molecular trait (per chromosome) and this will tend to bias the set of variants as well. This simplification was applied here because while it is straight-forward to define the primary signal per locus there are still multiple approaches to defining secondary or independent signals. As molecular QTL studies continue to grow in size and power, it will be important to revisit this analysis with respect to secondary signals.

Conclusion
By leveraging recent large-scale molecular QTL genetic studies we demonstrate that variant-TSS distances fall into one of two regimes, a short-range, distance-dependent one, or a long-range, distance-independent one. These correspond to the biological notions of cis and trans genetic effects. By providing mathematical models for these two regions we demonstrate a clear separation occurring at about 944 kbp in the situation where 80% of observations are well explained by cis effects, or 650 kbp when cis and trans effects are equally likely.

Methods
Pietzner et al. reported 10,674 unique pQTLs across 3,892 distinct proteins [4]. For this analysis we only used the primary association reported for each locus. Further we eliminated all SNP-protein pairs where the SNP and the cognate gene occur on the different chromosome in the latest build (Ensembl release 104). Variants were converted from the GRCh37 coordinates provided in the original data table by reference to an rsid if provided and current, or via the NCBI Genome remapping service. Proteins were mapped to HGNC gene symbols using the Ensembl gene IDs if provided and current. Otherwise HGNC gene symbols were assigned manually from the provided protein names. As some proteins actually map to multiple genes (either due to complexes or ambiguity), SNP-protein pairs were retained if exactly one of the cognate genes was present on the same chromosome as the SNP. This resulted in a total of 2,051 intrachromosomal SNPgene pairs, where the distance from the SNP to the TSS of the gene was between 3 and 206,513,449 base pairs (Additional file 1).
Given the eight orders of magnitude range for variant-TSS distances we log-transformed the distances (specifically using 10 as the base and binning at 0.25 log 10 units to generate 4 bins per order of magnitude). For visualization and curve-fitting, we used the number of intra-chromosomal pQTLs falling in each bin based on this log-transformed distance.
Our trans model is a mathematical model of the distribution of two random positions in the genome that happen to fall on the same chromosome. To generate this trans model, we represented all pairs of two randomly selected positions within a single chromosome as an NxN matrix, where N is the length of the chromosome, and rows and columns representing the 1st and 2nd position, respectively. With the exception of a discontinuity at distance = 0 (also not handled by log-transformation, but highly unlikely and not present in our data), the number of random pairs at a distance d > 0 bp is given by the two diagonal segments shifted 'd' units from the central diagonal, and so totals 2*(N-d) out of N 2 elements (where d < N). The probability of distance 'd' within the chromosome is then prob = 2*(N-d)/ N 2 if d < N, otherwise prob = 0. For any one chromosome then this probability is linear with respect to distance, with a maximum at a distance of 1 and a minimum at a distance corresponding to the length of the chromosome.
Restricting to random pairs on the same chromosome, the relative likelihood of a pair being in any chromosome 'i' is given by its relative number of elements to that of all intra-chromosomal pairs on the genome, so that prob(Chr i ) = N i 2 /ΣN i 2 , where N i is the length of chromosome 'i' . The final probability of a random intrachromosomal pair having distance d > 0 bp is then given by the weighted sum of the probability at each chromosome, with weights given by the relative likelihood of each chromosome. Hence: where each sum is over 23 chromosomes , N i being the length of each chromosome. Applied to the observed pQTL data our trans model is only a reasonable fit in the range from 1 to 230 megabase pairs (Fig. 4).
We found empirically that the remaining (shorter) distances could be fit with the two parameter Weibull distribution. The Weibull distance distribution can be represented as: lx is the log of distance from the pQTL to the TSS of the cognate gene, κ is the shape function and λ is the scale parameter. (1) (2) P(lx) = (κ/ ) * (lx/ ) (κ−1) * e −(lx/ )κ Fig. 4 Histogram of distance from lead SNP for GWAS of protein abundance to the transcription start site (TSS) of the cognate gene for that protein, for 349 unique proteins where the distance is greater than 10 megabases (bin size = 10 megabases), with the curve fit to our global model which includes a Weibull curve and our trans model. The trans model dominates at distances past 1 megabase