 Software
 Open access
 Published:
rBahadur: efficient simulation of structured highdimensional genotype data with applications to assortative mating
BMC Bioinformatics volumeÂ 24, ArticleÂ number:Â 314 (2023)
Abstract
Existing methods for generating synthetic genotype data are illsuited for replicating the effects of assortative mating (AM). We propose rb_dplr, a novel and computationally efficient algorithm for generating highdimensional binary random variates that effectively recapitulates AMinduced genetic architectures using the Bahadur order2 approximation of the multivariate Bernoulli distribution. The rBahadur R library is available through the Comprehensive R Archive Network at https://CRAN.Rproject.org/package=rBahadur.
Background
The simulation of realistic genotype/phenotype data is a fundamental tool in statistical genetics and is essential for the development of robust statistical methods for the analysis of genomewide data. As such, much prior effort has focused on generating synthetic data that recapitulate salient characteristics of genetic marker data, including local linkage disequilibrium (LD) structure induced by variable recombination rates [1,2,3], relationships between local LD structure and allelic effects [4], and the many consequences of drift, admixture, and geographic stratification [5, 6].
Despite these advances, existing methods are illsuited for generating synthetic genotype/phenotype data reflecting the consequences of recent assortative mating (AM); in contrast to the effects of recombination, which yields banded covariance structures, AM induces dense covariance structures reflecting signconsistent dependence among all causal variants across the genome [7, 8]. On the other hand, there is substantial recent evidence that AM is widespread [8,9,10] and complicates the interpretation of many commonly applied methods in statistical genetics, including heritability estimation [11], genetic correlation estimation [8, 12], and Mendelian randomization [13]. Efficient simulation methods for generating highdimensional genotype/phenotype data congruent with the consequences of AM will be critical to the development of robust analytic tools.
SNP haplotypes subject to AMinduced longrange dependencies can be represented mathematically by the mdimensional multivariate Bernoulli (MVB) distribution [14], which is challenging to sample from; existing methods require \(O(m^2)\) or more operations to draw a vector of m haplotypes [1, 3, 15, 16] (see Fig.Â 1), which becomes infeasible in high dimensions. As such, existing methods for synthesizing AMconsistent marker data at scale obviate this problem by generating genotype / phenotype data assuming random mating and subsequently proceeding through multiple generations of forwardtime simulation, complete with mating and meioses [2, 8, 11]. However, these methods require repeatedly shuffling the elements of large arrays and simulating the genotypes of a large number of individuals to obtain a relatively small sample of unrelated individuals, making them cumbersome in the context of methods development.
In the current manuscript, we introduce a novel collection of efficient methods for directly sampling highdimensional MVB random variables satisfying particular moment conditions (i.e., admitting a Bahadur order2 representation; [17]). In particular, we propose the rb_dplr algorithm, which exploits the diagonalpluslowrank correlation structure induced by AM to generate MVB samples using only O(m) operations. We then present numerical experiments demonstrating that the proposed methods outperform existing direct sampling methods and verify that they faithfully represent the effects of AM by comparing results to forwardtime simulations. We provide these methods, together with a collection of utilities for characterizing the equilibrium distribution of haplotypes under AM, in rBahadur, an opensource library for the R programming language.
Implementation
Overview of the rBahadur library
The rBahadur library consists of three component collections: First, we provide two generalpurpose MVB samplers, rb_unstr and rb_dplr, the implementation of which we discuss in the following section. Second, we provide utilities for modeling equilibrium AM, including a set of convenience functions for computing equilibrium parameters given initial conditions and for parametrizing the corresponding MVB distribution. Third, we provide a routine for endtoend simulation that combines the first two components to efficiently construct equilibrium genotypes and phenotypes given population parameters.
Bahadur approach to the MVB distribution
Suppose \(X_1, \ldots , X_m\) are Bernoulli random variables with means \(\mu _i :=\mathbb {E}[X_i]\). When the variables are independent, the distribution of \((X_1, \ldots , X_m)\) is simply \(p_{[1]}(x_1, \ldots , x_m) :=\prod _{i=1}^m \mu _i^{x_i} (1\mu _i)^{1x_i}\), but this is not the case in general. Bahadur [17] showed that the distribution of an MVB takes the form
where \(x = (x_1, \ldots , x_m)\). The function f in (1) is defined as
where \(z_i :=(x_i  \mu _i)/\sqrt{\mu _i(1\mu _i)}\) and \(r_{i_1 \ldots i_n} :=\mathbb {E}[ z_{i_1} \cdots z_{i_n} ]\). The means \((\mu _i)\) and mixed moments \((r_{ij}), (r_{ijk})\), and so forth, characterize the MVB and are comprised of \(2^m1\) parameters. This exponential dependence on m makes working with general MVBs challenging.
We consider Bahadur order2 approximations to the distribution in (1); i.e., we assume that \(r_{i_1 \cdots i_n} = 0\) for \(n \ge 3\). Thus, the order2 MVB distribution is fully characterized by its means \((\mu _i)\) and correlations \((r_{ij})\). rBahadur provides two methods for sampling from this distribution. The first algorithm (rb_unstr) can handle generic correlation matrices and requires \(O(m^2)\) operations to sample from the mdimensional MVB. The second algorithm (rb_dplr) is developed specifically for the case when the correlation matrix is diagonalpluslowrank (DPLR; i.e., \((r_{ij}) = \textbf{D} + \textbf{U} \textbf{U}^T\) where D is diagonal and U is \(m \times c\) for some \(c \ll m\)) and requires O(mc) operations. Both methods sample the entries of the random vector sequentially: First, a realization of \(X_1\) is drawn. Then, subsequent variables \(X_n\) are drawn conditionally on the realization of the previously drawn variables \(X_1, \ldots , X_{n1}\) for \(2 \le n \le m\). For further details, see Additional file 1.
Equilibrium distribution of causal variants under AM
Here we demonstrate how the DPLR order2 MVB distribution is used to model the consequences of assortment. Consider the equilibrium distribution of haploid causal variants \(X_1,\dots ,X_m\) with allele frequencies \(\mu _1,\dots ,\mu _m\) under primaryphenotypic assortative mating for an additive phenotype with panmictic heritability \(h^2_0\), panmictic genetic variance \(\sigma ^2_{g,0}=h^2_0\), and crossmate phenotype correlation r. Following Nagylaki [7], the equilibrium heritability is
and the equilibrium crossmate genetic correlation and genetic variance are respectively \(r_{g,\infty }=r\cdot h^2_\infty\) and \(\sigma ^2_{g,\infty } = \sigma ^2_{g,0}/(1r_{g,\infty })\). Additionally, we denote the equilibrium phenotypic variance \(\sigma ^2_{y,\infty }\) and the standardized haploid effects \(\varvec{\beta }\).
Assuming casual variants are unlinked at panmixis, the correlation matrix of causal haploid variants will be of the form \(\textbf{R}=\textbf{D}+{\varvec{\phi} }{\varvec{\phi} }^T\) where, following Border et al. [11], \({\varvec{\phi} }:\mathbb {R}^m\rightarrow \mathbb {R}^m\) is a function of the standardized haploid effects \(\varvec{\beta }\) given elementwise by
and \(\textbf{D}\) is the diagonal matrix with entries \(\textbf{D}_{kk}=1\phi _k^2\). Setting \(\textbf{U}={\varvec{\phi}}\) allows drawing haploid causal variants from the equilibrium distribution under AM via \({\texttt {rb}\_\texttt {dplr}}\).
The \({\texttt {rBahadur}}\) library includes functions for computing equilibrium parameters under this model: vg_eq, h2_eq, and rg_eq compute equilibrium parameters given the initial conditions \(\sigma ^2_{g,0}\), \(h^2_0\), and r. Finally, am_covariance_structure parametrizes the corresponding DPLR MVB distribution for a specified set of allele frequencies, causal effects, and initial conditions, by using (3) to compute \({\varvec{\phi} }\).
Simulating genotype/phenotype data with rBahadur
rBahadur provides the am_simulate routine for simplified endtoend simulation of genotype/phenotype data. am_simulate requires the user to specify the panmictic heritability and mating correlation parameters, as well as the desired number of diploid causal variants and number of simulation replicates (i.e., individuals). am_simulate returns genotypes, phenotypes, as well as additional architectural components, including the allele frequencies, allelesubstitution effects, and the heritable component of the generated phenotype. We provide a vignette illustrating usage of am_simulate in further detail in Additional file 1.
Numerical experiments
Figure 1a compares the time required to generate m binary haplotypes for \(n=m/4\) individuals under the equilibrium AM model, with \(h_0^2=r=0.5\), across existing and proposed methods. The proposed rb_dplr algorithm scales linearly in sample size and problem dimension. Both rb_dplr and the unstructured variant rb_unstr outperform existing methods including haplosim [3], three methods implemented by Kruppa et al. (kruppa*) [1], and GenOrd [15]. Results for the mipfp library [16], which had exponential time complexity, are omitted.
Figure 1b compares heritabilities associated with genotype / phenotype data as generated with rb_dplr under the equilibrium AM model versus those achieved after of up to 20 generations of the corresponding forwardtime procedure, as implemented by Border et al. [8], across 100 replicates. Results were consistent across methods (comparing rb_dplr \(h^2\) to mean \(h^2\) values across forwardtime generations 1620, Welchâ€™s \(t(99)=0.84\), \(p=0.40\)). Bestfit lines and standarderrors summarize variation across 100 replicates with 2000 haploid causal variants for 8000 individuals. Here, rb_dplr provides a direct and efficient alternative to forward time approaches that can be readily incorporated into sensitivity analysis and methods development pipelines.
Conclusions
Given that AM is both widespread [10, 11] and consequential for the interpretation of markerbased estimators [8, 11, 13], it is crucial that statistical geneticists are able to perturb randommating assumptions when developing and evaluating methods. To this end, we have developed the rBahadur library to efficiently sample haploid causal variants under AMinduced genetic architectures. The software is opensource and freely available through Comprehensive R Archive Network at https://CRAN.Rproject.org/package=rBahadur.
Our approach is limited by the requirement that the target distribution admits a secondorder Bahadur approximation (i.e., there is a valid probability distribution with the specified allele frequencies and correlations). For far apart causal variants, this is of little consequence as the true values of moments up to order four are expected to be smaller than their sampling variances unless \(n\gg m\), which is rarely the case in practice [11]. However, this limits applications to complex correlation structures involving both strong local LD and AMinduced global dependence. We address this by ensuring the simulation functions in rBahadur fail transparently in such cases and provide a vignette demonstrating how rBahadur can be used in conjunction with reference haplotypes to overcome this limitation in Additional file 1.
Availability of data and materials
Project name: rBahadur Project homepage: https://CRAN.Rproject.org/package=rBahadur Operating system: Platform independent Programming language: R Other requirements: Not applicable License: GNU General Public License v3.0 Any restrictions to use by nonacademics: None.
Abbreviations
 AM:

Assortative mating
 DPLR:

Diagonalpluslowrank
 LD:

Linkage disequilibrium
 MVB:

Multivariate Bernoulli (distribution)
References
Kruppa J, Lepenies B, Jung K. A genetic algorithm for simulating correlated binary data from biomedical research. Comput Biol Med. 2018;92:1â€“8. https://doi.org/10.1016/j.compbiomed.2017.10.023.
Tahmasbi R, Keller MC. GeneEvolve: a fast and memory efficient forwardtime simulator of realistic wholegenome sequence and SNP data. Bioinformatics. 2017;33(2):294â€“6. https://doi.org/10.1093/bioinformatics/btw606.
Coster A, Bastiaansen J. HaploSim: Functions to Simulate Haplotypes. R package version 1.8.4.2. https://CRAN.Rproject.org/package=HaploSim.
Speed D, Cai N, Johnson MR, Nejentsev S, Balding DJ. Reevaluation of SNP heritability in complex human traits. Nat Genet. 2017;49(7):986â€“92. https://doi.org/10.1038/ng.3865.
Baumdicker F, Bisschop G, Goldstein D, Gower G, Ragsdale AP, Tsambos G, et al. Efficient ancestry and mutation simulation with Msprime 1.0. Genetics. 2022;220(3):iyab229. https://doi.org/10.1093/genetics/iyab229.
DeGiorgio M, Rosenberg NA. Geographic sampling scheme as a determinant of the major axis of genetic variation in principal components analysis. Mol Biol Evol. 2013;30(2):480â€“8. https://doi.org/10.1093/molbev/mss233.
Nagylaki T. Assortative mating for a quantitative character. J Math Biol. 1982;16(1):57â€“74. https://doi.org/10.1007/BF00275161.
Border R, Athanasiadis G, Buil A, Schork AJ, Cai N, Young AI, et al. Crosstrait assortative mating is widespread and inflates genetic correlation estimates. Science. 2022;378(6621):754â€“61. https://doi.org/10.1126/science.abo2059.
Howe LJ, Lawson DJ, Davies NM, Pourcain BS, Lewis SJ, Smith GD, et al. Genetic evidence for assortative mating on alcohol consumption in the UK biobank. Nat Commun. 2019;10(1):1â€“10. https://doi.org/10.1038/s4146701912424x.
Yengo L, Robinson MR, Keller MC, Kemper KE, Yang Y, Trzaskowski M, et al. Imprint of assortative mating on the human genome. Nat Hum Behav. 2018;2(12):948. https://doi.org/10.1038/s4156201804763.
Border R, Oâ€™Rourke S, de Candia T, Goddard ME, Visscher PM, Yengo L, et al. Assortative mating biases markerbased heritability estimators. Nature Commun. 2022;13(1):660. https://doi.org/10.1038/s41467022282949.
Howe LJ, Nivard MG, Morris TT, Hansen AF, Rasheed H, Cho Y, et al. WithinSibship genomewide association analyses decrease bias in estimates of direct genetic effects. Nat Genet. 2022;54(5):581â€“92. https://doi.org/10.1038/s41588022010627.
Brumpton B, Sanderson E, Heilbron K, Hartwig FP, Harrison S, Vie GÃ…, et al. Avoiding dynastic, assortative mating, and population stratification biases in mendelian randomization through withinfamily analyses. Nat Commun. 2020;11(1):1â€“13.
Teugels JL. Some representations of the multivariate Bernoulli and binomial distributions. J Multivar Anal. 1990;32(2):256â€“68. https://doi.org/10.1016/0047259X(90)90084U.
Barbiero A, Ferrari PA.: GenOrd: simulation of discrete random variables with given correlation matrix and marginal distributions.
Barthelemy J, Suesse T, NamaziRad M.: Mipfp: multidimensional iterative proportional fitting and alternative models.
Bahadur RR. A representation of the joint distribution of responses to n dichotomous items. In: Studies in Item Analysis and Prediction. Stanford, California: Stanford University Press; 1961. p. 158â€“68.
Acknowledgements
Not applicable
Funding
RB was partially supported by a grant from the National Institutes of Health (T32NS048004). OAM was partially supported by the AFOSR grant FA95502010138 and by the National Science Foundation under Grant No. 1810314.
Author information
Authors and Affiliations
Contributions
RB and OAM developed the method, wrote the software, ran numerical experiments, and contributed to the manuscript.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interest
The authors declare that they have no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
Additional file 1:
Supplementary note.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Border, R., Malik, O.A. rBahadur: efficient simulation of structured highdimensional genotype data with applications to assortative mating. BMC Bioinformatics 24, 314 (2023). https://doi.org/10.1186/s12859023054426
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s12859023054426