 Software
 Open access
 Published:
kalis: a modern implementation of the Li & Stephens model for local ancestry inference in R
BMC Bioinformatics volume 25, Article number: 86 (2024)
Abstract
Background
Approximating the recent phylogeny of N phased haplotypes at a set of variants along the genome is a core problem in modern population genomics and central to performing genomewide screens for association, selection, introgression, and other signals. The Li & Stephens (LS) model provides a simple yet powerful hidden Markov model for inferring the recent ancestry at a given variant, represented as an \(N \times N\) distance matrix based on posterior decodings.
Results
We provide a highperformance engine to make these posterior decodings readily accessible with minimal preprocessing via an easy to use package kalis, in the statistical programming language R. kalis enables investigators to rapidly resolve the ancestry at loci of interest and developers to build a range of variantspecific ancestral inference pipelines on top. kalis exploits both multicore parallelism and modern CPU vector instruction sets to enable scaling to hundreds of thousands of genomes.
Conclusions
The resulting distance matrices accessible via kalis enable local ancestry, selection, and association studies in modern large scale genomic datasets.
Background
The hidden Markov model (HMM) of haplotype diversity proposed by Li & Stephens [1] (hereinafter, the LS model) has become the basis for several probabilistic phasing, ancestry inference, and demographic inference methods in modern genomics [2, 3].
Accelerated implementations of the LS model, typically targeting the Viterbi path, are integral to many commonly used genomics software packages, including BEAGLE [4], IMPUTE [5], and tsinfer [6]. A pioneering ancestry inference software package, ChromoPainter, popularized the idea of using the LS model to summarize the ancestry of \(N\) haplotypes with an \(N \times N\) similarity matrix [7]. This matrix is obtained by running \(N\) independent HMMs in which each haplotype is modelled as a mosaic of all of the other haplotypes in the sample. This ‘allvsall’ copying approach is motivated by the product of approximate conditionals (PAC) likelihood originally proposed by [1] and allows ChromoPainter to render a chromosomewide estimate of the recent ancestry of the \(N\) haplotypes with high resolution.
The Relate [2] software suite extended this idea to performing local (locusspecific) ancestry inference along the genome. Internally, Relate uses high performance C++ implementations of the forward and backward algorithm to perform posterior decoding under a modified version of the LS model that incorporates derived allele information at many loci spaced along the genome. We will refer to this modified LS model as the derived allele haplotype copying model. These posterior decodings are transformed to \(N \times N\) local genetic distance matrices and used to initialise variantspecific ancestral trees for downstream population genetic analyses ranging from demography to selection inference.
The current Relate software suite does not provide an interface for outputting the posterior decodings at a locus of interest and does not support the original LS model, only the derived allele haplotype copying model, which requires derived allele information.
Additionally, a LSlike model is implemented in [8] to run forward and backward recursions to variants of interest. However, the transition kernel used is different to the original LS model: upon a recombination event the transition kernel in [8] does not permit a haplotype to continue copying from the same donor haplotype.
The focus of kalis is to provide a highperformance engine to directly obtain the posterior decoding at a set of loci of interest for a dataset with hundreds of thousands of phased haplotypes. kalis supports the original LS model and the derived allele haplotype copying model. It provides a simple interface to enable rapid development of a range of future variantspecific ancestral inference pipelines on top, in the easy to use statistical programming language R [9].
At the same time, it has been recognised for over a decade [10] that the serial execution speed of CPUs will increase modestly, with additional performance primarily coming from concurrency via multicore architectures or the growing width of specialised single instruction, multiple data (SIMD) instruction sets. Whilst multicore architectures are now somewhat routinely exploited via forked processes or threading, SIMD instructions remain an often overlooked source of performance gains, possibly because they are harder to program. There are a cornucopia of SIMD instruction sets: on the Intel platform the genesis was in the 64bit wide MMX instruction set [11] which allows simultaneous operation on two 32bit, four 16bit or eight 8bit integers. The most recent incarnation on Intel CPUs is a suite of AVX512 instruction sets [12], now capable of operating on 512bits of various data types simultaneously (eg eight 64bit floating point, or sixteen 32bit integer values). Other CPU designs have similar SIMD technologies, such as NEON on ARM CPU [13] designs (including the Apple M1 and M2 processors, as well as Amazon Web Services Graviton range). Additionally all modern CPUs are superscalar architectures supporting instruction level parallelism, an advance that has been in the consumer Intel platform since the Pentium [14]. Judicious programming can make it easier for compilers and the deep reorder buffers of modern pipelined CPUs to exploit this more hidden form of parallelism.
In this work we provide a reformulation of the LS model and an optimised memory representation for haplotypes, which together enable us to leverage both multicore and SIMD vector instruction parallelism to obtain local genetic distance matrices for problem sizes that previously appeared out of reach. This high performance implementation is programmed in C [15], with an easy to use interface provided in R [9]. We provide lowlevel targets of AVX2, AVX512 and NEON instruction sets (covering the vast majority of CPUs in use today), and the whole package has an extensive suite of \(162,835\) unit tests.
In the Implementation section below, we start with a description of the LS model and our reformulation which makes it amenable to these highperformance CPU technologies. We also describe the technical details of the underlying lowlevel implementation for the interested reader. We then demonstrate the performance that can be achieved with kalis, including examples with 100,000 haplotypes capable of running on a single machine. We also present a real data example using kalis to examine the ancestry at the LCT gene. In the following Discussion section, we describe the user friendly R interface which enables easy use of the high performance implementation without any knowledge of the underlying CPU technologies.
The kalis package is fully documented both within the package and on the package website https://kalis.louisaslett.com/.
Implementation
The LS model
To formalize our objective, let \(h\) be an \(L \times N\) matrix of \(0\)s and \(1\)s encoding \(N\) phased haplotypes at \(L\) sites. Let \(h_i^{\ell } \in \{0,1\}\) denote the the \((\ell ,i)\)th element of \(h\). For brevity, let \(h_i\) denote the \(i\)th haplotype (the \(i\)th column of \(h\)) and \(h_{i}\) denote all of the haplotypes excluding the \(i\)th haplotype. The LS model proposes an HMM for \(h_i  h_{i}\) in which the hidden state at variant \(\ell\), \(X^{\ell }_i \in \{1,\dots ,N\} \setminus i\), is an index indicating the haplotype in \(h_{i}\) that \(h_i\) is most closely related to (or “copies from”) at variant \(l\). We present here their proposed emission and transition kernels (see Equation A1 and Equation A2 in [1]) with a simplified parametrisation that is similar, but not identical, to that used by ChromoPainter.
While the original LS model assumes that each haplotype has an equal a priori probability of copying from any other, following ChromoPainter, we define a left stochastic matrix of prior copying probabilities \(\Pi \in {\mathbb {R}}^{N \times N}\) where \(\Pi _{ji}\) is the prior probability that haplotype \(j\) is copied by \(i\) and, by convention, \(\Pi _{ii} = 0\). In other words, the donor haplotype (hidden state) that is sampled at the first variant and after every “recombination event” in the copying path is drawn according to \(\Pi\). Here and whenever possible in kalis, all matrices are columnoriented such that the \(i\)th column pertains to an independent HMM where \(h_i\) is treated as the observation. There is some probability of a miscopy at variant \(\ell\), \(\mu ^{\ell }\), which under the LS model is set proportional to the mutation rate at \(\ell\). This leads to an emission kernel of the form
The transition kernel between hidden states is based on the recombination rate between sites. Let \(m^\ell\) be the genetic distance between variant \(\ell\) and variant \(\ell+1\) in Morgans (the expected number of recombination events per meiosis). Define \(N_e = 4\tilde{N_e}/N\) where \(\tilde{N_e}\) is the effective diploid population size (ie half of the haploid effective population size). Then, under the LS model the transition kernel is
where \(\rho ^\ell = 1\exp \left( N_e m^\ell \right)\) and \({\textbf{1}}\left\{ \cdot \right\}\) is the indicator function. Intuitively, this transition kernel asserts that upon a “recombination event,” where a recipient haplotype i may change the donor haplotype j it is copying from, the new donor haplotype is resampled from the prior copying distribution \(\Pi _{\cdot i}\). [1, Appendix B] observe that in practice the estimation of recombination rates can be improved when the scaled recombination rate is raised to a power, so we adopt this approach and introduce an exponent \(\gamma\). By default, kalis sets \(\gamma = 1\), but this can be changed by the user. For \(\gamma >1\) the recombination map becomes more heavily peaked, whereas \(\gamma <1\) tempers the recombination map to make it more flat and smooth. Hence, in kalis, we set
calculated using expm1() to help avoid underflow.
In keeping with the nomenclature introduced by [7], we refer to \(h_i\) as the “recipient haplotype” and the remaining haplotypes, \(h_{i}\), as the “donor haplotypes”, in the context of the HMM where \(h_{i}\) is treated as the emitted observation vector. This reflects the fact that each recipient haplotype \(h_i\) is modelled as an imperfectly copied mosaic of the other observed haplotypes under the LS model. Hence, the posterior marginal probability at variant \(\ell\), \(p^{\ell }_{ji} := {\mathbb {P}}\left( \left. X_i^\ell = j \right h\right)\), is the probability that donor \(j\) is copied by recipient \(i\) at variant \(\ell\) given the haplotypes \(h\). Under the above definitions of the prior copying probabilities \(\Pi\), the emission kernel (1), and the transition kernel (2), the full \(N \times N\) matrix of copying probabilities at \(\ell\), \(p^\ell\), can be obtained by running the standard forward and backward recursions [16] for each column (ie for each independent HMM).
From these posterior probabilities, we calculate a local \(N \times N\) distance matrix, \(d^\ell\). Firstly, notice that theoretically \(p_{ij}^\ell > 0\), but it can be that \(p_{ij}^\ell < \varepsilon\), where \(\varepsilon\) is the double precision machine epsilon (\(\approx 2.22\times 10^{16}\), [15], pp.26). Effectively this means \(d_{ij}^\ell\) is too large to reliably work with precisely, and so for the purposes of distance calculations we treat \(\varepsilon\) as the smallest observable posterior probability, yielding
where \(\vee\) is the maximum binary operator. By convention \(d_{ii} = 0\) for all \(i\).
We proceed in the next Section to reformulate the forward and backward recursions so that we can more fully exploit modern highperformance CPU instruction sets, while preserving numerical precision.
Modification of the forwardbackward algorithm
The \(N\) independent HMMs of the LS model have forward and backward probabilities, respectively:
where \(h_{i}^{1:\ell }\) denotes haplotype \(i\) from variant \(1\) to \(\ell\) inclusive.
Define,
Then the forward and backward recursions for the LS model can be written in vector notation (subscript \(\cdot\) denoting a vectorised index),
with recursions initialised with \(\alpha _{\cdot i}^{1} \leftarrow \theta ^{1}_{\cdot i} \Pi _{\cdot i}\) and \(\beta _{\cdot i}^{L} \leftarrow 1\). Note that Eq. (7) corresponds to Equation A5 in [1].
To partially mitigate the risk of underflow, the forward recursion can be rearranged in terms of \(\alpha _{\cdot i}^{\ell } := \frac{{\tilde{\alpha }}_{\cdot i}^{\ell }}{ F_i^{\ell 1}}\), and the backward recursion in terms of \(\beta _{\cdot i}^{\ell } := \frac{{\tilde{\beta }}_{\cdot i}^{\ell }}{G_i^{\ell }}\) (see Additional file 1 for details). Thus, in full for \(\ell \in \{1,\dots ,L\}\) we compute,
and
Given \(\alpha _{\cdot i}^\ell\) and \(\beta _{\cdot i}^\ell\), the vector of posterior probabilities for recipient \(i\), \(p^\ell _{\cdot i}\), can be calculated directly by normalising,
where \(\odot\) denotes the Hadamard product. In the event that \(\sum \limits _j \alpha ^\ell _{ji} \odot \beta ^\ell _{ji} = 0\), the distance between the recipient haplotype \(i\) and all of the donor haplotypes is beyond numerical precision, so as per the earlier discussion we define \(p_{ji}^\ell = \varepsilon \ \forall \ j \ne i\).
Finally, the local distances follow by taking the negative log and symmetrising. Note that if the distances are standardised for one of these columns, to account for the fact that the standard deviation will be 0, we set all of the standardised distances to 0. Please see Additional file 1 for a discussion on parameter values and exactly how kalis performs certain computations to maintain the numerical stability of the algorithm.
Core implementation details
The R interface described hereinbefore is a thin wrapper layer around a highperformance implementation of the core algorithm which is written in standards compliant C18 [15]. Most data structures are represented with native R types enabling user inspection and manipulation, except for the haplotype sequences themselves.
Computationally, the innermost forward and backward recursions are implemented using compiler intrinsics to exploit a variety of modern CPU instruction sets, including Streaming SIMD Extensions (SSE2 and SSE4.1), Advanced Vector Extensions (AVX, AVX2, AVX512 and FMA) and Bit Manipulation Instructions (BMI2) on Intel platforms; as well as NEON on ARM platforms. AVX2 is supported in Intel CPUs of the Haswell generation (released Q2 of 2013) or later, AVX512 tends to be available only in recent Intel server and workstation grade CPUs, and NEON is available for ARM CortexA and CortexR series CPUs, as well as Apple M1/M2 and Amazon Web Services Graviton processors. Although this covers most CPUs likely to be in use today, we nonetheless provide reference implementations in pure standards compliant C which will operate on any CPU architecture with a C18 compliant compiler. During package compilation, the correct codepaths are compiled based on detection of the presence or absence of the required instruction sets, or at the direction of the user via compiler flags. See Additional file 1 for more details, and for guidance on how to directly check your CPU for SIMD support.
It may be worth noting at this juncture that it was an explicit design choice to target CPUs and not GPU or tensor cards initially. This is because most University high performance computing clusters have plentiful CPU resources, often with untapped power in advanced SIMD instructions sets. We believe that the problem size that can be realistically tackled in many genetics studies can be massively increased without needing to resort to addon cards, though to scale beyond even this we may explore heterogeneous computing architectures in future kalis research.
In this section, we now describe the inner workings and design principles of the package, first covering in detail the data structures (both user facing and internal), followed by the computational implementation.
Data structures
There are three user accessible data structures utilised in the package and a low level binary haplotype representation which is not directly user accessible. The two principle data structures of interest to users are forward and backward table objects, represented as native R lists with respective S3 class names kalisForwardTable and kalisBackwardTable (detailed in Table 2 and discussed later), which are created with package functions MakeForwardTable() and MakeBackwardTable() respectively. The third user accessible data structure holds the LS model parameters, represented as a native R environment with S3 class name kalisParameters, which can be created with the package function Parameters().
Haplotype data
The haplotypes are stored in an optimised binary representation which is only natively accessible from within C. Note that here “optimised” is not a reference to spaceoptimisation: it would be possible to represent the haplotypes in an even more compressed manner, but we aim for streaming compute speed optimisation instead.
The haplotypes are loaded from disk and transformed to an in memory cache in this representation via CacheHaplotypes(), but this function does not return any handle to the loaded data. Thus the package provides the accessor function QueryCache(), which copies genome segments from the binary representation into native R integer vectors for user inspection.
When CacheHaplotypes() loads haplotypes into the cache, they are interleaved into a flat memory space which is organised as variantmajor. That is, variant 1 of each haplotype is loaded, converted to a binary 0/1 and then 32 consecutive haplotypes are packed into an unsigned integer. Moreover, the initial flat memory allocation is aligned on a 32byte boundary to satisfy memory alignment requirements for some CPU vector instructions^{Footnote 1}, and after all haplotypes at a given variant are packed into consecutive unsigned integers the pointer is wound forward to the next 32byte boundary to ensure the next variant starts on an SSE/AVX vector compatible memory boundary. This is depicted in Fig. 1.
Firstly, note that this orientation is natural, since the forward and backward recursions operate variant by variant, meaning variantmajor storage ensures sequential memory locations are fetched during a recursion. Indeed, with the cache line size of 64bytes (starting Intel Pentium IV), we essentially trigger the loading of \(64 \times 8 = 512\) neighbouring variants upon accessing the first variant in a recursion. This effect is even more pronounced on Apple M1/M2 whose cache line size is 128bytes, resulting in 1024 variants being prefetched upon access to the first variant in a recursion.
Secondly, a possible drawback is that we must extract the individual bit into a double floating point representation in order to compute with it in the recursion. However, efficient CPU instructions can help here too: take for example the following strategy kalis uses on an AVX2 capable CPU. Using the PDEP instruction in BMI2, we can efficiently deposit a bit into every ninth bit of an int (so there are now 4 8bit integers taking on the value of the haplotype at this variant packed in an int). Then, using SSE2, SSE4.1 and AVX instructions one can inflate through representations from 4 8bit integers packed in an int up to 4 64bit doubles packed in a 256bit AVX register. As such, we are then ready to operate with this variant in parallel using AVX instructions.
During development, testing indicated the memory bandwidth and cache efficiency savings of the packed binary representation provided speedups thanks to these instructions efficiently enabling unpacking and spreading a haplotype variant bit for parallel use. Furthermore, such a compact representation means that more of L1/L2 cache and memory bus bandwidth is left available for forward and backward tables, which are the largest objects we work with in this problem.
Parameters
The parameter set used by kalis can be created by calling the Parameters() function, which retuns a kalisParameters object with structure shown in Table 1. This structure corresponds to the parameters required to specify the LS model (Eqs. (1) and (2)). To calculate \(\rho\) from a recombination map, \(N_e\) and \(\gamma\), we also provide a helper function, CalcRho(), which implements Eq. (3).
The kalisParameters object uses an environment rather than list for parameters for two reasons: (i) the parameter environment and its bindings are locked which prevents changes in parameter values between forward or backward table propagation steps, since parameters must be fixed for all steps of a given forward or backward computation; and (ii) an environment explicitly ensures the (often large) parameter vectors are not copied when associated with potentially many different tables, but will always be purely referenced.
The environment contains only two members: another environment with the actual parameter values (which is locked with lockEnvironment()); and a SHA256 hash of those parameter values (details in Table 1). The purpose of the hash is to be able to efficiently determine whether the correct parameter set for a given forward or backward table has been passed when computing forward or backward recursions from an already initialised table (since it would be incorrect to propagate forward or backward using different parameter sets in different parts of the genome).
Forward/backward tables
Recall that the recipients (columns) in the forward/backward tables correspond to independent HMMs. Therefore, kalis enables storing only a ‘slice’ of recipients in a forward/backward table, making parallelisation across nonshared memory clusters much simpler: given all haplotype data, these recipient slices can be independently propagated in a communication free manner.
The forward and backward table objects contain not only the (upto) \(N\) independent forward/backward vectors at variant \(\ell\), but also supporting metadata. This includes the variant the table is currently at, the scaling constants \(F^\ell\) (forward, Eq. (5)) or \(G^\ell\) (backward, Eq. (6)), the range of recipient haplotypes represented (that is, the recipient HMMs to which the column corresponds), and a hash of the parameter values used in propagating this table.
In total, a fullsize forward table for example requires \(8N^2+8N+1576\) bytes of memory^{Footnote 2} for storage and the small overhead of R object management. Since this grows quadratically in the number of haplotypes, most functions in the package operate on forward and backward table objects inplace, rather than via the idiomatic copyonwrite mechanism of standard R. The most important consequence of this for users is that standard assignment of a table object to another variable name only creates a reference and so an explicit copy must be made by using the CopyTable() utility function provided in the package.
Core SIMD code
The two most important core algorithms which are accelerated with SIMD vector instructions are the forward and backward recursions. This code is fully implemented in C, with tailored modifications accounting for all combinations of: scalar/vector \(\mu\), scalar/matrix \(\Pi\), and use of the asymmetric mutation model of RELATE [2] or not (ie 8 combinations); to ensure that minimal memory accesses are performed where possible. So, for example, scalar \(\mu\) and scalar \(\Pi\) parameters will be faster than any other combination since these values are likely to be held in registers (or at least L1 cache) for the duration of the recursion.
Additionally, in all places where we identify SIMD instructions may be used, a macro is deployed, with a header file providing all mappings from these macros to a specific SIMD instruction for all supported instruction sets. Taking arguably the simplest nontrivial example, all src/ExactForward*.c and src/ExactBackward*.c files make us of the custom macro KALIS_MUL_DOUBLE(X, Y) when they need to multiply KALIS_DOUBLEVEC_SIZE double precision floating point values together. The file, src/StencilVec.h then provides definitions for these macros under each instruction set kalis supports (via assembly intrinsics), together with a pure C alternative. For this example, we have (with ... indicating other macro definitions):
The innermost loop in these core files then includes a programmatically generated unroll to the depth specified during compilation. All this is wrapped in code which dispatches using pthreads to multiple threads, with automatic detection of the ability to pin to specific cores if that option is passed (important in some settings to ensure a hot L1/L2 core cache). In particular, each thread operates on a subset of columns of the forward and backward tables, ensuring spatial locality for memory accesses. Furthermore, when propagating by more than a single variant position, each column (ie each independent HMM) is propagated all the way to the target variant before proceeding to the next column, ensuing temporal locality of memory accesses.
Unit tests
Given the complexity of the development described above, we have implemented a comprehensive suite of unit tests to ensure correctness. Internal to the package is a “gold master” implementation of the LS model, which is a pure R implementation that has been written for correctness and is not optimised for speed. These pure R implementations are callable with an undocumented argument option to the standard Forward() and Backward() functions: if the argument nthreads = “R” rather than a numeric value, then the gold master implementation is used (at the cost of running significantly slower).
Unit tests fall broadly into two categories, one verifying the correctness of loading from the different supported input formats (via R matrix, .hap.gz and h5) into the optimised binary representation of Fig. 1, the other checking forward and backward computations against a ground truth computed by the gold standard R implementation. The latter category of tests are the most extensive, since they cover tests of all combinations of: single threaded and multithreaded computation; moving different numbers of variants in a single call; different problem sizes where the numbers of haplotypes is either exactly divisible by the CPU vector unit length (i.e. 256bits for AVX2 etc), or has different remainders; original LS and derived allele haplotype copying model; scalar and vector mutation probabilities (\(\mu\)); uniform and matrix copying probabilities (\(\Pi\)); and in the case of backward recursions, all combinations of starting and ending a recursion in standard or rescaled probability space (beta.theta argument to Backward()).
All these combinations give rise to over 162, 000 tests (note also that the exact number of tests varys by architecture due to the differing vector unit lengths). This large number of tests ensures all the separately optimised code paths for the various combinations of runtime options are covered. We note that the tests take quite some time to run (e.g. potentially 3060 minutes on a laptop), precisely because the gold master R code is run to provide the ground truth for these tests.
If a user wishes to confirm correctness on their particular platform, they can be run with the following commands:
Results
We provide a brief overview of some example performance figures, though due to the highly tuned nature of kalis, the exact performance you can expect will be heavily dependent on your exact computer architecture and resources.
First, it is important to note we do not claim to have altered the scaling properties of the LS model, only that we provide an implementation which is highly optimised within the scaling constraints inherent to the model. As such, Fig. 2 demonstrates that kalis indeed inherits the \({\mathcal {O}}(N^2)\) and \({\mathcal {O}}(L)\) properties of the original LS model.
We turn now to the benefits kalis does provide.
Firstly, for some of the reasons highlighted in the previous Section, kalis exhibits accelerated performance when propagating the forward/backward recursions over more extended stretches of the genome. This is because every effort has been made to be cache efficient, so that when more than a single variant step is taken, the strong cache locality design ensures that we are not memory bandwidth limited. This effect can be seen quite dramatically in Fig. 3 by the rapid decrease in compute time per variant as longer stretches are propagated.
Secondly, the hardcoded loop unrolling functionality which can be controlled at compile time by the user can be seen to be beneficial in Fig. 3. Clearly excessive loop unrolling is harmful, with depth 32 unrolls actually being substantially slower than no unrolling. However, unrolling to depth 8 does give a clear improvement. The best choice of unrolls will be both problem and architecture dependent, so we recommend testing different unroll levels on the target problem before performing long compute runs.
Figure 3 also illustrates that the handdesigned use of lowlevel vector SIMD instructions is not superfluous, with substantial speedup afforded by their use (the difference between dashed and solid lines of the same colour).
Finally, Fig. 4 shows that in certain very large problem settings kalis’ ability to pin threads can make a substantial difference. In this setting, AVX2 showed the greatest benefit from eliminating context switching, ensuring that the cache is not invalidated by threads migrating between cores. The lack of substantial difference between AVX2 and AVX512 here once thread pinning is employed calls for some investigation, though this may be a result of thermal/power throttling which is known to occur especially for AVX512 heavy code [17].
These performance examples again highlight the importance of pilot benchmark runs with different configurations of instruction set and unroll settings before embarking on long compute runs to ensure the greatest compute efficiency is achieved for a given problem and compute architecture.
Benchmarking comparison
We performed two benchmarking experiments to compare the implementations of the forward and backward algorithms in kalis to those in Relate [2]. While several other leading software suites, including BEAGLE [4] and IMPUTE [5], use high performance implementations of the LS model, we chose to compare to Relate because it is explicitly optimized to target locusspecific \(N \times N\) genetic distance matrices analogous to those produced by kalis. We based all of our benchmarks on the same set of haplotypes, taken from the 1000 Genomes Project [18], as used in our realdata example which follows below. The data include 5008 haplotypes observed at 29193 variants.
kalis can perform the forward and backward recursions under either the original LS model (the default) or the derived allele haplotype copying model if use.speidel=TRUE is passed to the Parameters() function. Since Relate only computes these recursions for the derived allele copying model, it can exploit the asymmetry in the emission kernel based on the derived allele orientation of each variant. When painting a given recipient haplotype as a mosaic of donor haplotypes, this allows Relate to effectively skip all variants where a recipient haplotype does not carry the derived allele. This acceleration cannot be applied to the original LS model, which kalis was primarily designed for, because the symmetric emission kernel requires both the forward and backward algorithms to iterate over every variant for every recipient haplotype. Even with the derived allele copying model activated, kalis will still visit every variant for every recipient haplotype.
Accordingly, we found that the forward and backward recursions were approximately \(4\times\) faster using Relate rather than using kalis. However, if Relate visits every locus, as would be necessary to compute the original LS model, we found that the forward and backward recursions were approximately \(6\times\) faster using kalis rather than using Relate. This demonstrates the benefit of the lowlevel optimisations made in kalis. In principle kalis could also employ the same optimisation as Relate and visit only derived sites for every recipient haplotype. We consider this an exciting avenue of future research. Otherwise, kalis and Relate would be expected to share similar algorithmic scaling properties in data size.
Full details of how this benchmarking was performed are provided in Additional file 1, Section D.
Realdata example: recent selection for lactase persistence
LCT is a gene on chromosome 2 that encodes lactase, the enzyme responsible for the breakdown and digestion of lactose, the sugar commonly found in milk. Ancestral humans had a regulatory ‘switch’ on chromosome 2 that stops lactase production after infancy when children would be weaned off breast milk. Mutations that disrupt this switch allow lactase production to persist into adulthood, conferring a lifelong ability to extract energy from milk [19]. Such mutations have arisen independently at least twice in human history, in Europe and in East Africa, and are among the strongest examples of recent positive natural selection in humans [20, 21]. These mutations have been shown to spread across standard human population boundaries. For example, [22] used another implementation of the LS model to compare haplotypes at the LCT locus sampled from the West African Fula population to haplotypes collected from across Europe and Asia as part of the 1000 Genomes project [18]. They found that the genetic distance between Fulani haplotypes and Eurasian haplotypes was unusually small at the LCT locus. With some further analysis, they interpreted this as evidence that a European haplotype conferring lactase persistence became prevalent within the West African Fula population due to recent natural selection sometime over the past two thousand years.
Although it is difficult to directly replicate [22] since the Fulani samples they studied are not a part of the 1000 Genomes project, we take inspiration from their analysis. Here we present a small example using kalis to informally investigate whether there is evidence of recent geneflow from Eurasia into any of the African populations in the 1000 Genomes dataset at the lactase locus. We run kalis on 5008 haplotypes from the 1000 Genomes Phase 3 release to revisit the haplotype structure around LCT; the haplotypes are sampled from 26 subpopulations all over the world [18]. Figure 5 shows a clustered version of a distance matrix, calculated as in Eq. (4), at a variant in the regulatory region of LCT (rs4988235). To see if we could observe a pattern of geneflow into or out of Africa similar to what was observed by [22], we use average pairwise linkage [23] to cluster the African haplotypes separately from the nonAfrican haplotypes. In Fig. 5, distances between African haplotypes are shown in the upper left corner; nonAfrican haplotypes, in the lower right corner.
Rather than 26 clusters reflecting the 26 sampled human populations, we see that there are three very distinct lactase haplotypes that are common both within and outside Africa. This suggests that these three haplotypes, under strong positive selection pressure, recently spread across population boundaries and presumably confer lactase persistence. We cannot confirm whether any of these three haplotypes correspond to the one identified in the Fulani by [22]. These three haplotypes are not the only structure we see: in the upper left corner of the African (AFR) block we see some haplotypes that are only found inside Africa; and in the nonAfrican block, a haplotype that is only found outside Africa. We can also see some substructure within the clear haplotype blocks.
The code to reproduce this example is available in the examples directory of repository associated with this paper (https://github.com/louisaslett/kalisbmc), as a vignette in the package (if vignettes built at install time), or directly at the kalis package website https://kalis.louisaslett.com/articles/lct_example.html
Discussion
In Additional file 2, we introduce the package from a user perspective, from package installation right through to decoding a single variant position in R using kalis.
There are many avenues for future research in developing kalis. On the model side, for example, allowing for different recombination rates between subpopulations as done in fastPHASE [24] would be a natural extension.
On the computational side, ARM scalable vector extensions [25] represent an interesting new approach to SIMD instruction sets, where the width of instructions need not be hard coded prior to compilation. At present it is not widely available, but as this rolls out, it would be natural to extend kalis to enable targeting this new instruction set.
An important utility extension is expanding the file formats that kalis can natively read via CacheHaplotypes(), to enable simpler and more streamlined software pipelines when bioinformaticians incorporate kalis into their workflows.
Additionally, during development of kalis we have been congnisant of the potential interest in using the core C code from other languages. Therefore all core computational C code has been kept as lowdependency as possible, and in particular has no dependencies on R or any other external libraries. We hope in future to release a pure C library, or to provide other language bindings directly.
Finally, a future avenue of potential development is extension of kalis to support GPU or tensor cards. Note that it was an explicit design choice to initially target CPU SIMD extensions, since the vast majority of University high performance computing clusters have a huge amount of untapped compute power in this form, but often much more limited availability of specialist extension cards. Therefore, by pushing performance as extensively as possible via CPU only means, we provide the greatest potential impact for end users. This does not preclude future versions adding support for addon compute cards.
Conclusion
kalis provides a R interface to a highly optimized C implementation of the LS model that enables local ancestry, selection, and associations studies in modern large genomic datasets.
Availability and requirements
 Project name::

kalis
 Project home page::
 Operating system(s)::

Linux, MacOS, Windows
 Programming language::

R, C
 Other requirements::

R (\(\ge\) 3.5.0)
 License::

GPL (\(\ge\) 3)
 Any restrictions to use by nonacademics::

None beyond GPL (\(\ge\) 3).
Availability of data and materials
The package source code repository is at https://github.com/louisaslett/kalis. All scripts for reproducing the results of this paper are available in this repository https://github.com/louisaslett/kalisbmc. The two external dependencies are: 1000 Genomes data which are available for download from https://www.internationalgenome.org/; and the msprime simulator, which may be downloaded from https://tskit.dev/software/msprime.html.
Notes
Certain modern CPUs do not require specific alignment to be able to load memory to SSE/AVX registers, but for maximum compatibility we honor the alignment anyway.
Measured under R 4.2.2
Abbreviations
 LS model:

Li & Stephens model
 HMM:

hidden Markov model
 SIMD:

single instruction, multiple data
References
Li N, Stephens M. Modeling linkage disequilibrium and identifying recombination hotspots using singlenucleotide polymorphism data. Genetics. 2003;165:2213–33.
Speidel L, Forest M, Shi S, Myers SR. A method for genomewide genealogy estimation for thousands of samples. Nat Genet. 2019;51:1321–9.
Song YS. Na Li and Matthew Stephens on Modeling Linkage Disequilibrium. Genetics. 2016;203:1005–6.
Browning BL, Tian X, Zhou Y, Browning SR. Fast twostage phasing of largescale sequence data. Am J Hum Genet. 2021;108:1880–90.
Rubinacci S, Delaneau O, Marchini J. Genotype imputation using the positional burrows wheeler transform. PLoS Genet. 2020;16:e1009049.
Kelleher J, et al. Inferring wholegenome histories in large population datasets. Nat Genet. 2019;51:1330–8.
Lawson DJ, Hellenthal G, Myers S, Falush D. Inference of population structure using dense haplotype data. PLoS Genet. 2012;8:e1002453.
Rosen YM, Paten BJ. An averagecase sublinear forward algorithm for the haploid Li and Stephens model. Algorithms Mol Biol. 2019;14:1–12.
R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria 2023. https://www.Rproject.org/.
Sutter H. The free lunch is over: a fundamental turn toward concurrency in software. Dr Dobb’s J. 2005;30:202–10.
Peleg A, Weiser U. MMX technology extension to the Intel architecture. IEEE Micro. 1996;16:42–50.
Intel Corporation. Intel Architecture Instruction Set Extensions and Future Features. Tech. Rep. 319433046 (2022).
ARM. NEON Programmer’s Guide. Tech. Rep. DEN0018A ID071613 (2013).
Alpert D, Avnon D. Architecture of the Pentium microprocessor. IEEE Micro. 1993;13:11–21.
ISO. ISO/IEC 9899:2018 Information technology—Programming languages—C Fourth edn (BSI, 2018). https://www.iso.org/standard/74528.html.
Rabiner LR. A tutorial on hidden Markov models and selected applications in speech recognition. Proc IEEE. 1989;77:257–86.
Schöne R, Ilsche T, Bielert M, Gocht A, Hackenberg D. IEEE (ed.) Energy efficiency features of the Intel SkylakeSP processor and their impact on performance. (ed. IEEE) 2019 International Conference on High Performance Computing & Simulation (HPCS), 2019. pp. 399–406.
Consortium GP, et al. A global reference for human genetic variation. Nature. 2015;526:68.
Ingram CJ, Mulcare CA, Itan Y, Thomas MG, Swallow DM. Lactose digestion and the evolutionary genetics of lactase persistence. Hum Genet. 2009;124:579–91.
Ranciaro A, et al. Genetic origins of lactase persistence and the spread of pastoralism in Africa. Am J Hum Genet. 2014;94:496–510.
Bersaglieri T, et al. Genetic signatures of strong recent positive selection at the lactase gene. Am J Hum Genet. 2004;74:1111–20.
Busby G, et al. Inferring adaptive geneflow in recent African history. BioRxiv 2017;205252.
Sokal RR. A statistical method for evaluating systematic relationships. Univ Kansas Sci Bull. 1958;38:1409–38.
Scheet P, Stephens M. A fast and flexible statistical model for largescale population genotype data: applications to inferring missing genotypes and haplotypic phase. Am J Hum Genet. 2006;78:629–44.
Stephens N, et al. The ARM scalable vector extension. IEEE Micro. 2017;37:26–39.
Acknowledgements
Both authors would like to acknowledge Professor Ira Hall, Professor Chris Holmes, and Dr Chris Spencer for their discussions and advice on this project.
Funding
This project was supported by the NHGRI Centers for Common Disease Genomics grant (UM1HG008853), active from 20152022.
Author information
Authors and Affiliations
Contributions
LA architected and wrote the Ccore. LA and RC collaborated on the R interface. RC conducted the realworld lactase persistence example. LA and RC wrote and approved the final manuscript.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
Additional file 1.
An appendix containing full derivations of mathematical reformulation; software installation details; HDF5 file format specifications; and additional benchmarking details.
Additional file 2.
A user guide to the kalis R package.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Aslett, L.J.M., Christ, R.R. kalis: a modern implementation of the Li & Stephens model for local ancestry inference in R. BMC Bioinformatics 25, 86 (2024). https://doi.org/10.1186/s12859024056888
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s12859024056888