 Research
 Open access
 Published:
Pangenome de Bruijn graph using the bidirectional FMindex
BMC Bioinformatics volume 24, Article number: 400 (2023)
Abstract
Background
Pangenome graphs are gaining importance in the field of bioinformatics as data structures to represent and jointly analyze multiple genomes. Compacted de Bruijn graphs are inherently suited for this purpose, as their graph topology naturally reveals similarity and divergence within the pangenome. Most stateoftheart pangenome graphs are represented explicitly in terms of nodes and edges. Recently, an alternative, implicit graph representation was proposed that builds directly upon the unidirectional FMindex. As such, a memoryefficient graph data structure is obtained that inherits the FMindex’ backward search functionality. However, this representation suffers from a number of shortcomings in terms of functionality and algorithmic performance.
Results
We present a data structure for a pangenome, compacted de Bruijn graph that aims to address these shortcomings. It is built on the bidirectional FMindex, extending the ability of its unidirectional counterpart to navigate and search the graph in both directions. All basic graph navigation steps can be performed in constant time. Based on these features, we implement subgraph visualization as well as lossless approximate pattern matching to the graph using search schemes. We demonstrate that we can retrieve all occurrences corresponding to a read within a certain edit distance in a very efficient manner. Through a case study, we show the potential of exploiting the information embedded in the graph’s topology through visualization and sequence alignment.
Conclusions
We propose a memoryefficient representation of the pangenome graph that supports subgraph visualization and lossless approximate pattern matching of reads against the graph using search schemes. The C++ source code of our software, called Nexus, is available at https://github.com/biointec/nexus under AGPL3.0 license.
Background
Modern sequencing platforms enable the rapid sequencing of genomes. Whereas one consensus reference genome per species used to be the norm, it is now common to have thousands of genomes for a single species. New techniques must be developed to efficiently store, manipulate, analyze and visualize large genomic collections (often representing a species or clade). These collections, analyzed jointly or used as a reference, are referred to as pangenomes [1, 2]. A key innovation in pangenomics is the adoption of graphs as the primary form of representation, as graphs are inherently suited to summarize multiple genomes into a single data structure by compacting shared regions into common nodes. As such, pangenome graphs can robustly and intuitively encode natural variation, such as SNPs and structural variation [3].
Pangenome graphs can be sequencebased, genebased, or a combination. Sequencebased pangenome graphs consist of nodes representing sequences and edges denoting adjacencies between them. They are ideal for detailed analysis of highly similar input genomes, such as human individuals. On the other hand, genebased approaches (distinguishing core genes, dispensable genes and strainspecific genes [4]) are more suitable for pangenomes of distantly related organisms with less conserved sequence content. This paper focuses on sequencebased pangenomes.
The emergence of pangenome graphs has enabled various functionalities [2]. Existing bioinformatics analyses relying on a reference genome are often biased towards the specific choice of reference [5,6,7,8,9]. Since pangenome graphs can mitigate this reference bias, the Computational PanGenomics Consortium proposes the following design goal: “Comparisons of short and long sequences (e.g. reads) with the pangenome ideally results in the corresponding location and the best matching individual genome(s)” [2]. Pangenome graphs also facilitate knowledge extraction through topological analysis [3], revealing (the degree of) similarity between the input genomes, the presence of (structural) variation, conserved regions, etc. Visualization of the graph enables the investigation of these features, which is why “all information within the data structure should be easily accessible for human eyes by visualization support on different scales” [2].
Stateoftheart pangenome representations
The most straightforward approach for storing a pangenome is creating a linear fulltext index of the concatenated genomes. This approach offers advantages such as efficient storage and alignment using stateoftheart linear aligners like BWAMEM [10] and Bowtie 2 [11] (both based on the FMindex [12]), while preserving linkage disequilibrium during alignment. However, downsides include the lack of insight into the pangenome’s characteristics and index growth proportional to the sequencecontent in the pangenome (although recent developments and implementations regarding the rindex [13] might alleviate this issue).
A second prominent form of pangenome representation is a variation graph, obtained by augmenting a linear reference genome with known variation in the population. We distinguish acyclic variation graphs and general variation graphs. Some tools support only acyclic variation graphs [14,15,16,17,18,19], lacking representation of complex variations like copy number variations, inversions, and translocations. In contrast, the most popular sequencetograph aligners [20,21,22,23] handle general variation graphs. Variation graphs can spaceefficiently incorporate variation across many individuals and enable the exploration of the graph topology through visualization. However, they depend on the reference genome that serves as the backbone of the graph, sequencetograph alignment is complex [24], and chimeric alignments can occur when isolated variations are added to the graph without preserving linkage disequilibrium. Giraffe [21] mitigates the latter by including haplotype information.
A third pangenome representation is the de Bruijn graph (dBG), consisting of nodes representing each distinct kmer in the pangenome (i.e., the collection of all complete genomes) and edges connecting corresponding nodes for each (k + 1)mer. Linear chains of nodes are often merged to create a compacted dBG (cdBG) with a more interpretable topology, where nodes represent unitigs and edges indicate divergence [3, 25]. Colored cdBGs (ccdBGs) assign colors to nodes and edges based on the underlying strains in which they occur [26]. Several tools construct assembly (cc)dBGs, and some can also perform (pseudo)alignment to them [27,28,29,30,31,32,33,34,35]. However, since assembly dBGs are created from a set of input reads, there is no functionality to maintain the connection between the graph (nodes) and (the coordinates of) the underlying input sequences. Therefore, these data structures and algorithms are not suitable for our problem, and vice versa. Tools that align reads to pangenome dBGs and can link graph nodes back to genome coordinates are relatively scarce. Examples include deBGA [36] for dBGs and PuffAligner [37] for ccdBGs (based on the Pufferfish index [38]). However, these tools only report coordinates without providing alignment information inside the graph (i.e., node paths), and lack support for visualizing regions of interest within the graph.
Beller and Ohlebusch [39] recently proposed a memoryefficient, implicit representation of a ccdBG, built upon the unidirectional FMindex of the underlying sequences. The graph edges are not explicitly stored; instead, the FMindex and a few additional arrays enable graph navigation. The FMindex also allows for pattern matching against the graph. However, the current implementation is limited to exact pattern matching, while approximate pattern matching (APM) is more relevant for bioinformatics applications due to sequencing errors and genetic variation. Also, only backward traversal of the graph is supported due to the underlying unidirectional FMindex, restricting visualization to asymmetric subgraphs (i.e., only the upstream neighborhood of the node(s) of interest) [40]. Finally, identifying a node containing a specific kmer is an O(n) operation (with n the size of the pangenome), which can be slow in practice. This paper aims to address these limitations.
Contributions
Inspired by the work of Beller and Ohlebusch, we propose a memoryefficient, colored, compacted de Bruijn Graph (ccdBG) representation that is built upon the bidirectional FMindex [41]. Specifically, we make the following contributions:

(i)
Leveraging the bidirectional FMindex, our graph representation supports bidirectional (i.e., forward and backward) navigation of the graph in O(1) time per step. Implementing this functionality in implicit graph representations is nontrivial. Additionally, we present an algorithm for visualizing a region of interest with its complete neighborhood, generating symmetric subgraphs.

(ii)
Our graph representation is built upon the bidirectional FMindex in a modular manner, allowing seamless integration of advancements for the bidirectional FMindex into our pangenome graph. We demonstrate this by applying search schemes [42] to enable efficient lossless approximate pattern matching against our pangenome graph under the edit distance metric (allowing substitutions and indels). Search schemes are a class of sequence alignment algorithms that, using a bidirectional fulltext index, prioritize quick elimination of unsuccessful search branches to minimize runtime. Their excellent performance has been demonstrated for linear reference genomes [42,43,44,45]. Unlike lossy heuristics (often relying on the seedandextend paradigm), search schemes are lossless: they guarantee to retrieve all occurrences within a specified error distance. As pangenome graphs can comprise hundreds of similar sequences (e.g., closely related bacterial strains), lossless algorithms that efficiently report all occurrences appear particularly attractive. As outputs, we report occurrences both as walks in the graph and as coordinates within the underlying sequences.

(iii)
We introduce checkpoint kmers to reduce the time complexity to identify the graph node corresponding to a given kmer from O(n) to O(1) (with n the size of the pangenome). In practice, this results in a significant speedup, with the node path identification step being up to 3 times faster. This improvement comes at a minimal additional memory cost.
This paper is organized as follows. We first describe the data structure with its support for graph navigation in constanttime, subgraph visualization, and efficient lossless approximate pattern matching using search schemes. In the results section, we demonstrate the functionalities and performance of our tool. We show that the graph representation requires far less memory than the underlying bidirectional FMindex. We analyze the performance of our approximate pattern matching implementation, comparing it with other tools and exploring the impact of the checkpoint sparseness factor. We present a case study on a Mycobacterium tuberculosis pangenome to illustrate the extraction of information from the graph topology.
Methods
Preliminaries
Zerobased indexing is used for strings and arrays. Consider a text T of length \(n=T\) over alphabet \(\Sigma\). In a pangenome context, T is the concatenation of multiple DNA sequences, separated by ‘%’ characters. We denote the number of sequences in T by S. The sentinel character ‘$’, a unique character lexicographically smaller than any other character in \(\Sigma\), is appended to T. Character ‘%’ is the lexicographically second smallest character in \(\Sigma\). Characters ‘%’ and ‘$’ are referred to as separation characters. A substring of a string T is denoted by a halfopen interval T[i, j[, with \(0\le i\le j \le n\). The ith suffix of T, denoted as \(T_i\), is the substring T[i, n[. Analogously, substring T[0, i[ is the ith prefix of T.
A de Bruijn graph (dBG) G(V, E) [46] is a directed graph where the nodes are all kmers (i.e., klength substrings) present in T. We omit kmers that contain a separation character (‘%’, ‘$’) in any but their last position. A directed edge connects two nodes u and v when a \((k+1)\)mer exists in T for which the first k nucleotides coincide with u and the last k nucleotides coincide with v. If multiple such \((k+1)\)mers exist in T, we draw the corresponding number of edges between nodes u and v. In other words, G(V, E) is a multigraph. Note that we do not create a bidirected genome graph, i.e., a kmer and its reverse complement are not represented by the same node. A compacted de Bruijn graph (cdBG) is obtained by maximally contracting all pairs of connected nodes u and v for which v is the sole successor of u and, vice versa, u is the sole predecessor of v. Nodes of a cdBG thus represent substrings of T of length \(\ge k\), referred to as unitigs. A colored compacted de Bruijn graph (ccdBG) retains the origin strain of each edge by assigning it a color.
Throughout this paper, we illustrate the data structures and algorithms using T = “CTATGTC%ATATGTTGGTC$” as a small example pangenome with \(S=2\) sequences. Figure 1 shows this example’s ccdBG (\(k=3\)).
Bidirectional graph data structure
Bidirectional FMindex
Our implicit representation of the ccdBG G(V, E) is built upon the bidirectional FMindex of T. Readers less familiar with the bidirectional FMindex are referred to the supplementary material for a brief overview. Table 2 illustrates for our example text T, the corresponding suffix array SA [47], BurrowsWheeler transform BWT [48], LF mapping and sorted suffixes. Similarly, Table 4 shows the reverse text \(T^r\), its suffix array \(\textrm{SA}^r\), BurrowsWheeler transform \(\textrm{BWT}^r\), LF mapping and sorted suffixes. Note that all variables related to the reverse part of the bidirectional FMindex are denoted with a superscript r. Bit vectors B and B^{r} will be explained later. Exact occurrences of a search pattern P in T are represented in the bidirectional FMindex by two intervals: an interval [b, e[ over SA and an interval \([b^r, e^r[\) over \(\textrm{SA}^r\), such that all suffixes \(T_{\textrm{SA}[i]}\) for \(b \le i < e\) have P as their prefix while suffixes \(T^r_{\textrm{SA}^r[i]}\) for \(b^r \le i < e^r\) are prefixed by \(P^r\), the reverse of P. For example, for search pattern P = “ATG”, \(\textrm{SA}[3,5[\) refers to the suffixes of T prefixed by P, while \(\textrm{SA}^r[9,11[\) refers to suffixes of \(T^r\) prefixed by \(P^r\) = “GTA”. Patterns are matched character by character: given a pattern P and its intervals [b, e[ and \([b^r,e^r[\), the intervals \([b',e'[\) and \([b{^r}',e{^r}'[\) of the extended pattern cP (extendBackward) or Pc (extendForward) can be found in \(\textrm{O}(1)\) time [49]. In other words, the key functionality of a bidirectional FMindex entails that a partial match can be extended with a character either to the left or to the right.
Collection of graph nodes
The data corresponding to the nodes of the ccdBG is stored in a vector G of length V, with V the number of nodes (see Table 1). Each node is assigned a unique identifier \(id \in \{0,\ldots ,V1\}\). This way, the node with identifier \(id\) can be accessed at \(\textrm{G}[ id ]\). Each node represents a substring \(\omega\) of T. This substring is not explicitly stored in G. Every node has four attributes: \(len\), \(mult\), \(left\_kmer\), and \(right\_kmer^r\). Here, \(len\) denotes the length of \(\omega\), while \(left\_kmer\) is the left boundary of the SA interval that corresponds to \(\omega\). Consequently, \(\omega\) can be deduced from the node attributes as \(T[\textrm{SA}[ left\_kmer ],\textrm{SA}[ left\_kmer ]+ len [\). Due to the characteristics of a cdBG, \(left\_kmer\) is also the left boundary of the suffix array interval that corresponds to the leftmost kmer of \(\omega\). The \(mult\) attribute corresponds to the multiplicity of the node, which is the number of times \(\omega\) occurs in T. Hence, \(mult\) is also the size of \(\omega\)’s suffix array interval: \(\textrm{SA}[ left\_kmer , left\_kmer + mult [\). Analogously, \(right\_kmer^r\) represents for the reverse of the rightmost kmer of a node, the left boundary of its interval in the reverse suffix array. Consequently, the reverse suffix array interval of the reverse rightmost kmer of the node can be found as \(\textrm{SA}^r[ right\_kmer^r , right\_kmer^r + mult [\). For example, consider the node with \(id = 4\) in Table 1. Its leftmost kmer, “GTT”, has its left boundary of the SA interval at index 11 (see Table 2). Similarly, the interval of the reverse rightmost kmer “TGG” in \(\textrm{SA}^r\) starts at index 16 (see Table 4).
The end nodes form an exception to the rules defined above: their rightmost kmer is obtained from the (cyclic) extension of \(\omega\) by the next \(k1\) characters in T (e.g., “$CT” for node 5 and “%AT” for node 6 in Fig. 1). Additionally, each of the S sequences in T gets a distinct end node in vector G, even if they correspond to the same string \(\omega\). For more detailed information, the reader is referred to [39].
Auxiliary bit vectors and tables
The bidirectional FMindex (Tables 2 and 4) is supplemented with two auxiliary bit vectors \(\textrm{B}\) and \(\textrm{B}^r\).
\(\textrm{B}[i] = 1\) if the following two conditions apply:

1
kmer \(T[\textrm{SA}[i], \textrm{SA}[i]+k[\) is the rightmost kmer of a node.

2
Suffix \(T_{\textrm{SA}[i]}\) is the lexicographically largest suffix of T that has kmer \(T[\textrm{SA}[i], \textrm{SA}[i]+k[\) as a prefix.
For example, “TGT” is the rightmost kmer of node 1 and is indicated by a 1bit in \(\textrm{B}\) at index 18 (Table 2). Again, for the end nodes, the rightmost kmer is defined differently and each of the S distinct end nodes is indicated in \(\textrm{B}\), even if they correspond to the same string \(\omega\). Hence, the S first bits in \(\textrm{B}\) are set to 1 for the end nodes.
Analogously, \(\textrm{B}^r = 1\) if the following two conditions apply:

1
kmer \(T^r[\textrm{SA}^r[i], \textrm{SA}^r[i]+k[\) is the reverse of the leftmost kmer of a node.

2
Suffix \(T^r_{\textrm{SA}^r[i]}\) is the lexicographically largest suffix of \(T^r\) that has kmer \(T^r[\textrm{SA}^r[i], \textrm{SA}^r[i]+k[\) as a prefix.
For example, “TTG” is the reverse of the leftmost kmer of node 4 and is indicated by a 1bit in \(\textrm{B}^r\) at index 19 (Table 4).
Note that there are as many 1bits in \(\textrm{B}\) and \(\textrm{B}^r\) as there are nodes in the graph. We will use bit vectors \(\textrm{B}\) and \(\textrm{B}^r\) to obtain node identifiers that correspond to a certain kmer using rank operations. Because the 1bits in \(\textrm{B}\) and \(\textrm{B}^r\) are ordered differently, we store two node identifier mappings, \(\textrm{IDmap}\) and \(\textrm{IDmap}^r\) (see Tables 3 and 5), which transform the rank extracted from \(\textrm{B}\) and \(\textrm{B}^r\) respectively, to the effective node identifier. Note that the nodes in vector \(\textrm{G}\) can be ordered arbitrarily, as long as \(\textrm{IDmap}\) and \(\textrm{IDmap}^r\) are adjusted accordingly. Here, we choose to put the S end nodes at the end of vector \(\textrm{G}\). This way, it can be easily assessed if a certain node identifier corresponds to an end node or not.
Building the data structure
The construction process of the underlying bidirectional FMindex is based on the implementation of Columba [45, 50]. The construction of components \(\textrm{G}\) and \(\textrm{B}\) is similar to the algorithms described in [39]. Finally, the construction of components \(\textrm{B}^r\), \(\textrm{IDmap}\) and \(\textrm{IDmap}^r\) is a new contribution. A description of these algorithms would be quite lengthy and technical and is therefore omitted from this paper.
Elementary graph operations
To support more complex graph operations (e.g., subgraph visualization and approximate pattern matching), we need a set of building blocks that aid in navigating the graph. We introduce three elementary graph operations:

1
Determining the node identifier given the suffix array interval of an extreme kmer (i.e., the left or rightmost kmer) of that node.

2
Computing the identifier of the predecessor (resp. successor) of a given node by prepending (resp. appending) a character c to its substring \(\omega\).

3
Obtaining the identifier of the predecessor of a node by following a specific edge, i.e., by extending a specific occurrence of \(\omega\) in T.
Determining the node identifier for an extreme kmer
Determining a node’s identifier based on its rightmost kmer is important in the following scenario. If a partial match in the graph is extended with a character to the left and, as a consequence, a new node is visited, its rightmost kmer is encountered first. For this partial match, the corresponding intervals [b, e[ and \([b^r, e^r[\) over \(\textrm{SA}\) and \(\textrm{SA}^r\) respectively, are kept track of by the bidirectional FMindex. All suffixes in interval [b, e[ then start with the rightmost kmer of the new node. This node identifier can then be retrieved using function findIDRight (see Algorithm 1). Value \(id_{\textrm{B}}\) is obtained by a rank operation on bit vector \(\textrm{B}\) at index b that returns the total number of 1bits in \(\textrm{B}[0, b[\) (i.e., before index b). Next, array \(\textrm{IDmap}\) maps value \(id_{\textrm{B}}\) to the actual node identifier id which can then be used to access vector \(\textrm{G}\). Assuming constanttime rank support on bit vectors [51], function findIDRight runs in \(\textrm{O}(1)\) time. For example, (rightmost) kmer “GTC” with interval \(\textrm{SA}[9,11[\) yields \(id_{\textrm{B}} = 5\) (Table 2). Node identifier \(id=0\) can be found at index 5 in \(\textrm{IDmap}\) (Table 3).
Analogously, \(\textrm{B}^r\) plays an important role when matching in the forward direction, as it stores information about the leftmost kmer of each node. When extending a partial match with a character to the right and a new node is visited as a consequence, its node identifier can be found using function findIDLeft in \(\textrm{O}(1)\) time. For example, (reverse leftmost) kmer “TAT” with interval \(\textrm{SA}^r[13,15[\) yields \(id_{\textrm{B}^r} = 5\) (Table 4). Node identifier \(id =1\) can be found at index 5 in \(\textrm{IDmap}^r\) (Table 5).
Note that for functions findIDRight and findIDLeft, it is not mandatory that the input intervals contain all suffixes that are prefixed by the kmer of interest. In fact, the rank operation on line 2 can be called using any index in \(\textrm{SA}\) (resp. \(\textrm{SA}^r\)), corresponding to the kmer of interest.
Jumping to a neighbor with a character
Given a node identifier \(id\) and a character c, function getPredIDWithChar computes the identifier of the predecessor node that is encountered by prepending c to substring \(\omega\) of node \(id\) (see Algorithm 2). On line 3, the suffix array interval [b, e[ contains all suffixes of T whose klength prefix equals the leftmost kmer of node \(id\), i.e., \(\omega [0,k[\). On line 4, the suffix array interval \([b', e'[\) is computed for the \((k+1)\)mer \(c\omega [0,k[\) using basic functionality offered by the bidirectional FMindex. If this interval is nonempty (i.e., \(c\omega [0,k[\) occurs in T), the identifier of the predecessor node is determined using function findIDRight (see Algorithm 1). Otherwise, the return value of \(1\) indicates that no such predecessor node exists. This routine can be called for all characters \(c \in \Sigma\) to identify all predecessor nodes. Analogously, function getSuccIDWithChar illustrates how to find the successor node identifier during forward matching. Both functions execute in \(\textrm{O}(1)\) time.
Jumping to a predecessor through a specific edge
Recall that the ccdBG \(\textrm{G}(V,E)\) is a multigraph, i.e., there can be multiple edges between nodes u and v. With the exception of start and end nodes, each node has mult incoming and mult outgoing edges where mult corresponds to the number of times its substring \(\omega\) occurs in T. Jumping to a predecessor through a specific edge is thus achieved by extending a specific occurrence of \(\omega\) in T back to the predecessor node. All occurrences of \(\omega\) in T of a node are represented in the interval \(\textrm{SA}[ left\_kmer , left\_kmer + mult [\). A specific occurrence of \(\omega\) in T is indicated by a relative offset \(edgeOffset \in [0, mult[\) in this interval.
Algorithm 3 shows how the predecessor is found in this scenario. On line 2, value i is the \(\textrm{SA}\) index such that the specific occurrence of \(\omega\) starts at \(T[\textrm{SA}[i]]\). The LF operation provided by the (bidirectional) FMindex computes the \(\textrm{SA}\) index j such that \(\textrm{SA}[j] = \textrm{SA}[i]1\). Suffix \(T_{\textrm{SA}[j]}\) thus has \(c\omega\) as a prefix and we know that the klength prefix of that suffix is the rightmost kmer of the predecessor node of interest. Its identifier is found using the \(\texttt {findIDRight}\) function from Algorithm 1. Assuming constanttime rank support on bit vectors, algorithm 3 runs in \(\textrm{O}(1)\) time.
In the context of this paper, only jumping to a predecessor through a specific edge is required. Therefore, we omit its bidirectional counterpart in this section.
Visualization
Using the elementary graph operations discussed before, the visualization of subgraphs of the pangenome graph is achieved as follows. Given a set of seed nodes (\(seedNodes\)) of interest and a userdefined neighborhood size (\(maxDepth\)), Algorithm 4 generates a list of all nodes u for which \(\textrm{distance}(u,v) \le maxDepth\) for some node \(v \in seedNodes\). Here, \(\textrm{distance}(u,v)\) is defined as the number of edges on the shortest path between u and v, irrespective of the orientation of edges. The time complexity of Algorithm 4 is \(\textrm{O}(V_s\Sigma )\), with \(V_s\) the number of nodes in the subgraph. It relies on the functions described in Algorithm 2. Similarly, Algorithm 5 lists all edges that are part of the subgraph, using the functionality provided by Algorithm 3. The subgraph can be visualized in e.g. Cytoscape [52].
The visualization methodology proposed here differs from that of Dede and Ohlebusch [40] in that sense that their algorithms enable the visualization of only the upstream neighborhood of the region of interest, due to the fact that their use of the unidirectional FMindex supports only backward traversing of the graph. In contrast, our use of the bidirectional FMindex enables backward as well as forward traversing of the graph. Secondly, we separate the processes of matching patterns to the graph (see further) and visualizing subgraphs. Therefore, we can offer a very efficient pattern matching implementation, since we assume that users will mostly map great numbers of patterns to the graph, only few of which are interesting enough to be visualized. In Dede and Ohlebusch’s algorithms on the other hand, these processes are connected.
Approximate pattern matching to the graph
In earlier work, Beller and Ohlebusch provided algorithms for exact pattern matching against the ccdBG. In this paper, we extend this functionality to also support approximate pattern matching, i.e., the identification of all approximate occurrences of a search pattern, allowing for substitutions, insertions or deletions. Formally, given a search pattern P, our implementation exhaustively identifies all occurrences O of search pattern P in T such that the edit distance \(\textrm{ED}(O, P) \le K\). We support values \(K = {0,1,2,3 \text { or } 4}\). Since the upper limit for the number of allowed errors is 4, our algorithms are most suited to identify occurrences of short, lowerror (e.g., Illumina) reads or short seeds of long, highererror (e.g., Pacific Biosciences, Oxford Nanopore Technologies) reads. The bidirectional FMindex and search schemes can be used to support lossless approximate pattern matching.
Once the occurrences O of a search pattern P have been identified, they can be located in the graph. Each approximate occurrence O of P is an exact substring of T. If the length of the occurrence O is at least the kmer size, i.e., \(O \ge k\), then O aligns to either a single node, or a unique sequence of connected nodes in the ccdBG. Otherwise, if \(O < k\), then O can occur at multiple positions in the graph when O is a substring of multiple kmers. For both cases, we provide algorithms in this section. Once a node path has been identified, a subgraph centered around this path can be extracted for visualization, using the functionality discussed earlier.
Search schemes
In lossless approximate pattern matching, all occurrences in search text T, within a certain Hamming or Levenshtein/edit distance of a query pattern P, are identified. In the context of this paper, the edit distance is used, allowing for up to K substitutions, insertions or deletions (collectively called errors). Using the FMindex, lossless approximate pattern matching is performed by spelling, character by character, candidate occurrences of P in T. Using a naive backtracking algorithm, an excessive number of unsuccessful branches near the dense root of the search tree will be explored, rendering backtracking computationally impractical even for modest values of K [44].
Kucherov et al. [42] proposed the concept of search schemes, which define how lossless approximate pattern matching should be conducted, such that the search space is strongly reduced. We adopt their notation. Pattern P is partitioned into p parts \(P_i\), with \(i \in \{0,...,p1\}\). A search \({\mathcal {S}} = (\pi , L, U)\) is a triplet of arrays of size p where \(\pi\) is a permutation over \(\{0, ..., p1\}\) that defines the order in which the parts of P are processed. It must satisfy the connectivity property in that sense that a partial match can only be extended, either to the left or to the right, in a contiguous manner. The arrays L and U define the lower and upper bound to the cumulative number of allowed errors after each part has been processed. The core idea is to only gradually increase the number of allowed errors when more parts of P are matched, significantly reducing the search space near the dense root of the search tree. To cover all possible error distributions over the length of a pattern, multiple searches are required that collectively form a search scheme. Search schemes require bidirectional matching functionality, i.e., a partial match P can be extended to \(cP\) as well as \(Pc\). This way, a pattern can be matched by starting with any part of P and then extending that partial match with adjacent parts, either to the left or to the right, in arbitrary order.
The simplest examples of search schemes are those based on the pigeonhole principle [41]. By partitioning search pattern P into \(p = K+1\) parts, with K the maximum allowed number of errors, it immediately follows that for each occurrence of P in T, at least one part must be errorfree. All occurrences are identified using \(K+1\) searches \({\mathcal {S}}_i\). In search \({\mathcal {S}}_i\), exact matching of piece \(P_i\) is performed first, and subsequently extended with the remaining pieces to the left and right, allowing up to K errors. For example, for \(K = 2\) errors, the search scheme based on the pigeonhole principle is given by \({\mathcal {S}}_0 = (012, 000, 022)\), \({\mathcal {S}}_1 = (210, 000, 022)\) and \({\mathcal {S}}_2 = (102, 000, 022)\). Search \({\mathcal {S}}_2\) for example, starts with the exact matching of the middle piece \(P_1\). Next, the match is extended to the left, and finally to the right, each allowing up to \(K=2\) errors. This illustrates the need for a bidirectional index.
Kucherov et al. proposed more efficient search schemes. Again, for the case of \(K = 2\) errors, pattern P is partitioned into \(K+1\) parts and the search procedure consists of three searches that are shown in Fig. 2. The key difference is that the search schemes by Kucherov et al. impose more stringent lower and upper bounds than those based on the pigeonhole principle, while still covering any distribution of errors over the different parts. In general, for larger values of K, search schemes can become quite complex to design and deviate significantly from the search schemes based on the pigeonhole principle. Kucherov et al. and Kianfar et al. [43] proposed search schemes for up to \(K=4\) errors. The implementation of search schemes in Columba serves as a foundation for the work in this paper.
Identifying an occurrence in the graph
Search schemes allow to efficiently identify all occurrences O of search pattern P in T such that the edit distance \(\textrm{ED}(O, P) \le K\), with K the maximum number of allowed substitutions and indels. Each occurrence O is represented by its suffix array intervals [b, e[ and \([b^r,e^r[\) and its length \(l = O\) such that \(O = T[\textrm{SA}[b], \textrm{SA}[b] + l[\). Analogously, the reverse occurrence \(O^r\) is found as \(O^r = T^r[\textrm{SA}^r[b^r], \textrm{SA}^r[b^r] + l[\). In other words, each approximate occurrence O of P is an exact substring of T.
In this section, we provide algorithms to identify the location in the graph that corresponds to O. If \(O \ge k\), with k the kmer size, O has a unique location in the ccdBG that can be represented by a sequence of connected nodes, along with a starting position in the first node. We consider the case \(O < k\) later and assume for now that \(O \ge k\).
The process involves two steps. First, we determine the node identifier for the leftmost kmer of O. Next, we identify the node identifiers for the remaining part of O.

Step 1: Determining the First Node Identifier
In general terms, identifying the node that contains the kmer involves shifting, character by character, a klength window through the node until a kmer is found that can be used to identify the node. Until now, only the extreme (i.e., left or rightmost) kmers of a node could be used to obtain the node identifier (cf. Algorithm 1). However, the required number of shift operations can grow very large for long nodes and can even be \(\textrm{O}(n)\) (with \(n = T\)) for large values of k. Therefore, we adapt bit vector \(\textrm{B}\) and \(\textrm{IDmap}\) such that determining the node identifier for an arbitrary kmer can be achieved in constant time. Specifically, \(\textrm{B}[i] = 1\) if the following two conditions apply:

1
kmer \(T[\textrm{SA}[i], \textrm{SA}[i]+k[\) has offset \((j \cdot s_{ cp })\) (for \(j = 0, 1, 2, \ldots )\) in a node or is the rightmost kmer of a node.

2
Suffix \(T_{\textrm{SA}[i]}\) is the lexicographically largest suffix of T that has kmer \(T[\textrm{SA}[i], \textrm{SA}[i]+k[\) as a prefix.
In other words, besides the rightmost kmer of each node, we also indicate every \(s_{ cp }\)th kmer of a node. We refer to these extra kmers as ‘checkpoint kmers’. Their density is controlled by the userdefined checkpoint sparseness factor \(s_{ cp }\). For the example in Fig. 1, and assuming a checkpoint sparseness factor of \(s_{ cp } = 2\), kmers “TAT” for node 1 and “GTT” and “TGG” for node 4 serve as checkpoint kmers. Hence, three extra 1bits in \(\textrm{B}\) need to be set at indexes 13, 11 and 16 (see Table 2, in parentheses).
For each checkpoint kmer, and hence, each additional 1bit in bit vector \(\textrm{B}\), a corresponding entry that points to the node identifier must be added to \(\textrm{IDmap}\). Note that due to the checkpoint kmers, the relationship between the 1bits in \(\textrm{B}\) and their corresponding nodes is now surjective, since multiple 1bits are set in \(\textrm{B}\) for nodes with \(len >k\). We also add an extra row (\(offset\)) to \(\textrm{IDmap}\) to identify the offset position of a kmer within a node. As a consequence, \(offset\) equals \(len k\) when the entry corresponds to a rightmost kmer (the end nodes must again be extended cyclically), or \(j \cdot s_ cp\) for each jth checkpoint kmer. The extended \(\textrm{IDmap}\) table corresponding to the example from Table 2 is illustrated in Table 6. Note that these modifications to bit vector B and \(\textrm{IDmap}\) do not break the functionality of Algorithm 1.
Given a kmer, we use the basic functionality of the bidirectional FMindex to find its interval \(\textrm{SA}[b,e[\) and \(\textrm{SA}^r[b^r,e^r[\). We assume that the kmer exists in T, i.e., these intervals are nonempty. Algorithm 6 shows how to retrieve the node identifier given the SA interval [b, e[ of the kmer. On line 3, we consider the index i of the lexicographically largest suffix that has the kmer of interest as a prefix. On lines 4 to 6, we consider the adjacent kmers within the node, by advance to the left, character by character. More precisely, the LF operation returns the lexicographically largest index of the suffix prefixed by such an adjacent kmer. This process continues until an index is encountered that is indicated by a 1bit in bit vector B. In that case, the identifier and offset are retrieved on lines 7 to 8 in a similar manner as in Algorithm 1. By keeping track of the number of times the LF operation was used, the positional offset of the kmer is easily computed on line 9.
For example, consider kmer “TTG” with \(\textrm{SA}\) interval [19, 20[ (see Table 2). Assume \(s_{ cp }=2\). Because \(\textrm{B}[19] = 0\), “TTG” is not the rightmost or a checkpoint kmer of its node. Using the LF operation, we shift the klength window one character to the left: \(\textrm{LF}[19]\) yields index 11. Suffix \(T_{\textrm{SA}[11]}\) is indeed prefixed by kmer “GTT”. Because \(\textrm{B}[11] = 1\) when \(s_{ cp }=2\) (indeed, “GTT” is a checkpoint kmer), we obtain \(id = 4\) (see Table 6).
Note that the information on these checkpoint kmers is only stored with respect to \(\textrm{SA}\) (not \(\textrm{SA}^r\)): both the \(\textrm{SA}\) range and the \(\textrm{SA}^r\) range will always be available when we want to identify the node corresponding to an arbitrary kmer (pattern matching to the bidirectional FMindex keeps track of both ranges in a synchronized manner).
In summary, at most \(s_{ cp }1\) LF operations are needed to find a kmer that can be used to identify its node. Because the LF operation requires \(\textrm{O}(1)\) time, the time complexity of Algorithm 6 is \(O(s_{ cp })\). The userdefined parameter \(s_{ cp }\) hence controls the timespace tradeoff.

Step 2: Extending the Node Path
The sequence of nodes with which O aligns can now be easily identified as shown in Algorithm 7. In lines 2 to 5, the node identifier and start position of the leftmost kmer O[0, k[ is found using the \(\texttt {findID}\) function. Next, the nodes to which O[k, O[ aligns are identified in lines 6 to 9. Note that because we know that O is an exact substring of T, it is not necessary to match O character by character to the graph. Rather, for each visited node, one can immediately jump to the end of that node and use the getSuccIDWithChar function from Algorithm 2 to find the next node, etc.
The computation of the suffix array interval [b, e[ of O[0, k[ on line 2 of Algorithm 7 can be avoided. Recall that each occurrence O is generated character by character using search schemes. Therefore, it suffices to save the suffix array interval when the (partially generated) occurrence O reaches a length of k. Note that this interval does not necessarily correspond to the leftmost kmer of O, as O can still be extended to the left and the right during the search scheme procedure. Nevertheless, it is easy to adapt Algorithm 7 such that one can start from any kmer of O and then extend the path both to the left and right, using the bidirectional functionality offered by the data structure.
Finally, we consider the case \(O < k\). This means that O could be found in multiple locations in the graph. In order to enumerate all locations, it suffices to enumerate, using the bidirectional FMindex, all possible klength extensions of O that exist in T, and to identify the corresponding node for each such extension using the FindID function. This procedure can lead to redundant results, which can be filtered afterwards.
Results and discussion
We implemented the algorithms of this paper in Nexus, an opensource tool written in standard C++14. The source code is available at https://github.com/biointec/nexus under AGPL3.0 license.
Data and hardware
We built pangenomes of up to ten human genome builds also used in [40]: (i) five different assemblies of the human reference genome (UCSC Genome Browser assembly IDs: hg16, hg17, hg18, hg19, and hg38), (ii) the maternal and paternal haplotype of individual NA12878 (Utah female) of the 1000 Genomes Project [53], and (iii) three long read (PacBio) assemblies (GenBank assembly accession numbers: GCA_000001405.27, GCA_000002125.2 and GCA_000306695.2). All occurrences of ‘N’ were replaced by a randomly chosen nucleotide (‘A’, ‘C’, ‘G’ or ‘T’) to limit the alphabet size. The chromosomes within each build are concatenated into one string.
For benchmarking, we consider 100 000 Illumina HiSeq 2000 reads (101 bp) randomly sampled from a larger whole genome sequencing dataset (accession number ERR194147). All benchmark experiments were run on a Red Hat Enterprise Linux 8 system, using a single core of two 18core Intel^{®} Xeon^{®} Gold 6240 CPUs running at a base clock frequency of 2.60 GHz with 738 GiB of RAM. Reported runtimes include the time for the approximate pattern matching procedure, but exclude the time to read the FMindex and graph data structures from disk.
We also conduct a case study on a pangenome of 341 M. tuberculosis strains. Analogous to what was done in [54], we selected one reference strain of H37Rv (GenBank accession number CP003248.2), the assemblies of three historical isolates collected from KwaZuluNatal [55, 56] (KZN4207, accession GCA_000669655.1; KZN1435, accession GCA_000669675.1; KZN605, accession GCA_000669635.1) and the assemblies of 337 clinical isolates, also collected from KwaZuluNatal [54] (subset of BioProjects PRJNA183624 and PRJNA235618).
Memory usage
Storing and using the data structure
Recall that we build our implicit pangenome graph representation directly on top of the bidirectional FMindex as implemented in Columba. This additional graph representation, along with navigation functionality, comes at only a limited supplementary memory cost. Table 7 details the memory usage of the components of the bidirectional implicit representation of the ccdBG for a pangenome of 10 human genomes, with \(s_{\textrm{SA}} = 16\), \(s_{ cp } = 128\) and \(k=25\). The suffix array sparseness factor \(s_{\textrm{SA}}\) is inversely proportional with the number of suffix array entries that are stored. This pangenome consists of 30 340 521 923 characters, 66 102 955 graph nodes and 4 166 716 509 graph edges (not explicitly stored). The complete representation comprises 95.46 GiB, or approximately 27.03 bits per character.
The five components corresponding to the representation of the graph are stored as follows. Node vector \(\textrm{G}\) stores attributes \(len\) (32 bits), \(mult\) (32 bits), \(left\_kmer\) (64 bits), and \(right\_kmer^r\) (64 bits) for each node. Bit vectors \(\textrm{B}\) and \(\textrm{B}^r\) support constanttime rank operations using the rank9 algorithm [51], i.e., 1.25 bits per character (25% overhead). Mapping \(\textrm{IDmap}\) stores the node identifier (32 bits) and offset (32 bits) for all V rightmost kmers and all \(V_{cp}\) checkpoint kmers. In the example pangenome of 10 human genomes (\(s_{ cp } = 128\) and \(k=25\)), 70 927 010 checkpoint kmers are stored. Finally, \(\textrm{IDmap}^r\) stores node identifiers (32 bits) for only the V leftmost kmers. Note that unlike the bidirectional FMindex, these five components depend on the value of k.
In conclusion, the memory usage of the components corresponding to this pangenome graph comprises less than 15% of the underlying bidirectional FMindex. This overhead is limited given the functionality that is provided to navigate and visualize the pangenome graph. By building upon the underlying (bidirectional) FMindex in a complementary and modular way, future developments on index structures can likely be incorporated easily. The application of search schemes to the graph demonstrates this principle. However, the drawback of the bidirectional FMindex is that its space usage increases linearly with the pangenome’s sequence content, limiting our current data structure to a few dozen human genomes. To address this, we plan to investigate the bidirectional rindex [57] as an alternative. The bidirectional rindex offers the same functionality as the bidirectional FMindex, but with sublinear index growth (i.e., proportional to the amount of new variation introduced by additional genomes incorporated into the pangenome).
State of the Art Table 8 compares the memory usage of Nexus with that of other linear or graph pangenome representations that can serve as a reference during read alignment. Both deBGA and Pufferfish represent the pangenome as a (cc)dBG and use a kmer hash table based data structure to index that (cc)dBG and label the unitigs with their corresponding occurrences in the input genomes. The memory usage of the index for deBGA and Nexus is quite similar, while Pufferfish is about 35% more spaceefficient. However, note that unlike deBGA and Pufferfish, Nexus also provides other functionalities (such as visualization) next to read alignment. In contrast to a kmer hash table, both the A4 algorithm by Beller and Ohlebusch and Nexus are based on a fulltext index of the concatenation of all input genomes. Algorithm A4 builds its index based on the unidirectional FMindex, whereas Nexus utilizes the bidirectional FMindex, clarifying the increase in memory usage. Note that Nexus’ memory use can be reduced using parameter \(s_{\textrm{SA}}\), see Additional file 1: Fig. S1. The indexes of BWA and Bowtie 2 are also based on the FMindex of the concatenation of all genomes in the pangenome. In this regard, these linear indexes are conceptually highly similar to the underlying data structure of the graphs in A4 and Nexus. This similarity is also reflected in their reported memory usage. Finally, the Giraffe index comprises 2 to 5 times more memory than any other index discussed here.
Building the data structure
For the building process of the bidirectional implicit representation of the ccdBG, we prioritize limiting RAM usage over optimizing performance, as we believe the RAM usage to be the main bottleneck when building such largescale graphs. The CPU and RAM usage of the building process depends on many factors:

The more (diverse) input data, the more CPU time and RAM is needed.

The smaller the suffix array sparseness factor \(s_{\textrm{SA}}\), the less CPU time and the more RAM is needed.

The lower parameter k, the more CPU time is needed.
The checkpoint sparseness factor \(s_{ cp }\) has a relatively small impact on the graph construction process.
We built the data structure for a pangenome of 10 human genomes with \(s_{\textrm{SA}} = 16\), \(s_{ cp } = 128\) and \(k=25\). The complete process took 15 hours and 44 minutes, of which 41% was required for building the underlying bidirectional FMindex and the remaining time was used for constructing the implicit graph representation. Most of the former time period is spent building the regular and reverse suffix arrays. Most of the latter duration is used for building the longest common prefix (LCP) array (which is necessary to build the graph representation). The peak RAM usage is 269.32 GiB, which is reached during suffix array construction (the complete suffix array must be built before it can be stored in sparse form).
State of the Art Table 8 reports the CPU time and RAM required to build the index for Nexus and the other tools we compare with. We observe that the results for Nexus are in the same ballpark as those for deBGA, Pufferfish and Bowtie 2. A4 leverages a semiexternal building process in order to limit the peak RAM usage, and it appears that this algorithm is also more efficient in terms of CPU usage. Also the BWA indexing process is more efficient than Nexus in terms of RAM usage. Building the variation graph using Giraffe was computationally more intensive than any of the other indexing processes, mainly in terms of RAM usage.
Approximate pattern matching performance
Breakdown of Nexus’ performance
Due to the underlying bidirectional FMindex, Nexus provides a very efficient implementation of lossless approximate pattern matching against the ccdBG. That is, every occurrence that matches the pattern of interest within a specified maximum edit distance is reported, along with its corresponding positions in the sequences of the pangenome. In Table 9, we analyze the runtimes for performing exact pattern matching using A4 and Nexus (\(K=0\)), and approximate pattern matching using Nexus (\(K=\{1,2,3,4\}\)). Note that A4 does not provide the option to match patterns approximately.
As can be observed in Table 9, Nexus performs pattern matching 4.8 times faster than A4. Moreover, whereas A4 reports only the node path in the graph for each occurrence, Nexus also reports the position(s) with respect to the original reference text. For Nexus, Table 9 provides a breakdown of total runtime as follows: approximate pattern matching against the underlying FMindex (i.e., finding the SA interval(s)), finding the node paths corresponding to each occurrence found in the former procedure, and performing postprocessing (i.e., finding all occurrences in the original reference text using suffix array accesses, extracting the corresponding pangenome sequence identifier, and filtering these text occurrences). Note that a single search pattern can have multiple occurrences in the graph (in case of approximate pattern matching, i.e., \(K > 0\)) and that each individual occurrence in the graph can have multiple underlying text occurrences (in case it is repeated within or between strains). From this breakdown, we conclude that the fraction of time spent on finding node paths is limited. For approximate pattern matching to the graph, extracting the node path corresponding to the occurrences only requires about 11% of the total runtime. In contrast, the postprocessing step (which is only present in sequencetograph aligners that report coordinates with respect to the underlying reference sequences) requires a substantial amount of time (40 to 50%).
Alignment sensitivity analysis
In Table 10, we compare the alignment results of Nexus (for different values of K) with other tools that support read alignment to the pangenome as a reference in some form. Note that unlike the other aligners, Nexus (and A4) is currently more of a proofofconcept implementation as it lacks support for certain features such as SAM output. Aligner deBGA is left out of the comparison for a few reasons. The authors state that deBGA is mainly focused on pairedend read alignment and may therefore not provide strong support for singleend read alignment. With a small alteration to deBGA’s code we got the singleend functionality running, but for reads with a large number of text occurrences, a segmentation fault occurs.
The first thing that stands out, is the difference in number of reported text occurrences between Nexus and any of the other tools, which highlights the difference in sensitivity between lossy and lossless aligners. The average number of text occurrences per read varies around 10 for all of the lossy aligners, with a maximum of 20.37 for BWAMEM. This is unsurprising since the pangenome consists of 10 genomes, and we expect some reads to appear at multiple positions due to duplications. From Nexus’ results however, we learn that these tools miss quite some alignments: even at \(K=0\) (i.e., exact alignment), Nexus identifies an average number of 28.38 text occurrences per read. This is because some reads have an extremely high number of text occurrences in the pangenome (up to 37 872 exact text occurrences). For such reads, only a limited subset of text occurrences (or none at all) is reported by the other tools. This trend continues for a higher number of allowed errors: the number of text occurrences increases exponentially, due to the highly abundant reads (up to 227 347 text occurrences for some reads at edit distance 4).
The average number of graph occurrences (i.e., node paths) per read shows a similar increasing trend, but not as pronounced. This was to be expected, as one graph occurrence can correspond to many text occurrences. A4 and Nexus (at \(K=0\)) produce identical results. None of the other tools report graph occurrences, which emphasizes a core difference with our approach: next to the text coordinates, Nexus reports the corresponding node paths that were identified in the graph, which can be used for visualization and downstream analysis.
From the fraction of aligned reads, we learn that the stateoftheart aligners are able to align slightly more reads than Nexus. This is a consequence of the hard limit of edit distance 4, which is imposed by Nexus as of now, while the other aligners are able to detect occurrences that have 5 (or more) errors. To address this limitation, Nexus can easily be extended to support search schemes for edit distance 5 (or higher). Moreover, we intend to develop an implementation for gapped alignment as well, to support the detection of longer indels.
InDepth Pairwise Sensitivity Comparisons Upon performing an indepth pairwise analysis of the aligned reads by Nexus (\(K=4\)) versus graph aligners PuffAligner or Giraffe (Additional file 1: Fig. S2), two distinct subsets of reads were each time identified, which could only be aligned by one of the tools. These subsets characterize the different search spaces of the two aligners. In both cases, the reads that are exclusively aligned by Nexus often correspond to a high number of text occurrences (i.e., multimapping reads). The reads exclusively aligned by PuffAligner or Giraffe tend to correspond to 10 or 8 text occurrences, respectively. These occurrences contain over 4 errors and are therefore missed by Nexus. Giraffe has an unexpected median of 8 text occurrences per read. Moreover, the distribution of Giraffe’s text occurrences across the 10 genomes in the pangenome is uniform (Additional file 1: Fig. S3). These observations indicate that Giraffe is unsuitable for aligning reads to pangenomes consisting of multiple complete genomes.
A pairwise comparison between Nexus and linear aligners BWAMEM and Bowtie 2 reveals that, in this case, there are (virtually) no reads exclusively aligned by Nexus (not shown). Upon further investigation however, we observe that although these linear aligners excel at identifying at least one occurrence for nearly all reads, they are more insensitive to detecting all (almost) equally good alternative alignments. Figure 3 shows a detailed comparison of the occurrences reported by Nexus and BWAMEM. The left panel demonstrates that Nexus identifies 40 times more occurrences than BWAMEM, 98.30% of which are reported exclusively by Nexus. On the other hand, 32.05% of BWAMEM’s alignments are not found by Nexus. The middle panel of Fig. 3 shows that, apart from 118 exceptions, the reads exclusively aligned by BWAMEM correspond to an edit distance larger than 4, which falls outside Nexus’ current limitations. These 118 exceptions are in fact also reported by Nexus, but at a slightly different coordinate in the reference genome. Furthermore, Nexus is guaranteed to report each alignment with its minimal edit distance (right panel of Fig. 3). Additional file 1: Figure S4 illustrates how Nexus accurately reports the minimal edit distance for an example read. The same analysis conducted with Bowtie 2 yields similar conclusions (see Additional file 1: Fig. S5; Table S3). In summary, Nexus’ core strength is complete sensitivity within its defined limitations, whilst BWAMEM for instance only finds 35.75% of all exact alignments and only 0.16% of all alignments within an edit distance of 4.
Alignment performance analysis
Table 10 also reports three performance metrics. In terms of CPU time usage, PuffAligner, BWAMEM and Bowtie 2 are the most efficient fully functional aligners in our comparison. Based on the number of reads mapped per second, Nexus is almost two times faster than PuffAligner at \(K=0\), and faster than BWAMEM and Bowtie 2 even at \(K=1\). For higher values of K on the other hand, Nexus appears to be slower. However, based on the number of reported text occurrences per second, Nexus is significantly faster than any other aligner in our comparison. In other words, Nexus’ perceived lower performance per read is primarily due to it reporting a larger number of occurrences compared to other tools. In cases where the focus is solely on obtaining the optimal alignment(s) instead of all possible alignments within a specific edit distance, we propose a multistratum design gradually increasing the value of K until the optimal alignment(s) for a read are found. As such, the same alignment fraction reported for \(K=4\) can be reached at a much higher speed. The results for peak RAM memory usage are similar to what was reported in Table 8. In conclusion, despite Nexus detecting all occurrences within a specified edit distance, it achieves similar or even better performance levels compared to its competitors.
The effect of \(s_{ cp }\) and k on Nexus’ memory usage and APM performance
The use of checkpoint kmers reduces the time complexity to identify the node in the graph that corresponds to an arbitrary kmer to constant time at the cost of higher memory requirements. In Fig. 4, we analyze this timespace tradeoff by performing APM on a pangenome of 10 human genomes for different values of \(s_{ cp }\) and k. We also benchmarked without using checkpoint kmers (\(s_{ cp }=\infty\)). We observe that for \(k=50\) and \(k=75\), decreasing \(s_{ cp }\) results in faster node path extraction (Fig. 4, left). For \(k=25\), we see that \(s_{ cp }\) has only a limited effect on runtime. This is because, in that case, the median node length is short (28 characters), which means that even without intermediate checkpoint kmers, only few LFiterations are required to identify the node.
The total memory usage of component \(\textrm{IDmap}\) is similar for all three values of k (Fig. 4, right). This is because where the average number of checkpoint kmers per node is lower for low values of k, the higher number of nodes in the graph cancels out this effect (and vice versa). Note that the memory usage at its highest point (\(k=75\), \(s_{ cp }=8\)) amounts to 4.89 GiB. This is a significant increase with respect to component \(\textrm{IDmap}^r\) (83.90 MiB), but still only about 5% of the total data structure (Table 7). For a good balance between better APM performance and limited additional memory usage, we recommend a checkpoint sparseness factor around 128 (default). Finding the node paths for \(k=75\) using this default value of \(s_{ cp }=128\) for example, is twice as fast than without checkpoint kmers with only 228 MiB additional memory usage.
Case study on the bacterium M. Tuberculosis
In this case study, we demonstrate the potential of visualizing subgraphs and extracting information from the pangenome graph topology. Specifically, we want to study antibiotic resistance in bacteria, as it remains a medically relevant topic for monitoring infectious diseases [54, 58,59,60]. Therefore, we built a pangenome containing 340 M. tuberculosis strains from KwaZuluNatal and one H37Rv reference strain, with \(k=19\) (which was chosen after manual investigation), to visualize and investigate regions that are related to rifampicin resistance. Cohen et al. [54] listed 18 mutations from the RRDR region (the Rifampicin Resistance Determining Region, i.e., the 81 bp core region of gene rpoB), which is known to be related to rifampicin resistance [61]. From these mutations, we select the three that were reported to be observed in more than 50 strains of the dataset for closer investigation, as to limit the extent of this case study. Table 11 shows these three mutations, along with their coordinates with respect to the reference strain and the number of strains in the dataset that carry it.
Using the visualization algorithms discussed earlier, these mutations with their surroundings can be visualized. As the visualization of the complete RRDR region is too large to include in this paper, we show the subgraph that contains mutations S450L and L452P (i.e., the end of the RRDR region) in Fig. 5. Such visualizations are beneficial during handson research, as they allow the enduser to manually investigate the regions of interest in depth. In this case, we can indeed confirm that the mutations observed in [54] are present in our dataset.
Compensatory mutations
Cohen et al. report, “While accumulation of drugresistance mutations can confer a fitness cost to bacteria, subsequent development of compensatory mutations can ameliorate these costs by restoring certain affected physiological functions while maintaining drug resistance” [54]. For rifampicin resistance, putative compensatory mutations in genes rpoA, rpoC and the nonRRDR regions of gene rpoB have already been discussed in previous literature [62,63,64]. Cohen et al. analyzed these reported compensatory mutations and investigated the pangenome for new ones. In total, they report 49 putative rifampicin compensatory mutations that meet their requirements (i.e., evolved after or concurrent to genotypic rifampicin resistance), 26 of which were newly identified. In this paper, we only consider putative compensatory mutations that are comutated with one of the three mutations in Table 11, and that occur at least twice. These limitations leave us with 15 putative compensatory mutations, which are reported in the second column of Table 12.
In this paper, we set up an independent search for putative compensatory mutations by leveraging the functionality of finding the neighboring nodes of a certain node path. Specifically, we find the neighboring nodes in the graph for genes rpoA, rpoC and the nonRRDR regions of gene rpoB in the H37Rv reference strain, using the visualization algorithms. If possible, we assign a coordinate to the neighboring nodes, by jumping back to predecessor nodes (Algorithm 3) until a node is encountered that can be unambiguously positioned with respect to reference H37Rv (details are omitted). We then consider a neighboring node to contain a candidate putative compensatory variation if the following conditions are met:

1
A coordinate was found within a limited number of steps back in the graph.

2
The neighboring node contains only strains that carry one of the three mutations from Table 11. All strains must carry the same RRDR mutation.

3
The neighboring node must have a multiplicity of at least two. In other words, the candidate putative compensatory variation must appear at least twice.
Applying this workflow to our pangenome, results in 14 candidate putative compensatory mutation nodes, for which their coordinate (with respect to reference H37Rv), node identifier, corresponding RRDR mutation and multiplicity is shown in Table 12. Note that most candidate compensatory genes correspond to the S450L RRDR mutation, as was also observed in [54]. We compare our results with the 15 putative compensatory mutations reported in [54] we selected previously. We can distinguish three categories in Table 12: 2 entries were only reported in [54], 1 entry was only reported by our pipeline, and 13 entries correspond to a matching polymorphism reported both in [54] and by our pipeline.
Two polymorphisms are missed by our pipeline due to the following reasons:

E750D: the transition from glutamic acid to aspartic acid happens in two ways: from “GAG” to “GAC” (node 67460) and from “GAG” to “GAT” (node 108256). Hence, they are presented as two separate mutations with a multiplicity of one, which do not meet the third condition.

V183G: this mutation can be found in node 61254, but one of the strains that passes through it is not genotypically rifampicin resistant (i.e., does not carry an RRDR mutation). Hence, this node does not meet the second condition.
In summary, our pipeline detects all putative compensatory mutations from [54] within the limits we imposed.
For the entry that is only reported by our pipeline, further research is required. First, we investigate the type of the variant: it could be a substitution (silent, missense or nonsense), or an insertion/deletion (possibly introducing frameshift). We do this manually, based on the visualization of the neighborhood of this variation (see Fig. 6) and the codon information of the reference genome on NCBI. As is detailed in Fig. 6, node 111083 contains a silent mutation, which is currently not known to have compensatory effects. If a missense mutation had been found, additional research would have been required (e.g., verifying the presence of this candidate in multiple, distinct phylogenetic clades or verifying that this candidate evolved after or concurrent to genotypic rifampicin resistance).
Note that the pipeline discussed in this section is more of an ad hoc solution to the problem of finding candidate compensatory mutations corresponding to mutations in the RRDR region of rpoB, rather than a general pipeline to be readily applied to other problems. Moreover, the choice of parameter k has a big impact on the results of this pipeline. Nevertheless, the results from [54] can be reproduced within the limitations we impose, without the need for variant calling. Hence, this application clearly shows the potential of exploiting the information that is embedded in the pangenome ccdBG, using the graph operations we propose in this paper.
Conclusions
In this paper we proposed Nexus, a memoryefficient representation of the colored compacted de Bruijn graph enabling subgraph visualization and lossless approximate pattern matching of reads to the graph, developed to store pangenomes. This implicit graph representation is built on top of the bidirectional FMindex in a modular and complementary way, with a limited additional memory cost (around 15%). We demonstrated that it allows for easy integration of recent developments for the bidirectional FMindex, by applying search schemes to our pangenome graph. Using search schemes, we provided a very efficient implementation of lossless approximate pattern matching of reads to the graph, showing similar performance to stateoftheart lossy read(tograph) aligners. We showed that Nexus’ strength is to identify all possible occurrences corresponding to a read, even if they are highly abundant. We also established a usecase demonstrating the advantage of Nexus’ versatility, by combining both the approximate pattern matching and visualization functionalities to analyze antimicrobial resistance mutations and their possible compensatory mutations. Future work includes extending the implementation to a complete aligner (e.g., providing SAM output), integrating pairedend read alignment, building a multistratum search scheme design and extending the search schemes to allow for more than 4 errors.
Availability of data and materials
The datasets supporting the conclusions of this article are publicly available, and the ‘Data and Hardware’ section lists all corresponding dataset identifiers and references. The C++ source code of Nexus is available at https://github.com/biointec/nexus under the GNU AGPL v3.0 license.
Abbreviations
 AGPL:

Affero general public license
 AMR:

Antimicrobial resistance
 APM:

Approximate pattern matching
 BWA:

BurrowsWheeler aligner
 BWT:

BurrowsWheeler transform
 ccdBG:

Colored compacted de Bruijn graph
 cdBG:

Compacted de Bruijn graph
 CPU:

Central processing unit
 dBG:

de Bruijn graph
 DNA:

Deoxyribonucleic acid
 DP:

Dynamic programming
 ED:

Edit distance
 ID:

Identifier
 LCP:

Longest common prefix
 LF:

LasttoFirst (property)
 MEM:

Maximal exact match
 NCBI:

National center for biotechnology information
 ONT:

Oxford nanopore technologies
 PacBio:

Pacific biosciences
 RAM:

Randomaccess memory
 RRDR:

Rifampicin resistance determining region
 SA:

Suffix array
 SNP:

Singlenucleotide polymorphism
 UCSC:

University of California Santa Cruz
References
Tettelin H, Masignani V, Cieslewicz MJ, Donati C, Medini D, Ward NL, et al. Genome analysis of multiple pathogenic isolates of Streptococcus agalactiae: implications for the microbial pangenome. Proc Natl Acad Sci. 2005;102(39):13950–5. https://doi.org/10.1073/pnas.0506758102.
Consortium TCPG. Computational pangenomics: status, promises and challenges. Brief Bioinform. 2016;19(1):118–35. https://doi.org/10.1093/bib/bbw089.
Marcus S, Lee H, Schatz MC. SplitMEM: a graphical algorithm for pangenome analysis with suffix skips. Bioinformatics. 2014;30(24):3476–83. https://doi.org/10.1093/bioinformatics/btu756.
Baier U, Beller T, Ohlebusch E. Graphical pangenome analysis with compressed suffix trees and the BurrowsWheeler transform. Bioinformatics. 2015;32(4):497–504. https://doi.org/10.1093/bioinformatics/btv603.
Brandt DYC, Aguiar VRC, Bitarello BD, Nunes K, Goudet J, Meyer D. Mapping bias overestimates reference allele frequencies at the HLA genes in the 1000 genomes project phase I data. Genes Genom Genet. 2015;5(5):931–41. https://doi.org/10.1534/g3.114.015784.
Degner JF, Marioni JC, Pai AA, Pickrell JK, Nkadori E, Gilad Y, et al. Effect of readmapping biases on detecting allelespecific expression from RNAsequencing data. Bioinformatics. 2009;25(24):3207–12. https://doi.org/10.1093/bioinformatics/btp579.
Martiniano R, Garrison E, Jones ER, Manica A, Durbin R. Removing reference bias and improving indel calling in ancient DNA data analysis by mapping to a sequence variation graph. Genome Biol. 2020;21(1):250. https://doi.org/10.1186/s13059020021607.
Groza C, Kwan T, Soranzo N, Pastinen T, Bourque G. Personalized and graph genomes reveal missing signal in epigenomic data. Genome Biol. 2020;21(1):124. https://doi.org/10.1186/s13059020020388.
Chen NC, Solomon B, Mun T, Iyer S, Langmead B. Reference flow: reducing reference bias using multiple population genomes. Genome Biol. 2021;22(1):8. https://doi.org/10.1186/s13059020022293.
Li H. Aligning sequence reads, clone sequences and assembly contigs with BWAMEM; 2013. https://doi.org/10.48550/arXiv.1303.3997.
Langmead B, Salzberg SL. Fast gappedread alignment with Bowtie 2. Nat Methods. 2012;9(4):357–9. https://doi.org/10.1038/nmeth.1923.
Ferragina P, Manzini G. Opportunistic data structures with applications. In: Proceedings 41st Annual Symposium on Foundations of Computer Science; 2000. p. 390–398. https://doi.org/10.1109/SFCS.2000.892127.
Gagie T, Navarro G, Prezza N. Fully functional suffix trees and optimal text searching in BWTruns bounded space. J ACM. 2020;67(1):635. https://doi.org/10.1145/3375890.
Schneeberger K, Hagmann J, Ossowski S, Warthmann N, Gesing S, Kohlbacher O, et al. Simultaneous alignment of short reads against multiple genomes. Genome Biol. 2009;10(9):R98. https://doi.org/10.1186/gb2009109r98.
Rakocevic G, Semenyuk V, Lee WP, Spencer J, Browning J, Johnson IJ, et al. Fast and accurate genomic analyses using genome graphs. Nat Genet. 2019;51(2):354–62. https://doi.org/10.1038/s4158801803164.
Kim D, Paggi JM, Park C, Bennett C, Salzberg SL. Graphbased genome alignment and genotyping with HISAT2 and HISATgenotype. Nat Biotechnol. 2019;37(8):907–15. https://doi.org/10.1038/s4158701902014.
Jain C, Misra S, Zhang H, Dilthey A, Aluru S. Accelerating Sequence Alignment to Graphs. In: 2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS); 2019. p. 451–461. https://doi.org/10.1109/IPDPS.2019.00055.
Vaddadi K, Srinivasan R, Sivadasan N. Read Mapping on Genome Variation Graphs. In: Huber KT, Gusfield D, editors. 19th International Workshop on Algorithms in Bioinformatics (WABI 2019). vol. 143 of Leibniz International Proceedings in Informatics (LIPIcs). Dagstuhl, Germany: Schloss Dagstuhl–LeibnizZentrum fuer Informatik; 2019. p. 7:1–7:17. https://doi.org/10.4230/LIPIcs.WABI.2019.7.
Darby CA, Gaddipati R, Schatz MC, Langmead B. Vargas: heuristicfree alignment for assessing linear and graph read aligners. Bioinformatics. 2020;36(12):3712–8. https://doi.org/10.1093/bioinformatics/btaa265.
Garrison E, Sirén J, Novak AM, Hickey G, Eizenga JM, Dawson ET, et al. Variation graph toolkit improves read mapping by representing genetic variation in the reference. Nat Biotechnol. 2018;36(9):875–9. https://doi.org/10.1038/nbt.4227.
Sirén J, Monlong J, Chang X, Novak AM, Eizenga JM, Markello C, et al. Pangenomics enables genotyping of known structural variants in 5202 diverse genomes. Science. 2021;374(6574):8871. https://doi.org/10.1126/science.abg8871.
Rautiainen M, Marschall T. GraphAligner: rapid and versatile sequencetograph alignment. Genome Biol. 2020;21(1):253. https://doi.org/10.1186/s13059020021572.
Li H, Feng X, Chu C. The design and construction of reference pangenome graphs with minigraph. Genome Biol. 2020;21(1):265. https://doi.org/10.1186/s1305902002168z.
Jain C, Zhang H, Gao Y, Aluru S. On the complexity of sequencetograph alignment. J Comput Biol. 2020;27(4):640–54. https://doi.org/10.1089/cmb.2019.0066.
Myers EW, Sutton GG, Delcher AL, Dew IM, Fasulo DP, Flanigan MJ, et al. A wholegenome assembly of drosophila. Science. 2000;287(5461):2196–204. https://doi.org/10.1126/science.287.5461.2196.
Iqbal Z, Caccamo M, Turner I, Flicek P, McVean G. De novo assembly and genotyping of variants using colored de Bruijn graphs. Nat Genet. 2012;44(2):226–32. https://doi.org/10.1038/ng.1028.
Limasset A, Cazaux B, Rivals E, Peterlongo P. Read mapping on de Bruijn graphs. BMC Bioinform. 2016;17(1):237. https://doi.org/10.1186/s1285901611039.
Heydari M, Miclotte G, Van de Peer Y, Fostier J. BrownieAligner: accurate alignment of Illumina sequencing data to de Bruijn graphs. BMC Bioinform. 2018;19(1):311. https://doi.org/10.1186/s1285901823197.
Dvorkina T, Antipov D, Korobeynikov A, Nurk S. SPAligner: alignment of long diverged molecular sequences to assembly graphs. BMC Bioinform. 2020;21(12):306. https://doi.org/10.1186/s12859020035907.
Bowe A, Onodera T, Sadakane K, Shibuya T. Succinct de Bruijn Graphs. In: Raphael B, Tang J, editors. Algorithms in Bioinformatics. Berlin: Springer; 2012. p. 225–235. https://doi.org/10.1007/9783642331220_18.
Boucher C, Bowe A, Gagie T, Puglisi SJ, Sadakane K. VariableOrder de Bruijn Graphs. In: 2015 Data Compression Conference; 2015. p. 383–392. https://doi.org/10.1109/DCC.2015.70.
Belazzougui D, Gagie T, Mäkinen V, Previtali M, Puglisi SJ. Bidirectional VariableOrder de Bruijn Graphs. In: Kranakis E, Navarro G, Chávez E, editors. LATIN 2016: Theoretical Informatics. Berlin, Heidelberg: Springer Berlin Heidelberg; 2016. p. 164–178. https://doi.org/10.1007/9783662495292_13.
Muggli MD, Bowe A, Noyes NR, Morley PS, Belk KE, Raymond R, et al. Succinct colored de Bruijn graphs. Bioinformatics. 2017;33(20):3181–7. https://doi.org/10.1093/bioinformatics/btx067.
DíazDomínguez D, Gagie T, Navarro G. Simulating the DNA Overlap Graph in Succinct Space. In: Pisanti N, Pissis SP, editors. 30th Annual Symposium on Combinatorial Pattern Matching (CPM 2019). vol. 128 of Leibniz International Proceedings in Informatics (LIPIcs). Dagstuhl, Germany: Schloss Dagstuhl–LeibnizZentrum fuer Informatik; 2019. p. 26:1–26:20. https://doi.org/10.4230/LIPIcs.CPM.2019.26.
Alanko JN, Vuohtoniemi J, Mäklin T, Puglisi SJ. Themisto: a scalable colored kmer index for sensitive pseudoalignment against hundreds of thousands of bacterial genomes. bioRxiv. 2023; https://doi.org/10.1101/2023.02.24.529942.
Liu B, Guo H, Brudno M, Wang Y. deBGA: read alignment with de Bruijn graphbased seed and extension. Bioinformatics. 2016;32(21):3224–32. https://doi.org/10.1093/bioinformatics/btw371.
Almodaresi F, Zakeri M, Patro R. PuffAligner: a fast, efficient and accurate aligner based on the Pufferfish index. Bioinformatics. 2021;37(22):4048–55. https://doi.org/10.1093/bioinformatics/btab408.
Almodaresi F, Sarkar H, Srivastava A, Patro R. A space and timeefficient index for the compacted colored de Bruijn graph. Bioinformatics. 2018;34(13):i169–77. https://doi.org/10.1093/bioinformatics/bty292.
Beller T, Ohlebusch E. A representation of a compressed de Bruijn graph for pangenome analysis that enables search. Algorithms Mol Biol. 2016;11(1):20. https://doi.org/10.1186/s1301501600837.
Dede K, Ohlebusch E. Dynamic construction of pangenome subgraphs. Open Comput Sci. 2020;10(1):82–96. https://doi.org/10.1515/comp20200018.
Lam TW, Li R, Tam A, Wong S, Wu E, Yiu SM. High Throughput Short Read Alignment via Bidirectional BWT. In: 2009 IEEE International Conference on Bioinformatics and Biomedicine; 2009. p. 31–36. https://doi.org/10.1109/BIBM.2009.42.
Kucherov G, Salikhov K, Tsur D. Approximate String Matching Using a Bidirectional Index. In: Kulikov AS, Kuznetsov SO, Pevzner P, editors. Combinatorial Pattern Matching. Cham: Springer International Publishing; 2014. p. 222–231. https://doi.org/10.1007/9783319075662_23.
Kianfar K, Pockrandt C, Torkamandi B, Luo H, Reinert K. Optimum Search Schemes for Approximate String Matching Using Bidirectional FMIndex; 2018. https://doi.org/10.48550/arXiv.1711.02035.
Pockrandt CM. Approximate String Matching: Improving Data Structures and Algorithms [dissertation]. Free University of Berlin, Dahlem, Germany; 2019. https://doi.org/10.17169/refubium2185.
Renders L, Marchal K, Fostier J. Dynamic partitioning of search patterns for approximate pattern matching using search schemes. iScience. 2021;24(7):102687. https://doi.org/10.1016/j.isci.2021.102687.
Pevzner PA, Tang H, Waterman MS. An Eulerian path approach to DNA fragment assembly. Proc Natl Acad Sci. 2001;98(17):9748–53. https://doi.org/10.1073/pnas.171285098.
Manber U, Myers G. Suffix arrays: a new method for online string searches. SIAM J Comput. 1993;22(5):935–48. https://doi.org/10.1137/0222058.
Burrows M, Wheeler D. A BlockSorting Lossless Data Compression Algorithm. 130 Lytton Avenue, Palo Alto, California 94301: Digital Equipment Corporation Systems Research Center; 1994. 124.
Pockrandt C, Ehrhardt M, Reinert K. EPRDictionaries: A Practical and Fast Data Structure for Constant Time Searches in Unidirectional and Bidirectional FM Indices. In: Sahinalp SC, editor. Research in Computational Molecular Biology. Cham: Springer International Publishing; 2017. p. 190–206. https://doi.org/10.1007/9783319569703_12.
Renders L, Depuydt L, Fostier J. Approximate Pattern Matching Using Search Schemes and InText Verification. In: Rojas I, Valenzuela O, Rojas F, Herrera LJ, Ortuño F, editors. Bioinformatics and Biomedical Engineering. Cham: Springer International Publishing; 2022. p. 419–435. https://doi.org/10.1007/9783031078026_36.
Vigna S. Broadword Implementation of Rank/Select Queries. In: McGeoch CC, editor. Experimental Algorithms. Berlin, Heidelberg: Springer Berlin Heidelberg; 2008. p. 154–168. https://doi.org/10.1007/9783540685524_12.
Shannon P, Markiel A, Ozier O, Baliga NS, Wang JT, Ramage D, et al. Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res. 2003;13(11):2498–504. https://doi.org/10.1101/gr.1239303.
Rozowsky J, Abyzov A, Wang J, Alves P, Raha D, Harmanci A, et al. AlleleSeq: analysis of allelespecific expression and binding in a network framework. Mol Syst Biol. 2011;7(1):522. https://doi.org/10.1038/msb.2011.54.
Cohen KA, Abeel T, Manson McGuire A, Desjardins CA, Munsamy V, Shea TP, et al. Evolution of extensively drugresistant tuberculosis over four decades: whole genome sequencing and dating analysis of mycobacterium tuberculosis isolates from KwaZuluNatal. PLoS Med. 2015;12(9): e1001880. https://doi.org/10.1371/journal.pmed.1001880.
Koenig R. Few mutations divide some drugresistant TB strains. Science. 2007;318(5852):901–2. https://doi.org/10.1126/science.318.5852.901a.
Ioerger TR, Koo S, No EG, Chen X, Larsen MH, Jacobs WR Jr, et al. Genome analysis of multi and extensivelydrugresistant tuberculosis from KwaZuluNatal, South Africa. PLoS ONE. 2009;4(11): e7778. https://doi.org/10.1371/journal.pone.0007778.
Arakawa Y, Navarro G, Sadakane K. BiDirectional rIndexes. In: Bannai H, Holub J, editors. 33rd Annual Symposium on Combinatorial Pattern Matching (CPM 2022). vol. 223 of Leibniz International Proceedings in Informatics (LIPIcs). Dagstuhl, Germany: Schloss Dagstuhl – LeibnizZentrum für Informatik; 2022. p. 11:1–11:14. https://doi.org/10.4230/LIPIcs.CPM.2022.11.
Manson AL, Tyne DV, Straub TJ, Clock S, Crupain M, Rangan U, et al. Chicken meatassociated enterococci: influence of agricultural antibiotic use and connection to the clinic. Appl Environ Microbiol. 2019;85(22):e01559. https://doi.org/10.1128/AEM.0155919.
Tyne DV, Manson AL, Huycke MM, Karanicolas J, Earl AM, Gilmore MS. Impact of antibiotic treatment and host innate immune pressure on enterococcal adaptation in the human bloodstream. Sci Transl Med. 2019;11(487):8418. https://doi.org/10.1126/scitranslmed.aat8418.
Lebreton F, Manson AL, Saavedra JT, Straub TJ, Earl AM, Gilmore MS. Tracing the Enterococci from Paleozoic Origins to the Hospital. Cell. 2017;169(5):849861.e13. https://doi.org/10.1016/j.cell.2017.04.027.
Telenti A, Imboden P, Marchesi F, Matter L, Schopfer K, Bodmer T, et al. Detection of rifampicinresistance mutations in Mycobacterium tuberculosis. The Lancet. 1993;341(8846):647–51. https://doi.org/10.1016/01406736(93)90417F.
Comas I, Borrell S, Roetzer A, Rose G, Malla B, KatoMaeda M, et al. Wholegenome sequencing of rifampicinresistant Mycobacterium tuberculosis strains identifies compensatory mutations in RNA polymerase genes. Nat Genet. 2012;44(1):106–10. https://doi.org/10.1038/ng.1038.
Casali N, Nikolayevskyy V, Balabanova Y, Harris SR, Ignatyeva O, Kontsevaya I, et al. Evolution and transmission of drugresistant tuberculosis in a Russian population. Nat Genet. 2014;46(3):279–86. https://doi.org/10.1038/ng.2878.
de Vos M, Müller B, Borrell S, Black PA, van Helden PD, Warren RM, et al. Putative compensatory mutations in the rpoC gene of Rifampinresistant mycobacterium tuberculosis are associated with ongoing transmission. Antimicrob Agents Chemother. 2013;57(2):827–32. https://doi.org/10.1128/AAC.0154112.
Acknowledgements
The authors thank Enno Ohlebusch for reading the manuscript and providing useful suggestions.
Funding
L.D.: PhD Fellowship FR (1117322N) by the Research Foundation  Flanders (FWO). L.R.: PhD Fellowship SB (1SE7822N) by the Research Foundation  Flanders (FWO).
Author information
Authors and Affiliations
Contributions
L.D., L.R. and J.F. designed and implemented the algorithms. L.D. performed all benchmarks. T.A. and J.F. supervised the study. All authors have written and approved the manuscript.
Corresponding authors
Ethics declarations
Ethics approval and consent to participate
Not applicable
Consent for publication
Not applicable
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
Additional file 1
. Supplementary information on the bidirectional FMindex, supplementary results, and instructions for reproducing the results
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Depuydt, L., Renders, L., Abeel, T. et al. Pangenome de Bruijn graph using the bidirectional FMindex. BMC Bioinformatics 24, 400 (2023). https://doi.org/10.1186/s12859023055316
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s12859023055316