SPAligner: alignment of long diverged molecular sequences to assembly graphs

Background Graph-based representation of genome assemblies has been recently used in different contexts — from improved reconstruction of plasmid sequences and refined analysis of metagenomic data to read error correction and reference-free haplotype reconstruction. While many of these applications heavily utilize the alignment of long nucleotide sequences to assembly graphs, first general-purpose software tools for finding such alignments have been released only recently and their deficiencies and limitations are yet to be discovered. Moreover, existing tools can not perform alignment of amino acid sequences, which could prove useful in various contexts — in particular the analysis of metagenomic sequencing data. Results In this work we present a novel SPAligner (Saint-Petersburg Aligner) tool for aligning long diverged nucleotide and amino acid sequences to assembly graphs. We demonstrate that SPAligner is an efficient solution for mapping third generation sequencing reads onto assembly graphs of various complexity and also show how it can facilitate the identification of known genes in complex metagenomic datasets. Conclusions Our work will facilitate accelerating the development of graph-based approaches in solving sequence to genome assembly alignment problem. SPAligner is implemented as a part of SPAdes tools library and is available on Github.

error correction [11], and haplotype separation [12]. At the same time, the choice of the practical aligners supporting long nucleotide sequences is currently limited to vg [4] and GraphAligner [13], both of which are under active development. Moreover, to the best of our knowledge, no existing graph-based aligner supports alignment of amino acid sequences.
Here we present the SPAligner (Saint Petersburg Aligner) tool for aligning long diverged molecular (both nucleotide and amino acid) sequences against assembly graphs produced by the popular short-read assemblers. The project stemmed from our previous efforts on the long-read alignment within the hybridSPAdes assembler [9]. Our benchmarks on various Pacbio and Oxford Nanopore datasets show that SPAligner is highly competitive to vg and GraphAligner in aligning long error-prone reads. We also demonstrate SPAligner's ability to accurately align amino-acid sequences (with up to 90% amino acid identity) onto complex assembly graphs of metagenomic datasets. To further motivate this application we show how SPAligner can be used for identification of biologically important (antibiotic-resistance) genes, which remain under the radar of conventional pipelines due to assembly fragmentation (e.g. genes exhibiting high variability in complex environmental samples).

Implementation
"Alignment of long nucleotide sequences" section describes the notation and presents the approach taken in SPAligner for semi-global alignment of nucleotide sequences to the assembly graph. Extension of this approach for aligning amino acid sequences is discussed in "Alignment of amino-acid sequences" section.

Alignment of long nucleotide sequences
Let G be a (directed) compacted de Bruijn graph with edges labeled by nucleotide sequences that we further refer to as assembly graph 1 . Length of a nucleotide sequence S is denoted by |S| and length of edge e, |e|, is defined as a length of its label. S[ x] denotes x-th symbol of S (starting from 0). Position x on string S belongs to a range [ 0 . . . |S|] and corresponds to location before S [ x], if x < |S|, and after S[ |S| − 1] otherwise. S[ a : b] is a substring of S between positions a and b. By the size of graph G, |G|, we denote the sum of the number of its vertices, edges and total length of all labels.
Position in the graph is naturally defined by a pair of an edge e and position in the sequence of e: p = (e, i), where 0 <= i <= |e|. Note that with such notation there are multiple positions in graph that corresponds to a vertex of G -it is located at the end of each incoming edge and at the beginning of each outgoing edge. G[ p] denotes a symbol on position i within edge e. A path P in G is defined by a sequence of consecutive edges e 1 , . . . , e n and a pair of positions pos e 1 and pos e n on e 1 and e n respectively. We denote the nucleotide sequence obtained by concatenation of edge labels in a path P (trimmed according to pos e 1 and pos e n ) as Label(P).
For the sake of clarity throughout the paper we are going to neglect the fact that adjacent edges produced by popular assembly tools typically overlap (in particular adjacent edges produced by majority of the de Bruijn graph based assemblers overlap by K bp, where 1 While common definition of assembly graph associates sequences with the vertices, assembly graphs produced by de Bruijn graph based assemblers can be readily represented as edge-labeled graphs, considered in this work. We also note that our methods can be straightforwardly re-implemented in application to other assembly graphs. K is the kmer size parameter of de Bruijn graph). Additional file 1 "Notes on running aligners" presents a formal description of the assembly graphs accepted by the current implementation of SPAligner.
The semi-global alignment of a sequence S (query) to an assembly graph G is defined as a path P in G, such that Label(P) has minimal alignment cost against S across all paths in G. As well as other practical tools for searching alignments in large sequence graphs (including vg and GraphAligner) SPAligner first identifies regions of high nucleotide identity between the query and the graph sequences that we refer to as anchor alignments (or anchors) and then attempts to extend them to desired semi-global alignments.
Below we outline the four steps implemented in SPAligner for aligning a nucleotide sequence query S to an assembly graph G (see Fig. 1), generally following the approach used in alignment module of hybridSPAdes assembler [9] Anchor search. Anchor alignments between S and the pre-indexed edge labels are identified using the BWA-MEM library [14] (every high-scored local alignment between a query and individual edge labels reported by BWA-MEM is considered as a potential anchor). Each anchor a is defined by a range on S -[ start s (a), end s (a)) and a range on a specific edge e(a) -[ start e (a), end e (a)). The span of an anchor a is defined as (end s (a) − start s (a)). Only anchors with span exceeding a certain threshold (by default equal to K -the kmer size parameter of de Bruijn graph) are considered. Since the goal of this step is to get a set of all potentially-relevant anchors, BWA-MEM settings were adjusted for higher sensitivity. In particular, we modified the defaults to retain as many alignments as possible: secondary alignments were included, mask_level and drop_ratio were assigned to 20, and the seed size was decreased from 17 bp to 14 bp.  1 Sequence-to-graph alignment with SPAligner. Alignment of query sequence S (orange bar) to assembly graph G. Assembly graph edges are considered directed left-to-right (explicit edge orientation was omitted to improve clarity). Upper-left: Anchor search. Anchor alignments (regions of high similarity) between the query and the edge labels are identified with BWA-MEM. Upper-right: Anchor filtering. Anchors shorter than K (anchors 2, 6, 11), anchors "in the middle" of long edge (anchor 7) or ambiguous anchors (anchor 10 mostly covered by anchor 9, both anchors 4 and 5) are discarded. Bottom-left: Anchor chaining. Heaviest chain of compatible anchors (chain 3 → 8 → 9) is determined. Bottom-right: Reconstructing the filling paths. Paths for fragments of the query between the consecutive chain anchors (as well as left-and right-most fragments) are reconstructed Anchor filtering. Unreliable and ambiguous anchors are further discarded based on their size, locations in the graph and the layout on the query sequence. In particular, the following groups of anchors are discarded (in the specified order): • Anchors "in the middle" of long edges, which span less than T bp (T = 500, by default). More formally we discard the anchor a if end e (a) − start e (a)+ min(start s (a), start e (a)) + min(|S|−end s (a), |e|−end e (a)) > 3 · (end e (a) − start e (a)).
• Anchors for which at least half of their query ranges are covered by other anchors.
Anchor chaining. Anchors are assigned weights equal to their span. Two anchors a and b are considered compatible if the minimal distance between them in G does not significantly exceed the distance between their positions on S. SPAligner searches for the heaviest chain of compatible anchors using dynamic programming and considers the resulting chain as a skeleton of the final alignment [9]. Two anchor alignments a and b are considered compatible if the minimal distance between them in G does not significantly exceed the distance in S (formally, dist G (p a , p b )/(start s (b) − end s (a)) > α, where p a = (e(a), end e (a)), p b = (e(b), start e (b) and α defaults to 1.3).
Reconstructing the filling paths. The goal of this step is to identify paths in G between the consecutive anchors in the skeleton (and beyond its leftmost/rightmost anchors) minimizing the alignment cost against corresponding regions of the query sequence. Following a number of previous works on alignment of sequences against general graphs (potentially containing cycles) [9,13,15,16], we use classic reformulation of the alignment problem as a minimal-weight path problem in an appropriately constructed alignment graph (or edit graph) [13,16]. The alignment graph is a weighted directed graph which is constructed in such a way that paths between certain types of vertices unambiguously correspond that valid alignments of the query sequence to G and the alignment costs are equal to the path weights.
The primary reason for using BWA alignments instead of exact matches was to utilize local chaining and extension of the k-mer matches implemented in BWA. Compared to the exact matches the resulting anchors allow for easier and more reliable filtering of spurious matches and are better suited for scoring (easier to sensibly prioritize) within our chaining procedure.
Though SPAligner was primarily designed for finding semi-global alignments, in some cases it may also produce split-read alignments, i.e. separate alignments covering nonoverlapping segments of the query sequence (see Additional file 1 "Support for split-read alignments").

Sequence to graph alignment via alignment graphs
While in this work we only consider linear gap penalty scoring scheme (with non-negative gap penalty parameter σ and mismatch cost μ), proposed approach can straightforwardly support affine gap cost model [17].
For the given query sequence fragment, assembly graph and the scoring parameters we define alignment graph SG(G, Sub) with vertices corresponding to all pairs of position in the graph and position in the query sequence.
The weighted 2 edges are introduced in SG(G, Sub) as follows: Consider a pair of consecutive anchors in the skeleton chain, a i : , end e (a i+1 ))), and a substring of query Sub = S[ end s (a i ) : start s (a i+1 )]. ED(S 1 , S 2 ) denotes alignment cost (linear gap penalty function with parameters μ and σ ) between strings S 1 and S 2 . Our goal is to find a path P, connecting positions p 1 = (e(a i ), end e (a i )) and p 2 = (e(a i+1 ), start e (a i+1 )) in the graph G minimizing ED(Label(P), Sub). It is easy to see (i.e. Lemma 5 in [15]) that path P can be recovered from the minimal-weighted path connecting vertices < p 1 , 0 > and < p 2 , |Sub| > in SG(G, Sub), moreover, its weight is equal to ED(Label(P), Sub). Note that SG(G, Sub) only has non-negative weights, so minimal-weighted paths search can be efficiently performed by Dijkstra algorithm [9,15].
In addition to reconstruction of paths between the consecutive anchors of the alignment skeleton, SPAligner also attempts to extend the alignment beyond the leftmost/rightmost anchors. Without loss of generality, we consider finding optimal alignment of the query fragment Suf beyond end s (a last ), fixing its starting position in the graph G as p s = (e(a last ), end e (a last )). The sought answer can be recovered from the minimal weight path in SG(G, Suf ) across all paths connecting < p s , 0 > with any of the vertices {< p t , |Suf | > }, where p t iterates through all positions in G, which can also be found by Dijkstra algorithm.
To speed-up the optimal path search SPAligner implements multiple heuristics (see Additional file 1 "Alignment extension heuristics and thresholds"). In particular, while we consider the entire graph G in the description above, a much smaller subgraph surrounding anchor alignments is considered by the SPAligner implementation. The implementation also benefits from available highly optimized solutions for sequenceto-sequence alignment (see Additional file 1 "Leveraging fast sequence-to-sequence alignment methods").
More efficient algorithms in terms of worst case time complexity have been earlier suggested in [18,19] for an important case of edit distance or Levenshtein distance (μ and σ equal to 1). Additional file 1"Shortest paths search in binary-weighted graphs" presents a simple modification of the basic approach described above achieving the same time complexity of O(|G| · |Sub|).
It worth to note that recently Jain et al. [6] suggested an elegant algorithm extending the same time complexity to both linear and affine gap penalty functions.

Alignment of amino-acid sequences
Consider an assembly graph G with edges labeled by nucleotide sequences and an amino acid query sequence S p . We will refer to a path P in G as an AA-path, if its label translates into a valid (i.e.without any stop codon) protein sequence (denoted by Translated(Label(P))). We define semi-global alignment of a sequence S p to graph G as an AA-path P in G, such that the alignment cost between S p and Translated(Label(P)) is minimal across all AA-paths in G 3 .
While considering fixed mismatch penalties is usually sufficient for aligning nucleotide sequences, most practical scoring schemes for amino acid sequence alignment use specialized substitution matrices (e.g. BLOSUM or PAM matrices) [20], which take into account the substitution rates between different pairs of amino acids over time as well as the relative frequencies of various amino acids.
To reconcile our cost-minimization formulation with the fact that typical substitution matrix M assumes better alignments to have higher cumulative scores, we will consider elements of matrix M = −M as match/mismatch score (by default M =BLOSUM90).
Again, though we only consider linear gap penalties with coefficient σ (in our experiments we used σ = 5) here, affine gap penalties can also be implemented without significant increase of running time or memory footprint [19].

Alignment graph for amino acid sequence alignment
For the given query sequence, assembly graph, substitution matrix and gap penalty coefficient (σ ), we introduce alignment graph, SG p . Each vertex corresponds to a triplet < p, pos Sub p , fs >, where p is position in the graph G, pos Sub p is the position in the query fragment sequence and fs is a frame state string (of length 0 − 2), encoding a partial codon sequence (empty string is denoted by ε). We further define the graph by the iterative procedure, initialized by introducing the vertices (p, 0, ε), where p iterates through all positions in G. Then, until no more edge can be added, we introduce edges and required vertices following one of the rules below: where p is a position in G, p is a position in G extending p (i.e. p = (e, i) and p = (e, i + 1) for some edge e and index i < |e|), pos Sub p -is a position in Sub p . As before, one can find the semi-global alignment path of the sequence Sub p (fixed on one or both sides by anchor alignments) by searching for minimal weight paths in SG p .
In particular, it is easy to show that the semi-global alignment of the amino acid sequence Sub p starting from position p s in the graph G corresponds to the minimal weight path in SG p (G, Sub p ) connecting vertex < p s , 0, ε > to any of the vertices {< p t , |Sub p |, 0 > }, where p t iterates through all positions in G.
Note that since any reasonable matrix M contains negative elements, the graph SG p will have edges of negative weight, preventing direct application of Dijkstra algorithm and its variations to the search of such minimal weight paths.
While Algorithm 2 from [19] (slightly modified to account for the frame states) can be applied here, we use an alternative approach resulting in the same time complexity 4 .
We define a new graph SG p (G, Sub p ) increasing the weights of edges introduced by rules 3 and 4 by an arbitrary constant C, exceeding the maximum value in M. Since the resulting graph has only edges with non-negative weights, Dijkstra algorithm can be used for the shortest paths search in this graph.
At the same time it is easy to show that the minimum weight path between two arbitrary nodes of the form s =< x, 0, ε > and e =< y, |Sub p | + 1, ε > in SG p is also the minimum weight path between s and e in SG p . Indeed, if we consider an arbitrary path P in SG p between s and e of weight w P , its weight after transformation (in SG p ) is w p + C · |Sub p |, because P contains exactly |Sub p | edges with modified weights. Therefore, the weights of all paths between two vertices s and e will be increased by the same constant |Sub p | · C.

Amino acid sequence alignment pipeline
Below we outline the four steps implemented in SPAligner for aligning an amino acid sequence S p to an assembly graph G, highlighting the differences with the pipeline presented in "Alignment of long nucleotide sequences" section.
Anchors search. As before SPAligner first identifies anchor alignments between S p and the pre-indexed six-frame translated edge labels. Current implementation relies on finding high-scoring local alignments between the "canonical" nucleotide representations of protein sequences, obtained by substituting every amino acid by its lexicographically minimal codon. The search is performed by BWA-MEM configured to prevent gapped alignments (gap_open penalty set to infinity) and we additionally check that the aligned ranges are consistently located with respect to the reading frame.
Anchor filtering. Unreliable and ambiguous anchors are discarded based on their size and query range. Anchors shorter than K are filtered out. Smaller lengths of typical protein sequences as compared to long read sequences allow us to omit the anchor chaining step in order to increase the accuracy of the resulting amino acid sequence alignments. It also eliminates the need to coordinate the reading frames of the chained anchors.
Alignment extension. SPAligher then uses the adjusted alignment graph approach (see "Alignment graph for amino acid sequence alignment") to search for optimal alignments extending each of the remaining anchors. The procedure employs several heuristics and constraints, which can be found in Additional file 1 "Alignment extension heuristics and thresholds".
Alignments post-processing. SPAligner removes alignment paths shorter than a certain fraction of the query coding sequence length (0.8 by default) and filters perfectly duplicating paths. Resulting paths are then translated and re-scored against the query sequence with Parasail library [21].
The vg alignment pipeline [4] starts from searching for super maximal equal matches (SMEMs) with GCSA2 library [22]. After filtering and chaining of SMEMs, the SMEMs in a chain are linked by SIMD-accelerated banded dynamic programming. In order to use this method, vg first "unrolls" the cycles to transform the graph into a directed acyclic graph (DAG). It is worth noting that this step can produce large intermediate graphs and potentially lead to suboptimal alignments. While working with long sequences, vg splits them into overlapping "chunks" (default 256 bp with 32 bp overlap) and maps them separately, further combining them by the same method as for SMEMs chaining.
GraphAligner [23] is a novel tool for aligning nucleotide sequences onto general sequence graphs, which implements many innovative techniques. GraphAligner v1.0.4 uses MUMmer4 [24] to identify potential alignment seeds to restrict the search for the potential alignment paths. Each seed is then extended separately by improved bit-parallel algorithm for sequence-to-graph alignment under an edit distance (unit costs) model. Extended description of GraphAligner implementation approach and detailed comparison with SPAligner can be found in Additional file 1 "Sequence-to-graph alignment implementation insights".
Besides the default configuration of GraphAligner we have also benchmarked its "try-all-seeds" mode (disabling the filtering of likely false positive seed alignments), recommended by the developers for bacterial datasets.
For benchmarking we considered PacBio and ONT reads (both real and simulated) for three different organisms: E. coli strain K12, S. cerevisiae strain S288C and C. elegans strain Bristol N2 (see Supplemental Text "Datasets availability" for details on the simulation and accession numbers). Assembly graphs were generated with SPAdes v3.12 [1] assembler from the appropriate short-read Illumina datasets (using "-k 21,33,55,77" and setting other parameters to defaults). For running the aligners we tried to follow the recommendations given by the developers (see Supplemental Text "Notes on running aligners" for details).
Results of benchmarking on real sequencing data are summarized in Table 1, while results of additional benchmarking on the simulated PacBio and ONT reads can be found in Supplementary Table S2. Additional analysis of dependency between reads alignment quality and their length can be found in Additional file 1 "Dependence of alignment quality on read length".
Table columns present the following metrics:  Vg showed highest average alignment identity for all considered datasets, but at the same time resulted in very low fraction of mapped reads. The same tendency can be seen on simulated datasets (see Additional file 1).
For E. coli datasets SPAligner showed 2-4% advantage in number of mapped reads over GraphAligner, with no significant difference in mapping identity.
For S. cerevisiae SPAligner showed 3% advantage in number of mapped reads on PacBio dataset while GraphAligner was able to map 4% more reads from ONT dataset. In both cases advantage in number of mapped reads is accompanied by 1% reduction of average mapping identity, likely reflecting alignment of additional lower identity reads. Supplementary Table S3 shows that on the set of reads mapped by both SPAligner and GraphAligner (51% for S. cerevisiae PacBio dataset and 60% for S. cerevisiae ONT) mapping identity is nearly the same for both tools.
SPAligner shows a 2% advantage in the number of mapped reads on C. elegans ONT dataset with similar mapping identity, and on C. elegans PacBio dataset both tools show similar results.
Overall, our benchmarking suggests that SPAligner and GraphAligner produce similar alignments of long error-prone reads with SPAligner being faster and more memoryefficient than GraphAligner in "try-all-seeds" mode and slower and less memory-efficient than GraphAligner in default mode.

Aligning protein sequences
In contrast to other sequence-to-graph aligners available at the moment, SPAligner supports alignment of protein sequences onto assembly graphs. In this section we demonstrate how SPAligner can be efficiently used to identify regions of the assembly graph exhibiting high similarity to known prokaryotic protein sequences.
Contigs originating from metagenomic assemblies can be excessively fragmented due to various reasons, including inter-species repeats (conservative sequences, horizontally transferred genes, etc) and within-species microdiversity [25]. Contig fragmentation can lead to loss of important insights in modern metagenomic studies, which increasingly rely on annotation of assembled contigs for functional analysis of the microbial community members [26,27]. Alignment of known protein sequences onto metagenomic assembly graph provides a complementary approach capable of identifying genes fragmented over several contigs.
The capacity of applying SPAligner in this context is demonstrated below in two experiments, considering alignments of: • coding sequences from entire UniProt (The UniProt Consortium, 2019) database onto an assembly graph of a well-defined synthetic metagenomic community; • a comprehensive set of known antibiotic resistance genes onto an assembly graph for a wastewater metagenomic dataset.
While alignment of a protein sequence with SPAligner might result in several alignment paths (initiated by different anchors), in this work we will only consider single best scoring alignment for each query.

Annotating genes in synthetic metagenomic community
Shakya et al. [28] performed sequencing of synthetic DNA mixture for 64 diverse bacterial and archaeal organisms (SRA acc. no. SRX200676). This dataset, containing 109 million 2 × 100 bp Illumina reads (with mean insert size of 206 bp) has been extensively used for benchmarking of tools for metagenomic analysis [29,30]. Reference genome annotations available for all organisms forming the community contain a total of 196150 protein coding sequences, further referred as reference protein sequences.
We model the situation in which we aim to identify potential close homologs of known protein sequences within the metagenomic assembly graph. In particular, we searched for semi-global alignments of 350 thousand sequences comprising "reviewed" UniProt database of bacterial proteins [31]. In order to identify UniProt sequences that are expected to align to the assembly graph (assuming that the graph represents all necessary sequences), we aligned UniProt sequences to the set of reference protein sequences. We used BLASTp [32] with default settings in order to quickly find most similar sequences to the reference one. As a result, 26209 UniProt sequences were mapped to some reference protein with 90% amino acid identity threshold. We consider assembly graph, built by metaSPAdes v3.12 [29] with default settings, and use SPAligner to identify alignments of UniProt sequences at 90% amino-acid identity threshold. SPAligner was able to align 93% (24324 out of 26209) of UniProt sequences with matching reference proteins (see above), as well as 1299 extra sequences. Further analysis showed that the majority of falsely identified sequences are similar to substrings of some reference protein sequences.

Identification of antibiotic resistance genes in a hospital wastewater sample
In this section we demonstrate how SPAligner can be applied to identify genes of particular interest within real metagenomic dataset.
Spread of antimicrobial resistance is an escalating problem and a threat to public health. Metagenomic sequencing provides an efficient methodology for detection and tracking of the antibiotic resistance genes (ARGs) in the environment samples. Ng et al. [33] explored wastewater and urban surface water metagenomic datasets for the presence of various antimicrobial resistance (AMR) proteins, in particular beta-lactamases. To illustrate the ability of SPAligner to identify additional AMR gene families as compared to conventional approaches based on analysis of assembled contigs, we focus on a hospital wastewater discharge dataset, which, according to the original study, had the highest total coverage of beta-lactamase genes. The dataset consisting of 3.3 million 2×250 bp Illumina reads with mean insert size of 350 bp (SRA acc. no. SRR5997548) was assembled by metaSPAdes v3.12 with default settings.
First, we used state-of-the-art AMRFinder tool [34] to identify AMR genes in assembled contigs. AMRFinder performs the BLASTx [32] searches of the protein sequences from Bacterial Antimicrobial Resistance Reference Gene Database [34], containing protein sequences for over 800 gene families representing 34 classes of antimicrobials and disinfectants. Since we are aiming at the identification of the (almost) complete protein sequences, partial predictions covering no more than 75% of the corresponding database protein were discarded. All other parameters were left to defaults, including the 90% alignment identity threshold.
We then used SPAligner to identify best alignments (with at least 90% amino-acid identity threshold) for all 4810 ARG sequences from the same database onto the graph.
To check whether our sequence-to-graph alignments were able to capture additional AMR protein families as compared to the baseline contig analysis approach, protein sequences identified by AMRFinder and SPAligner were clustered by single-linkage clustering with a 90% similarity cut-off. Clusters where only SPAligner alignments are presented tend to be really far by identity from alignments found by AMRFinder and vice versa. Out of the resulting 89 clusters a single cluster consisted of only AMRFinder prediction. Further analysis showed that SPAligner also successfully identified corresponding protein sequence which spanned 89.7% of the query length, while AMRFinder prediction spans 90.7%. At the same time, 4 clusters contained only SPAligner's alignments. tBLASTx [32] alignments of the corresponding database protein sequences onto the assembled contigs either covered less than 75% of the query or had low identity (< 70%). Figure 2 shows a region of the assembly graph surrounding one of the putative proteins predicted exclusively from SPAligner results. Corresponding alignment of the database entry for IMP-4 beta-lactamase covered the entire query sequence with > 98% identity.
We further relate our analysis to the original study [33] with respect to the identified beta-lactamase families. Alignments were obtained for instances of seven out of Fig. 2 Prediction of putative beta-lactamase protein sequence, fragmented in the metagenomic assembly contigs. a Relevant region of assembly graph visualized by Bandage software [35]. Blue, red, green and brown colors correspond to 4 different contigs within the metaSPAdes assembly. b Visualization of the alignement path for IMP-4 gene (246 aa long, alignment identity -98%). Edges comprising the path are colored in green nine beta-lactamase protein families from the original study (bla KPC , bla CTX−M , bla SHV , bla TEM , bla IMP , bla VIM and bla OXA ). While SPAligner was not able to produce alignments for two remaining gene families (bla NDM and bla CMY ), in the original study their instances were estimated to be poorly covered compared to other families 6 .
Note that in [33] authors tried to identify multiple beta-lactamase sequences within the same protein family, which we are not aiming to do here, since minor differences within highly similar gene sequences are likely to be lost in the metagenomic assembly process. Such high-resolution analysis should be possible within the sequence-to-graph alignment framework via configuring the assembly graph construction stage to preserve necessary genomic variation. This promising direction is left for future research.

Future improvements
In particular, our results suggest that SPAligner is an efficient solution for mapping third generation sequencing data and can also facilitate the identification of known genes in complex metagenomic datasets.
Besides general improvements of the codebase, we are planning to migrate to more efficient libraries for searching anchor alignments [22,24] and extension algorithms [6,13].
An important direction with respect to the nucleotide sequence alignment is improving the alignment of reads structurally different from the genome, represented by the assembly graph [36]. This includes a more reliable strategies for ignoring poorly sequenced read extremities and producing split-read alignments.
With respect to the amino acid sequence alignment our future goals include improving the sensitivity of the anchor alignment to allow for searching of the more diverged protein sequences, as well as development of strategies for identification of multiple closelyrelated gene sequences within the custom metagenomic assembly graphs with preserved variation.

Conclusions
In this work we present SPAligner tool for aligning long diverged molecular sequences to assembly graphs and demonstrated its effectiveness for aligning both reads produced by third-generation sequencing technologies and protein sequences.
SPAligner was benchmarked against the two sequence-to-graph aligners supporting long reads -vg and GraphAligner, and showed highly competitive results on various Pacbio and Oxford Nanopore datasets. At the same time being the only tool that produce alignment of amino acid sequences, SPAligner was successfully used to identify gene sequences, which were not fully recovered by metagenomic assembly.