“Alignment of long nucleotide sequences” section describes the notation and presents the approach taken in SPAligner for semiglobal alignment of nucleotide sequences to the assembly graph. Extension of this approach for aligning amino acid sequences is discussed in “Alignment of aminoacid sequences” section.
Alignment of long nucleotide sequences
Let G be a (directed) compacted de Bruijn graph with edges labeled by nucleotide sequences that we further refer to as assembly graph^{Footnote 1}. Length of a nucleotide sequence S is denoted by S and length of edge e, e, is defined as a length of its label. S[x] denotes xth symbol of S (starting from 0). Position x on string S belongs to a range [0…S] and corresponds to location before S[x], if x<S, and after S[S−1] otherwise. S[a:b] is a substring of S between positions a and b. By the size of graph G, G, we denote the sum of the number of its vertices, edges and total length of all labels.
Position in the graph is naturally defined by a pair of an edge e and position in the sequence of e: p=(e,i), where 0<=i<=e. Note that with such notation there are multiple positions in graph that corresponds to a vertex of G — it is located at the end of each incoming edge and at the beginning of each outgoing edge. G[p] denotes a symbol on position i within edge e. A path P in G is defined by a sequence of consecutive edges e_{1},…,e_{n} and a pair of positions \(pos_{e_{1}}\) and \(pos_{e_{n}}\) on e_{1} and e_{n} respectively. We denote the nucleotide sequence obtained by concatenation of edge labels in a path P (trimmed according to \(pos_{e_{1}}\) and \(pos_{e_{n}}\)) as Label(P).
For the sake of clarity throughout the paper we are going to neglect the fact that adjacent edges produced by popular assembly tools typically overlap (in particular adjacent edges produced by majority of the de Bruijn graph based assemblers overlap by K bp, where K is the kmer size parameter of de Bruijn graph). Additional file 1 “Notes on running aligners” presents a formal description of the assembly graphs accepted by the current implementation of SPAligner.
The semiglobal alignment of a sequence S (query) to an assembly graph G is defined as a path P in G, such that Label(P) has minimal alignment cost against S across all paths in G. As well as other practical tools for searching alignments in large sequence graphs (including vg and GraphAligner) SPAligner first identifies regions of high nucleotide identity between the query and the graph sequences that we refer to as anchor alignments (or anchors) and then attempts to extend them to desired semiglobal alignments.
Below we outline the four steps implemented in SPAligner for aligning a nucleotide sequence query S to an assembly graph G (see Fig. 1), generally following the approach used in alignment module of hybridSPAdes assembler [9]
Anchor search. Anchor alignments between S and the preindexed edge labels are identified using the BWAMEM library [14] (every highscored local alignment between a query and individual edge labels reported by BWAMEM is considered as a potential anchor). Each anchor a is defined by a range on S — [start_{s}(a),end_{s}(a)) and a range on a specific edge e(a) — [start_{e}(a),end_{e}(a)). The span of an anchor a is defined as (end_{s}(a)−start_{s}(a)). Only anchors with span exceeding a certain threshold (by default equal to K — the kmer size parameter of de Bruijn graph) are considered. Since the goal of this step is to get a set of all potentiallyrelevant anchors, BWAMEM settings were adjusted for higher sensitivity. In particular, we modified the defaults to retain as many alignments as possible: secondary alignments were included, mask_level and drop_ratio were assigned to 20, and the seed size was decreased from 17 bp to 14 bp.
Anchor filtering. Unreliable and ambiguous anchors are further discarded based on their size, locations in the graph and the layout on the query sequence. In particular, the following groups of anchors are discarded (in the specified order):

Anchors “in the middle” of long edges, which span less than T bp (T=500, by default). More formally we discard the anchor a if end_{e}(a)−start_{e}(a)+ min(start_{s}(a),start_{e}(a))+min(S−end_{s}(a),e−end_{e}(a)) >3·(end_{e}(a)−start_{e}(a)).

Anchors for which at least half of their query ranges are covered by other anchors.

Anchors spanning less than t bp (t=200, by default).
Anchor chaining. Anchors are assigned weights equal to their span. Two anchors a and b are considered compatible if the minimal distance between them in G does not significantly exceed the distance between their positions on S. SPAligner searches for the heaviest chain of compatible anchors using dynamic programming and considers the resulting chain as a skeleton of the final alignment [9]. Two anchor alignments a and b are considered compatible if the minimal distance between them in G does not significantly exceed the distance in S (formally, dist_{G}(p_{a},p_{b})/(start_{s}(b)−end_{s}(a))>α, where p_{a}=(e(a),end_{e}(a)),p_{b}=(e(b),start_{e}(b) and α defaults to 1.3).
Reconstructing the filling paths. The goal of this step is to identify paths in G between the consecutive anchors in the skeleton (and beyond its leftmost/rightmost anchors) minimizing the alignment cost against corresponding regions of the query sequence. Following a number of previous works on alignment of sequences against general graphs (potentially containing cycles) [9, 13, 15, 16], we use classic reformulation of the alignment problem as a minimalweight path problem in an appropriately constructed alignment graph (or edit graph) [13, 16]. The alignment graph is a weighted directed graph which is constructed in such a way that paths between certain types of vertices unambiguously correspond that valid alignments of the query sequence to G and the alignment costs are equal to the path weights.
The primary reason for using BWA alignments instead of exact matches was to utilize local chaining and extension of the kmer matches implemented in BWA. Compared to the exact matches the resulting anchors allow for easier and more reliable filtering of spurious matches and are better suited for scoring (easier to sensibly prioritize) within our chaining procedure.
Though SPAligner was primarily designed for finding semiglobal alignments, in some cases it may also produce splitread alignments, i.e. separate alignments covering nonoverlapping segments of the query sequence (see Additional file 1 “Support for splitread alignments”).
Sequence to graph alignment via alignment graphs
While in this work we only consider linear gap penalty scoring scheme (with nonnegative gap penalty parameter σ and mismatch cost μ), proposed approach can straightforwardly support affine gap cost model [17].
For the given query sequence fragment, assembly graph and the scoring parameters we define alignment graph SG(G,Sub) with vertices corresponding to all pairs of position in the graph and position in the query sequence.
The weighted ^{Footnote 2} edges are introduced in SG(G,Sub) as follows:

1.
<p,pos_{Sub}>→<p^{′},pos_{Sub}> of weight σ,

2.
<p,pos_{Sub}>→<p,pos_{Sub}+1> of weight σ,

3.
<p,pos_{Sub}>→<p^{′},pos_{Sub}+1>, of weight zero if Sub[pos_{Sub}+1] matches G[p^{′}], and μ otherwise,

4.
<p_{end},pos_{Sub}>→<p_{start},pos_{Sub}> of weight zero if p_{end} and p_{start} both corresponds to the same vertex in G (i.e. p_{end}=(e_{1},e_{1}),p_{start}=(e_{2},0) and the end of edge e_{1} is same vertex as the start of e_{2}),
where p iterates through all positions in G, while p^{′} iterates through all graph positions extending p (i.e. p=(e,i) and p^{′}=(e,i+1) for some edge e and index i<e), and pos_{Sub} — through all (permissible) positions in Sub.
Consider a pair of consecutive anchors in the skeleton chain, a_{i}: ([start_{s}(a_{i}),end_{s}(a_{i})), (e(a_{i}),[start_{e}(a_{i}),end_{e}(a_{i})))), and a_{i+1}: ([start_{s}(a_{i+1}),end_{s}(a_{i+1}), (e(a_{i+1}),[start_{e}(a_{i+1}),end_{e}(a_{i+1}))), and a substring of query Sub=S[end_{s}(a_{i}):start_{s}(a_{i+1})]. ED(S_{1},S_{2}) denotes alignment cost (linear gap penalty function with parameters μ and σ) between strings S_{1} and S_{2}. Our goal is to find a path P, connecting positions p_{1}=(e(a_{i}),end_{e}(a_{i})) and p_{2}=(e(a_{i+1}),start_{e}(a_{i+1})) in the graph G minimizing ED(Label(P),Sub). It is easy to see (i.e. Lemma 5 in [15]) that path P can be recovered from the minimalweighted path connecting vertices <p_{1},0> and <p_{2},Sub> in SG(G,Sub), moreover, its weight is equal to ED(Label(P),Sub). Note that SG(G,Sub) only has nonnegative weights, so minimalweighted paths search can be efficiently performed by Dijkstra algorithm [9, 15].
In addition to reconstruction of paths between the consecutive anchors of the alignment skeleton, SPAligner also attempts to extend the alignment beyond the leftmost/rightmost anchors. Without loss of generality, we consider finding optimal alignment of the query fragment Suf beyond end_{s}(a_{last}), fixing its starting position in the graph G as p_{s}=(e(a_{last}),end_{e}(a_{last})). The sought answer can be recovered from the minimal weight path in SG(G,Suf) across all paths connecting <p_{s},0> with any of the vertices {<p_{t},Suf>}, where p_{t} iterates through all positions in G, which can also be found by Dijkstra algorithm.
To speedup the optimal path search SPAligner implements multiple heuristics (see Additional file 1 “Alignment extension heuristics and thresholds”). In particular, while we consider the entire graph G in the description above, a much smaller subgraph surrounding anchor alignments is considered by the SPAligner implementation. The implementation also benefits from available highly optimized solutions for sequencetosequence alignment (see Additional file 1 “Leveraging fast sequencetosequence alignment methods”).
More efficient algorithms in terms of worst case time complexity have been earlier suggested in [18, 19] for an important case of edit distance or Levenshtein distance (μ and σ equal to 1). Additional file 1“Shortest paths search in binaryweighted graphs” presents a simple modification of the basic approach described above achieving the same time complexity of O(G·Sub).
It worth to note that recently Jain et al.[6] suggested an elegant algorithm extending the same time complexity to both linear and affine gap penalty functions.
Alignment of aminoacid sequences
Consider an assembly graph G with edges labeled by nucleotide sequences and an amino acid query sequence S_{p}. We will refer to a path P in G as an AApath, if its label translates into a valid (i.e.without any stop codon) protein sequence (denoted by Translated(Label(P))). We define semiglobal alignment of a sequence S_{p} to graph G as an AApath P in G, such that the alignment cost between S_{p} and Translated(Label(P)) is minimal across all AApaths in G^{Footnote 3}.
While considering fixed mismatch penalties is usually sufficient for aligning nucleotide sequences, most practical scoring schemes for amino acid sequence alignment use specialized substitution matrices (e.g. BLOSUM or PAM matrices) [20], which take into account the substitution rates between different pairs of amino acids over time as well as the relative frequencies of various amino acids.
To reconcile our costminimization formulation with the fact that typical substitution matrix M assumes better alignments to have higher cumulative scores, we will consider elements of matrix M^{′}=−M as match/mismatch score (by default M=BLOSUM90).
Again, though we only consider linear gap penalties with coefficient σ (in our experiments we used σ=5) here, affine gap penalties can also be implemented without significant increase of running time or memory footprint [19].
Alignment graph for amino acid sequence alignment
For the given query sequence, assembly graph, substitution matrix and gap penalty coefficient (σ), we introduce alignment graph, SG_{p}. Each vertex corresponds to a triplet \(< p, pos_{Sub_{p}}, fs>\), where p is position in the graph G, \(pos_{Sub_{p}}\) is the position in the query fragment sequence and fs is a frame state string (of length 0−2), encoding a partial codon sequence (empty string is denoted by ε). We further define the graph by the iterative procedure, initialized by introducing the vertices (p,0,ε), where p iterates through all positions in G. Then, until no more edge can be added, we introduce edges and required vertices following one of the rules below:

1.
\(< p, pos_{Sub_{p}}, fs> \rightarrow < p', pos_{Sub_{p}}, fs + G[p']>\) of zero weight if fs<2,

2.
\(< p, pos_{Sub_{p}}, fs> \rightarrow < p', pos_{Sub_{p}}, \varepsilon >\) of weight σ if fs=2,

3.
\(< p, pos_{Sub_{p}}, fs> \rightarrow < p', pos_{Sub_{p}} + 1, \varepsilon >\) of weight \(\phantom {\dot {i}\!}M'[Translate(fs + G[p']), Sub_{p}[pos_{Sub_{p}} + 1]]\), and fs=2,

4.
\(\phantom {\dot {i}\!}< p, pos_{Sub_{p}}, \varepsilon > \rightarrow < p, pos_{Sub_{p}} + 1, \varepsilon >\) of weight σ,

5.
\(\phantom {\dot {i}\!}< p_{end}, pos_{Sub_{p}}, fs> \rightarrow < p_{start}, pos_{Sub_{p}}, fs>\) of weight zero if p_{end} and p_{start} correspond to the same position in graph on consecutive edges of G (i.e. p_{end}=(e_{1},e_{1}),p_{start}=(e_{2},0) and end of edge e_{1} is same vertex as the start of e_{2}),
where p is a position in G, p^{′} is a position in G extending p (i.e. p=(e,i) and p^{′}=(e,i+1) for some edge e and index i<e), \(pos_{Sub_{p}}\) – is a position in Sub_{p}.
As before, one can find the semiglobal alignment path of the sequence Sub_{p} (fixed on one or both sides by anchor alignments) by searching for minimal weight paths in SG_{p}.
In particular, it is easy to show that the semiglobal alignment of the amino acid sequence Sub_{p} starting from position p_{s} in the graph G corresponds to the minimal weight path in SG_{p}(G,Sub_{p}) connecting vertex <p_{s},0,ε> to any of the vertices {<p_{t},Sub_{p},0> }, where p_{t} iterates through all positions in G.
Note that since any reasonable matrix M^{′} contains negative elements, the graph SG_{p} will have edges of negative weight, preventing direct application of Dijkstra algorithm and its variations to the search of such minimal weight paths.
While Algorithm 2 from [19] (slightly modified to account for the frame states) can be applied here, we use an alternative approach resulting in the same time complexity^{Footnote 4}.
We define a new graph \(SG^{\prime }_{p}(G, Sub_{p})\) increasing the weights of edges introduced by rules 3 and 4 by an arbitrary constant C, exceeding the maximum value in M. Since the resulting graph has only edges with nonnegative weights, Dijkstra algorithm can be used for the shortest paths search in this graph.
At the same time it is easy to show that the minimum weight path between two arbitrary nodes of the form s=<x,0,ε> and e=<y,Sub_{p}+1,ε> in \(SG^{\prime }_{p}\) is also the minimum weight path between s and e in SG_{p}. Indeed, if we consider an arbitrary path P in SG_{p} between s and e of weight w_{P}, its weight after transformation (in \(SG^{\prime }_{p}\)) is w_{p}+C·Sub_{p}, because P contains exactly Sub_{p} edges with modified weights. Therefore, the weights of all paths between two vertices s and e will be increased by the same constant Sub_{p}·C.
Amino acid sequence alignment pipeline
Below we outline the four steps implemented in SPAligner for aligning an amino acid sequence S_{p} to an assembly graph G, highlighting the differences with the pipeline presented in “Alignment of long nucleotide sequences” section.
Anchors search. As before SPAligner first identifies anchor alignments between S_{p} and the preindexed sixframe translated edge labels. Current implementation relies on finding highscoring local alignments between the “canonical” nucleotide representations of protein sequences, obtained by substituting every amino acid by its lexicographically minimal codon. The search is performed by BWAMEM configured to prevent gapped alignments (gap_open penalty set to infinity) and we additionally check that the aligned ranges are consistently located with respect to the reading frame.
Anchor filtering. Unreliable and ambiguous anchors are discarded based on their size and query range. Anchors shorter than K are filtered out. Smaller lengths of typical protein sequences as compared to long read sequences allow us to omit the anchor chaining step in order to increase the accuracy of the resulting amino acid sequence alignments. It also eliminates the need to coordinate the reading frames of the chained anchors.
Alignment extension. SPAligher then uses the adjusted alignment graph approach (see “Alignment graph for amino acid sequence alignment”) to search for optimal alignments extending each of the remaining anchors. The procedure employs several heuristics and constraints, which can be found in Additional file 1 “Alignment extension heuristics and thresholds”.
Alignments postprocessing. SPAligner removes alignment paths shorter than a certain fraction of the query coding sequence length (0.8 by default) and filters perfectly duplicating paths. Resulting paths are then translated and rescored against the query sequence with Parasail library [21].