“Alignment of long nucleotide sequences” section describes the notation and presents the approach taken in SPAligner for semi-global alignment of nucleotide sequences to the assembly graph. Extension of this approach for aligning amino acid sequences is discussed in “Alignment of amino-acid sequences” section.
Alignment of long nucleotide sequences
Let G be a (directed) compacted de Bruijn graph with edges labeled by nucleotide sequences that we further refer to as assembly graphFootnote 1. Length of a nucleotide sequence S is denoted by |S| and length of edge e, |e|, is defined as a length of its label. S[x] denotes x-th symbol of S (starting from 0). Position x on string S belongs to a range [0…|S|] and corresponds to location before S[x], if x<|S|, and after S[|S|−1] otherwise. S[a:b] is a substring of S between positions a and b. By the size of graph G, |G|, we denote the sum of the number of its vertices, edges and total length of all labels.
Position in the graph is naturally defined by a pair of an edge e and position in the sequence of e: p=(e,i), where 0<=i<=|e|. Note that with such notation there are multiple positions in graph that corresponds to a vertex of G — it is located at the end of each incoming edge and at the beginning of each outgoing edge. G[p] denotes a symbol on position i within edge e. A path P in G is defined by a sequence of consecutive edges e1,…,en and a pair of positions \(pos_{e_{1}}\) and \(pos_{e_{n}}\) on e1 and en respectively. We denote the nucleotide sequence obtained by concatenation of edge labels in a path P (trimmed according to \(pos_{e_{1}}\) and \(pos_{e_{n}}\)) as Label(P).
For the sake of clarity throughout the paper we are going to neglect the fact that adjacent edges produced by popular assembly tools typically overlap (in particular adjacent edges produced by majority of the de Bruijn graph based assemblers overlap by K bp, where K is the kmer size parameter of de Bruijn graph). Additional file 1 “Notes on running aligners” presents a formal description of the assembly graphs accepted by the current implementation of SPAligner.
The semi-global alignment of a sequence S (query) to an assembly graph G is defined as a path P in G, such that Label(P) has minimal alignment cost against S across all paths in G. As well as other practical tools for searching alignments in large sequence graphs (including vg and GraphAligner) SPAligner first identifies regions of high nucleotide identity between the query and the graph sequences that we refer to as anchor alignments (or anchors) and then attempts to extend them to desired semi-global alignments.
Below we outline the four steps implemented in SPAligner for aligning a nucleotide sequence query S to an assembly graph G (see Fig. 1), generally following the approach used in alignment module of hybridSPAdes assembler [9]
Anchor search. Anchor alignments between S and the pre-indexed edge labels are identified using the BWA-MEM library [14] (every high-scored local alignment between a query and individual edge labels reported by BWA-MEM is considered as a potential anchor). Each anchor a is defined by a range on S — [starts(a),ends(a)) and a range on a specific edge e(a) — [starte(a),ende(a)). The span of an anchor a is defined as (ends(a)−starts(a)). Only anchors with span exceeding a certain threshold (by default equal to K — the kmer size parameter of de Bruijn graph) are considered. Since the goal of this step is to get a set of all potentially-relevant anchors, BWA-MEM settings were adjusted for higher sensitivity. In particular, we modified the defaults to retain as many alignments as possible: secondary alignments were included, mask_level and drop_ratio were assigned to 20, and the seed size was decreased from 17 bp to 14 bp.
Anchor filtering. Unreliable and ambiguous anchors are further discarded based on their size, locations in the graph and the layout on the query sequence. In particular, the following groups of anchors are discarded (in the specified order):
-
Anchors “in the middle” of long edges, which span less than T bp (T=500, by default). More formally we discard the anchor a if ende(a)−starte(a)+ min(starts(a),starte(a))+min(|S|−ends(a),|e|−ende(a)) >3·(ende(a)−starte(a)).
-
Anchors for which at least half of their query ranges are covered by other anchors.
-
Anchors spanning less than t bp (t=200, by default).
Anchor chaining. Anchors are assigned weights equal to their span. Two anchors a and b are considered compatible if the minimal distance between them in G does not significantly exceed the distance between their positions on S. SPAligner searches for the heaviest chain of compatible anchors using dynamic programming and considers the resulting chain as a skeleton of the final alignment [9]. Two anchor alignments a and b are considered compatible if the minimal distance between them in G does not significantly exceed the distance in S (formally, distG(pa,pb)/(starts(b)−ends(a))>α, where pa=(e(a),ende(a)),pb=(e(b),starte(b) and α defaults to 1.3).
Reconstructing the filling paths. The goal of this step is to identify paths in G between the consecutive anchors in the skeleton (and beyond its leftmost/rightmost anchors) minimizing the alignment cost against corresponding regions of the query sequence. Following a number of previous works on alignment of sequences against general graphs (potentially containing cycles) [9, 13, 15, 16], we use classic reformulation of the alignment problem as a minimal-weight path problem in an appropriately constructed alignment graph (or edit graph) [13, 16]. The alignment graph is a weighted directed graph which is constructed in such a way that paths between certain types of vertices unambiguously correspond that valid alignments of the query sequence to G and the alignment costs are equal to the path weights.
The primary reason for using BWA alignments instead of exact matches was to utilize local chaining and extension of the k-mer matches implemented in BWA. Compared to the exact matches the resulting anchors allow for easier and more reliable filtering of spurious matches and are better suited for scoring (easier to sensibly prioritize) within our chaining procedure.
Though SPAligner was primarily designed for finding semi-global alignments, in some cases it may also produce split-read alignments, i.e. separate alignments covering non-overlapping segments of the query sequence (see Additional file 1 “Support for split-read alignments”).
Sequence to graph alignment via alignment graphs
While in this work we only consider linear gap penalty scoring scheme (with non-negative gap penalty parameter σ and mismatch cost μ), proposed approach can straightforwardly support affine gap cost model [17].
For the given query sequence fragment, assembly graph and the scoring parameters we define alignment graph SG(G,Sub) with vertices corresponding to all pairs of position in the graph and position in the query sequence.
The weighted Footnote 2 edges are introduced in SG(G,Sub) as follows:
-
1.
<p,posSub>→<p′,posSub> of weight σ,
-
2.
<p,posSub>→<p,posSub+1> of weight σ,
-
3.
<p,posSub>→<p′,posSub+1>, of weight zero if Sub[posSub+1] matches G[p′], and μ otherwise,
-
4.
<pend,posSub>→<pstart,posSub> of weight zero if pend and pstart both corresponds to the same vertex in G (i.e. pend=(e1,|e1|),pstart=(e2,0) and the end of edge e1 is same vertex as the start of e2),
where p iterates through all positions in G, while p′ iterates through all graph positions extending p (i.e. p=(e,i) and p′=(e,i+1) for some edge e and index i<|e|), and posSub — through all (permissible) positions in Sub.
Consider a pair of consecutive anchors in the skeleton chain, ai: ([starts(ai),ends(ai)), (e(ai),[starte(ai),ende(ai)))), and ai+1: ([starts(ai+1),ends(ai+1), (e(ai+1),[starte(ai+1),ende(ai+1))), and a substring of query Sub=S[ends(ai):starts(ai+1)]. ED(S1,S2) denotes alignment cost (linear gap penalty function with parameters μ and σ) between strings S1 and S2. Our goal is to find a path P, connecting positions p1=(e(ai),ende(ai)) and p2=(e(ai+1),starte(ai+1)) in the graph G minimizing ED(Label(P),Sub). It is easy to see (i.e. Lemma 5 in [15]) that path P can be recovered from the minimal-weighted path connecting vertices <p1,0> and <p2,|Sub|> in SG(G,Sub), moreover, its weight is equal to ED(Label(P),Sub). Note that SG(G,Sub) only has non-negative weights, so minimal-weighted paths search can be efficiently performed by Dijkstra algorithm [9, 15].
In addition to reconstruction of paths between the consecutive anchors of the alignment skeleton, SPAligner also attempts to extend the alignment beyond the leftmost/rightmost anchors. Without loss of generality, we consider finding optimal alignment of the query fragment Suf beyond ends(alast), fixing its starting position in the graph G as ps=(e(alast),ende(alast)). The sought answer can be recovered from the minimal weight path in SG(G,Suf) across all paths connecting <ps,0> with any of the vertices {<pt,|Suf|>}, where pt iterates through all positions in G, which can also be found by Dijkstra algorithm.
To speed-up the optimal path search SPAligner implements multiple heuristics (see Additional file 1 “Alignment extension heuristics and thresholds”). In particular, while we consider the entire graph G in the description above, a much smaller subgraph surrounding anchor alignments is considered by the SPAligner implementation. The implementation also benefits from available highly optimized solutions for sequence-to-sequence alignment (see Additional file 1 “Leveraging fast sequence-to-sequence alignment methods”).
More efficient algorithms in terms of worst case time complexity have been earlier suggested in [18, 19] for an important case of edit distance or Levenshtein distance (μ and σ equal to 1). Additional file 1“Shortest paths search in binary-weighted graphs” presents a simple modification of the basic approach described above achieving the same time complexity of O(|G|·|Sub|).
It worth to note that recently Jain et al.[6] suggested an elegant algorithm extending the same time complexity to both linear and affine gap penalty functions.
Alignment of amino-acid sequences
Consider an assembly graph G with edges labeled by nucleotide sequences and an amino acid query sequence Sp. We will refer to a path P in G as an AA-path, if its label translates into a valid (i.e.without any stop codon) protein sequence (denoted by Translated(Label(P))). We define semi-global alignment of a sequence Sp to graph G as an AA-path P in G, such that the alignment cost between Sp and Translated(Label(P)) is minimal across all AA-paths in GFootnote 3.
While considering fixed mismatch penalties is usually sufficient for aligning nucleotide sequences, most practical scoring schemes for amino acid sequence alignment use specialized substitution matrices (e.g. BLOSUM or PAM matrices) [20], which take into account the substitution rates between different pairs of amino acids over time as well as the relative frequencies of various amino acids.
To reconcile our cost-minimization formulation with the fact that typical substitution matrix M assumes better alignments to have higher cumulative scores, we will consider elements of matrix M′=−M as match/mismatch score (by default M=BLOSUM90).
Again, though we only consider linear gap penalties with coefficient σ (in our experiments we used σ=5) here, affine gap penalties can also be implemented without significant increase of running time or memory footprint [19].
Alignment graph for amino acid sequence alignment
For the given query sequence, assembly graph, substitution matrix and gap penalty coefficient (σ), we introduce alignment graph, SGp. Each vertex corresponds to a triplet \(< p, pos_{Sub_{p}}, fs>\), where p is position in the graph G, \(pos_{Sub_{p}}\) is the position in the query fragment sequence and fs is a frame state string (of length 0−2), encoding a partial codon sequence (empty string is denoted by ε). We further define the graph by the iterative procedure, initialized by introducing the vertices (p,0,ε), where p iterates through all positions in G. Then, until no more edge can be added, we introduce edges and required vertices following one of the rules below:
-
1.
\(< p, pos_{Sub_{p}}, fs> \rightarrow < p', pos_{Sub_{p}}, fs + G[p']>\) of zero weight if |fs|<2,
-
2.
\(< p, pos_{Sub_{p}}, fs> \rightarrow < p', pos_{Sub_{p}}, \varepsilon >\) of weight σ if |fs|=2,
-
3.
\(< p, pos_{Sub_{p}}, fs> \rightarrow < p', pos_{Sub_{p}} + 1, \varepsilon >\) of weight \(\phantom {\dot {i}\!}M'[Translate(fs + G[p']), Sub_{p}[pos_{Sub_{p}} + 1]]\), and |fs|=2,
-
4.
\(\phantom {\dot {i}\!}< p, pos_{Sub_{p}}, \varepsilon > \rightarrow < p, pos_{Sub_{p}} + 1, \varepsilon >\) of weight σ,
-
5.
\(\phantom {\dot {i}\!}< p_{end}, pos_{Sub_{p}}, fs> \rightarrow < p_{start}, pos_{Sub_{p}}, fs>\) of weight zero if pend and pstart correspond to the same position in graph on consecutive edges of G (i.e. pend=(e1,|e1|),pstart=(e2,0) and end of edge e1 is same vertex as the start of e2),
where p is a position in G, p′ is a position in G extending p (i.e. p=(e,i) and p′=(e,i+1) for some edge e and index i<|e|), \(pos_{Sub_{p}}\) – is a position in Subp.
As before, one can find the semi-global alignment path of the sequence Subp (fixed on one or both sides by anchor alignments) by searching for minimal weight paths in SGp.
In particular, it is easy to show that the semi-global alignment of the amino acid sequence Subp starting from position ps in the graph G corresponds to the minimal weight path in SGp(G,Subp) connecting vertex <ps,0,ε> to any of the vertices {<pt,|Subp|,0> }, where pt iterates through all positions in G.
Note that since any reasonable matrix M′ contains negative elements, the graph SGp will have edges of negative weight, preventing direct application of Dijkstra algorithm and its variations to the search of such minimal weight paths.
While Algorithm 2 from [19] (slightly modified to account for the frame states) can be applied here, we use an alternative approach resulting in the same time complexityFootnote 4.
We define a new graph \(SG^{\prime }_{p}(G, Sub_{p})\) increasing the weights of edges introduced by rules 3 and 4 by an arbitrary constant C, exceeding the maximum value in M. Since the resulting graph has only edges with non-negative weights, Dijkstra algorithm can be used for the shortest paths search in this graph.
At the same time it is easy to show that the minimum weight path between two arbitrary nodes of the form s=<x,0,ε> and e=<y,|Subp|+1,ε> in \(SG^{\prime }_{p}\) is also the minimum weight path between s and e in SGp. Indeed, if we consider an arbitrary path P in SGp between s and e of weight wP, its weight after transformation (in \(SG^{\prime }_{p}\)) is wp+C·|Subp|, because P contains exactly |Subp| edges with modified weights. Therefore, the weights of all paths between two vertices s and e will be increased by the same constant |Subp|·C.
Amino acid sequence alignment pipeline
Below we outline the four steps implemented in SPAligner for aligning an amino acid sequence Sp to an assembly graph G, highlighting the differences with the pipeline presented in “Alignment of long nucleotide sequences” section.
Anchors search. As before SPAligner first identifies anchor alignments between Sp and the pre-indexed six-frame translated edge labels. Current implementation relies on finding high-scoring local alignments between the “canonical” nucleotide representations of protein sequences, obtained by substituting every amino acid by its lexicographically minimal codon. The search is performed by BWA-MEM configured to prevent gapped alignments (gap_open penalty set to infinity) and we additionally check that the aligned ranges are consistently located with respect to the reading frame.
Anchor filtering. Unreliable and ambiguous anchors are discarded based on their size and query range. Anchors shorter than K are filtered out. Smaller lengths of typical protein sequences as compared to long read sequences allow us to omit the anchor chaining step in order to increase the accuracy of the resulting amino acid sequence alignments. It also eliminates the need to coordinate the reading frames of the chained anchors.
Alignment extension. SPAligher then uses the adjusted alignment graph approach (see “Alignment graph for amino acid sequence alignment”) to search for optimal alignments extending each of the remaining anchors. The procedure employs several heuristics and constraints, which can be found in Additional file 1 “Alignment extension heuristics and thresholds”.
Alignments post-processing. SPAligner removes alignment paths shorter than a certain fraction of the query coding sequence length (0.8 by default) and filters perfectly duplicating paths. Resulting paths are then translated and re-scored against the query sequence with Parasail library [21].