SPAligner: alignment of long diverged molecular sequences to assembly graphs

Dvorkina, Tatiana; Antipov, Dmitry; Korobeynikov, Anton; Nurk, Sergey

doi:10.1186/s12859-020-03590-7

Volume 21 Supplement 12

Selected abstracts and papers of Bioinformatics: from Algorithms to Applications 2019 conference

Software
Open access
Published: 24 July 2020

SPAligner: alignment of long diverged molecular sequences to assembly graphs

Tatiana Dvorkina ORCID: orcid.org/0000-0003-2253-1702¹,
Dmitry Antipov¹,
Anton Korobeynikov^1,2 &
…
Sergey Nurk³

BMC Bioinformatics volume 21, Article number: 306 (2020) Cite this article

2881 Accesses
12 Citations
5 Altmetric
Metrics details

Abstract

Background

Graph-based representation of genome assemblies has been recently used in different contexts — from improved reconstruction of plasmid sequences and refined analysis of metagenomic data to read error correction and reference-free haplotype reconstruction. While many of these applications heavily utilize the alignment of long nucleotide sequences to assembly graphs, first general-purpose software tools for finding such alignments have been released only recently and their deficiencies and limitations are yet to be discovered. Moreover, existing tools can not perform alignment of amino acid sequences, which could prove useful in various contexts — in particular the analysis of metagenomic sequencing data.

Results

In this work we present a novel SPAligner (Saint-Petersburg Aligner) tool for aligning long diverged nucleotide and amino acid sequences to assembly graphs. We demonstrate that SPAligner is an efficient solution for mapping third generation sequencing reads onto assembly graphs of various complexity and also show how it can facilitate the identification of known genes in complex metagenomic datasets.

Conclusions

Our work will facilitate accelerating the development of graph-based approaches in solving sequence to genome assembly alignment problem. SPAligner is implemented as a part of SPAdes tools library and is available on Github.

Background

Many popular short read assemblers [1–3] provide the user not only with a set of contig sequences, but also with assembly graphs, encoding the information on the potential adjacencies of the assembled sequences. Naturally arising problem of sequence-to-graph alignment has been a topic of many recent studies [4–8]. Identifying alignments of long error-prone reads (such as Pacbio and ONT reads) to assembly graphs is particularly important and has recently been applied to hybrid genome assembly [9, 10], read error correction [11], and haplotype separation [12]. At the same time, the choice of the practical aligners supporting long nucleotide sequences is currently limited to vg [4] and GraphAligner [13], both of which are under active development. Moreover, to the best of our knowledge, no existing graph-based aligner supports alignment of amino acid sequences.

Here we present the SPAligner (Saint Petersburg Aligner) tool for aligning long diverged molecular (both nucleotide and amino acid) sequences against assembly graphs produced by the popular short-read assemblers. The project stemmed from our previous efforts on the long-read alignment within the hybridSPAdes assembler [9]. Our benchmarks on various Pacbio and Oxford Nanopore datasets show that SPAligner is highly competitive to vg and GraphAligner in aligning long error-prone reads. We also demonstrate SPAligner’s ability to accurately align amino-acid sequences (with up to 90% amino acid identity) onto complex assembly graphs of metagenomic datasets. To further motivate this application we show how SPAligner can be used for identification of biologically important (antibiotic-resistance) genes, which remain under the radar of conventional pipelines due to assembly fragmentation (e.g. genes exhibiting high variability in complex environmental samples).

Implementation

“Alignment of long nucleotide sequences” section describes the notation and presents the approach taken in SPAligner for semi-global alignment of nucleotide sequences to the assembly graph. Extension of this approach for aligning amino acid sequences is discussed in “Alignment of amino-acid sequences” section.

Alignment of long nucleotide sequences

Let G be a (directed) compacted de Bruijn graph with edges labeled by nucleotide sequences that we further refer to as assembly graph^{Footnote 1}. Length of a nucleotide sequence S is denoted by |S| and length of edge e, |e|, is defined as a length of its label. S[x] denotes x-th symbol of S (starting from 0). Position x on string S belongs to a range [0…|S|] and corresponds to location before S[x], if x<|S|, and after S[|S|−1] otherwise. S[a:b] is a substring of S between positions a and b. By the size of graph G, |G|, we denote the sum of the number of its vertices, edges and total length of all labels.

Position in the graph is naturally defined by a pair of an edge e and position in the sequence of e: p=(e,i), where 0<=i<=|e|. Note that with such notation there are multiple positions in graph that corresponds to a vertex of G — it is located at the end of each incoming edge and at the beginning of each outgoing edge. G[p] denotes a symbol on position i within edge e. A path P in G is defined by a sequence of consecutive edges e₁,…,e_n and a pair of positions \(pos_{e_{1}}\) and \(pos_{e_{n}}\) on e₁ and e_n respectively. We denote the nucleotide sequence obtained by concatenation of edge labels in a path P (trimmed according to \(pos_{e_{1}}\) and \(pos_{e_{n}}\)) as Label(P).

For the sake of clarity throughout the paper we are going to neglect the fact that adjacent edges produced by popular assembly tools typically overlap (in particular adjacent edges produced by majority of the de Bruijn graph based assemblers overlap by K bp, where K is the kmer size parameter of de Bruijn graph). Additional file 1 “Notes on running aligners” presents a formal description of the assembly graphs accepted by the current implementation of SPAligner.

The semi-global alignment of a sequence S (query) to an assembly graph G is defined as a path P in G, such that Label(P) has minimal alignment cost against S across all paths in G. As well as other practical tools for searching alignments in large sequence graphs (including vg and GraphAligner) SPAligner first identifies regions of high nucleotide identity between the query and the graph sequences that we refer to as anchor alignments (or anchors) and then attempts to extend them to desired semi-global alignments.

Below we outline the four steps implemented in SPAligner for aligning a nucleotide sequence query S to an assembly graph G (see Fig. 1), generally following the approach used in alignment module of hybridSPAdes assembler [9]

Anchor search. Anchor alignments between S and the pre-indexed edge labels are identified using the BWA-MEM library [14] (every high-scored local alignment between a query and individual edge labels reported by BWA-MEM is considered as a potential anchor). Each anchor a is defined by a range on S — [start_s(a),end_s(a)) and a range on a specific edge e(a) — [start_e(a),end_e(a)). The span of an anchor a is defined as (end_s(a)−start_s(a)). Only anchors with span exceeding a certain threshold (by default equal to K — the kmer size parameter of de Bruijn graph) are considered. Since the goal of this step is to get a set of all potentially-relevant anchors, BWA-MEM settings were adjusted for higher sensitivity. In particular, we modified the defaults to retain as many alignments as possible: secondary alignments were included, mask_level and drop_ratio were assigned to 20, and the seed size was decreased from 17 bp to 14 bp.

Anchor filtering. Unreliable and ambiguous anchors are further discarded based on their size, locations in the graph and the layout on the query sequence. In particular, the following groups of anchors are discarded (in the specified order):

Anchors “in the middle” of long edges, which span less than T bp (T=500, by default). More formally we discard the anchor a if end_e(a)−start_e(a)+ min(start_s(a),start_e(a))+min(|S|−end_s(a),|e|−end_e(a)) >3·(end_e(a)−start_e(a)).
Anchors for which at least half of their query ranges are covered by other anchors.
Anchors spanning less than t bp (t=200, by default).

Anchor chaining. Anchors are assigned weights equal to their span. Two anchors a and b are considered compatible if the minimal distance between them in G does not significantly exceed the distance between their positions on S. SPAligner searches for the heaviest chain of compatible anchors using dynamic programming and considers the resulting chain as a skeleton of the final alignment [9]. Two anchor alignments a and b are considered compatible if the minimal distance between them in G does not significantly exceed the distance in S (formally, dist_G(p_a,p_b)/(start_s(b)−end_s(a))>α, where p_a=(e(a),end_e(a)),p_b=(e(b),start_e(b) and α defaults to 1.3).

Reconstructing the filling paths. The goal of this step is to identify paths in G between the consecutive anchors in the skeleton (and beyond its leftmost/rightmost anchors) minimizing the alignment cost against corresponding regions of the query sequence. Following a number of previous works on alignment of sequences against general graphs (potentially containing cycles) [9, 13, 15, 16], we use classic reformulation of the alignment problem as a minimal-weight path problem in an appropriately constructed alignment graph (or edit graph) [13, 16]. The alignment graph is a weighted directed graph which is constructed in such a way that paths between certain types of vertices unambiguously correspond that valid alignments of the query sequence to G and the alignment costs are equal to the path weights.

The primary reason for using BWA alignments instead of exact matches was to utilize local chaining and extension of the k-mer matches implemented in BWA. Compared to the exact matches the resulting anchors allow for easier and more reliable filtering of spurious matches and are better suited for scoring (easier to sensibly prioritize) within our chaining procedure.

Though SPAligner was primarily designed for finding semi-global alignments, in some cases it may also produce split-read alignments, i.e. separate alignments covering non-overlapping segments of the query sequence (see Additional file 1 “Support for split-read alignments”).

Sequence to graph alignment via alignment graphs

While in this work we only consider linear gap penalty scoring scheme (with non-negative gap penalty parameter σ and mismatch cost μ), proposed approach can straightforwardly support affine gap cost model [17].

For the given query sequence fragment, assembly graph and the scoring parameters we define alignment graph SG(G,Sub) with vertices corresponding to all pairs of position in the graph and position in the query sequence.

The weighted ^{Footnote 2} edges are introduced in SG(G,Sub) as follows:

1.
<p,pos_Sub>→<p^′,pos_Sub> of weight σ,
2.
<p,pos_Sub>→<p,pos_Sub+1> of weight σ,
3.
<p,pos_Sub>→<p^′,pos_Sub+1>, of weight zero if Sub[pos_Sub+1] matches G[p^′], and μ otherwise,
4.
<p_end,pos_Sub>→<p_start,pos_Sub> of weight zero if p_end and p_start both corresponds to the same vertex in G (i.e. p_end=(e₁,|e₁|),p_start=(e₂,0) and the end of edge e₁ is same vertex as the start of e₂),

where p iterates through all positions in G, while p^′ iterates through all graph positions extending p (i.e. p=(e,i) and p^′=(e,i+1) for some edge e and index i<|e|), and pos_Sub — through all (permissible) positions in Sub.

Consider a pair of consecutive anchors in the skeleton chain, a_i: ([start_s(a_i),end_s(a_i)), (e(a_i),[start_e(a_i),end_e(a_i)))), and a_i+1: ([start_s(a_i+1),end_s(a_i+1), (e(a_i+1),[start_e(a_i+1),end_e(a_i+1))), and a substring of query Sub=S[end_s(a_i):start_s(a_i+1)]. ED(S₁,S₂) denotes alignment cost (linear gap penalty function with parameters μ and σ) between strings S₁ and S₂. Our goal is to find a path P, connecting positions p₁=(e(a_i),end_e(a_i)) and p₂=(e(a_i+1),start_e(a_i+1)) in the graph G minimizing ED(Label(P),Sub). It is easy to see (i.e. Lemma 5 in [15]) that path P can be recovered from the minimal-weighted path connecting vertices <p₁,0> and <p₂,|Sub|> in SG(G,Sub), moreover, its weight is equal to ED(Label(P),Sub). Note that SG(G,Sub) only has non-negative weights, so minimal-weighted paths search can be efficiently performed by Dijkstra algorithm [9, 15].

In addition to reconstruction of paths between the consecutive anchors of the alignment skeleton, SPAligner also attempts to extend the alignment beyond the leftmost/rightmost anchors. Without loss of generality, we consider finding optimal alignment of the query fragment Suf beyond end_s(a_last), fixing its starting position in the graph G as p_s=(e(a_last),end_e(a_last)). The sought answer can be recovered from the minimal weight path in SG(G,Suf) across all paths connecting <p_s,0> with any of the vertices {<p_t,|Suf|>}, where p_t iterates through all positions in G, which can also be found by Dijkstra algorithm.

To speed-up the optimal path search SPAligner implements multiple heuristics (see Additional file 1 “Alignment extension heuristics and thresholds”). In particular, while we consider the entire graph G in the description above, a much smaller subgraph surrounding anchor alignments is considered by the SPAligner implementation. The implementation also benefits from available highly optimized solutions for sequence-to-sequence alignment (see Additional file 1 “Leveraging fast sequence-to-sequence alignment methods”).

More efficient algorithms in terms of worst case time complexity have been earlier suggested in [18, 19] for an important case of edit distance or Levenshtein distance (μ and σ equal to 1). Additional file 1“Shortest paths search in binary-weighted graphs” presents a simple modification of the basic approach described above achieving the same time complexity of O(|G|·|Sub|).

It worth to note that recently Jain et al.[6] suggested an elegant algorithm extending the same time complexity to both linear and affine gap penalty functions.

Alignment of amino-acid sequences

Consider an assembly graph G with edges labeled by nucleotide sequences and an amino acid query sequence S_p. We will refer to a path P in G as an AA-path, if its label translates into a valid (i.e.without any stop codon) protein sequence (denoted by Translated(Label(P))). We define semi-global alignment of a sequence S_p to graph G as an AA-path P in G, such that the alignment cost between S_p and Translated(Label(P)) is minimal across all AA-paths in G^{Footnote 3}.

While considering fixed mismatch penalties is usually sufficient for aligning nucleotide sequences, most practical scoring schemes for amino acid sequence alignment use specialized substitution matrices (e.g. BLOSUM or PAM matrices) [20], which take into account the substitution rates between different pairs of amino acids over time as well as the relative frequencies of various amino acids.

To reconcile our cost-minimization formulation with the fact that typical substitution matrix M assumes better alignments to have higher cumulative scores, we will consider elements of matrix M^′=−M as match/mismatch score (by default M=BLOSUM90).

Again, though we only consider linear gap penalties with coefficient σ (in our experiments we used σ=5) here, affine gap penalties can also be implemented without significant increase of running time or memory footprint [19].

Alignment graph for amino acid sequence alignment

For the given query sequence, assembly graph, substitution matrix and gap penalty coefficient (σ), we introduce alignment graph, SG_p. Each vertex corresponds to a triplet \(< p, pos_{Sub_{p}}, fs>\), where p is position in the graph G, \(pos_{Sub_{p}}\) is the position in the query fragment sequence and fs is a frame state string (of length 0−2), encoding a partial codon sequence (empty string is denoted by ε). We further define the graph by the iterative procedure, initialized by introducing the vertices (p,0,ε), where p iterates through all positions in G. Then, until no more edge can be added, we introduce edges and required vertices following one of the rules below:

1.
\(< p, pos_{Sub_{p}}, fs> \rightarrow < p', pos_{Sub_{p}}, fs + G[p']>\) of zero weight if |fs|<2,
2.
\(< p, pos_{Sub_{p}}, fs> \rightarrow < p', pos_{Sub_{p}}, \varepsilon >\) of weight σ if |fs|=2,
3.
\(< p, pos_{Sub_{p}}, fs> \rightarrow < p', pos_{Sub_{p}} + 1, \varepsilon >\) of weight \(\phantom {\dot {i}\!}M'[Translate(fs + G[p']), Sub_{p}[pos_{Sub_{p}} + 1]]\), and |fs|=2,
4.
\(\phantom {\dot {i}\!}< p, pos_{Sub_{p}}, \varepsilon > \rightarrow < p, pos_{Sub_{p}} + 1, \varepsilon >\) of weight σ,
5.
\(\phantom {\dot {i}\!}< p_{end}, pos_{Sub_{p}}, fs> \rightarrow < p_{start}, pos_{Sub_{p}}, fs>\) of weight zero if p_end and p_start correspond to the same position in graph on consecutive edges of G (i.e. p_end=(e₁,|e₁|),p_start=(e₂,0) and end of edge e₁ is same vertex as the start of e₂),

where p is a position in G, p^′ is a position in G extending p (i.e. p=(e,i) and p^′=(e,i+1) for some edge e and index i<|e|), \(pos_{Sub_{p}}\) – is a position in Sub_p.

As before, one can find the semi-global alignment path of the sequence Sub_p (fixed on one or both sides by anchor alignments) by searching for minimal weight paths in SG_p.

In particular, it is easy to show that the semi-global alignment of the amino acid sequence Sub_p starting from position p_s in the graph G corresponds to the minimal weight path in SG_p(G,Sub_p) connecting vertex <p_s,0,ε> to any of the vertices {<p_t,|Sub_p|,0> }, where p_t iterates through all positions in G.

Note that since any reasonable matrix M^′ contains negative elements, the graph SG_p will have edges of negative weight, preventing direct application of Dijkstra algorithm and its variations to the search of such minimal weight paths.

While Algorithm 2 from [19] (slightly modified to account for the frame states) can be applied here, we use an alternative approach resulting in the same time complexity^{Footnote 4}.

We define a new graph \(SG^{\prime }_{p}(G, Sub_{p})\) increasing the weights of edges introduced by rules 3 and 4 by an arbitrary constant C, exceeding the maximum value in M. Since the resulting graph has only edges with non-negative weights, Dijkstra algorithm can be used for the shortest paths search in this graph.

At the same time it is easy to show that the minimum weight path between two arbitrary nodes of the form s=<x,0,ε> and e=<y,|Sub_p|+1,ε> in \(SG^{\prime }_{p}\) is also the minimum weight path between s and e in SG_p. Indeed, if we consider an arbitrary path P in SG_p between s and e of weight w_P, its weight after transformation (in \(SG^{\prime }_{p}\)) is w_p+C·|Sub_p|, because P contains exactly |Sub_p| edges with modified weights. Therefore, the weights of all paths between two vertices s and e will be increased by the same constant |Sub_p|·C.

Amino acid sequence alignment pipeline

Below we outline the four steps implemented in SPAligner for aligning an amino acid sequence S_p to an assembly graph G, highlighting the differences with the pipeline presented in “Alignment of long nucleotide sequences” section.

Anchors search. As before SPAligner first identifies anchor alignments between S_p and the pre-indexed six-frame translated edge labels. Current implementation relies on finding high-scoring local alignments between the “canonical” nucleotide representations of protein sequences, obtained by substituting every amino acid by its lexicographically minimal codon. The search is performed by BWA-MEM configured to prevent gapped alignments (gap_open penalty set to infinity) and we additionally check that the aligned ranges are consistently located with respect to the reading frame.

Anchor filtering. Unreliable and ambiguous anchors are discarded based on their size and query range. Anchors shorter than K are filtered out. Smaller lengths of typical protein sequences as compared to long read sequences allow us to omit the anchor chaining step in order to increase the accuracy of the resulting amino acid sequence alignments. It also eliminates the need to coordinate the reading frames of the chained anchors.

Alignment extension. SPAligher then uses the adjusted alignment graph approach (see “Alignment graph for amino acid sequence alignment”) to search for optimal alignments extending each of the remaining anchors. The procedure employs several heuristics and constraints, which can be found in Additional file 1 “Alignment extension heuristics and thresholds”.

Alignments post-processing. SPAligner removes alignment paths shorter than a certain fraction of the query coding sequence length (0.8 by default) and filters perfectly duplicating paths. Resulting paths are then translated and re-scored against the query sequence with Parasail library [21].

Results and discussion

Aligning long error-prone reads

We benchmarked SPAligner against the two sequence-to-graph aligners supporting long reads – vg v1.17 and GraphAligner v1.0.4.

The vg alignment pipeline [4] starts from searching for super maximal equal matches (SMEMs) with GCSA2 library [22]. After filtering and chaining of SMEMs, the SMEMs in a chain are linked by SIMD-accelerated banded dynamic programming. In order to use this method, vg first “unrolls” the cycles to transform the graph into a directed acyclic graph (DAG). It is worth noting that this step can produce large intermediate graphs and potentially lead to suboptimal alignments. While working with long sequences, vg splits them into overlapping “chunks” (default 256 bp with 32 bp overlap) and maps them separately, further combining them by the same method as for SMEMs chaining.

GraphAligner [23] is a novel tool for aligning nucleotide sequences onto general sequence graphs, which implements many innovative techniques. GraphAligner v1.0.4 uses MUMmer4 [24] to identify potential alignment seeds to restrict the search for the potential alignment paths. Each seed is then extended separately by improved bit-parallel algorithm for sequence-to-graph alignment under an edit distance (unit costs) model. Extended description of GraphAligner implementation approach and detailed comparison with SPAligner can be found in Additional file 1 “Sequence-to-graph alignment implementation insights”.

Besides the default configuration of GraphAligner we have also benchmarked its “try-all-seeds” mode (disabling the filtering of likely false positive seed alignments), recommended by the developers for bacterial datasets.

For benchmarking we considered PacBio and ONT reads (both real and simulated) for three different organisms: E. coli strain K12, S. cerevisiae strain S288C and C. elegans strain Bristol N2 (see Supplemental Text “Datasets availability” for details on the simulation and accession numbers). Assembly graphs were generated with SPAdes v3.12 [1] assembler from the appropriate short-read Illumina datasets (using “-k 21,33,55,77” and setting other parameters to defaults). For running the aligners we tried to follow the recommendations given by the developers (see Supplemental Text “Notes on running aligners” for details).

Results of benchmarking on real sequencing data are summarized in Table 1, while results of additional benchmarking on the simulated PacBio and ONT reads can be found in Supplementary Table S2. Additional analysis of dependency between reads alignment quality and their length can be found in Additional file 1 “Dependence of alignment quality on read length”.

Table 1 Summary statistics of aligning PacBio/ONT reads to short-read assembly graphs (constructed by SPAdes with kmer size equal to 77)

Full size table

Table columns present the following metrics:

Number of mapped reads (MR): Read is considered mapped if 80% of its length is covered by a single alignment path.
Alignment identity (%)(AI): Mean nucleotide identity ^{Footnote 5} across the longest continuous alignments of the mapped reads.
Wall-clock time (h:m)(T): Wall-clock time with 16 threads.
Memory (Gb)(M): Peak RAM usage with 16 threads.

Vg showed highest average alignment identity for all considered datasets, but at the same time resulted in very low fraction of mapped reads. The same tendency can be seen on simulated datasets (see Additional file 1).

For E. coli datasets SPAligner showed 2–4% advantage in number of mapped reads over GraphAligner, with no significant difference in mapping identity.

For S. cerevisiae SPAligner showed 3% advantage in number of mapped reads on PacBio dataset while GraphAligner was able to map 4% more reads from ONT dataset. In both cases advantage in number of mapped reads is accompanied by 1% reduction of average mapping identity, likely reflecting alignment of additional lower identity reads. Supplementary Table S3 shows that on the set of reads mapped by both SPAligner and GraphAligner (51% for S. cerevisiae PacBio dataset and 60% for S. cerevisiae ONT) mapping identity is nearly the same for both tools.

SPAligner shows a 2% advantage in the number of mapped reads on C. elegans ONT dataset with similar mapping identity, and on C. elegans PacBio dataset both tools show similar results.

Overall, our benchmarking suggests that SPAligner and GraphAligner produce similar alignments of long error-prone reads with SPAligner being faster and more memory-efficient than GraphAligner in “try-all-seeds” mode and slower and less memory-efficient than GraphAligner in default mode.

Aligning protein sequences

In contrast to other sequence-to-graph aligners available at the moment, SPAligner supports alignment of protein sequences onto assembly graphs. In this section we demonstrate how SPAligner can be efficiently used to identify regions of the assembly graph exhibiting high similarity to known prokaryotic protein sequences.

Contigs originating from metagenomic assemblies can be excessively fragmented due to various reasons, including inter-species repeats (conservative sequences, horizontally transferred genes, etc) and within-species microdiversity [25]. Contig fragmentation can lead to loss of important insights in modern metagenomic studies, which increasingly rely on annotation of assembled contigs for functional analysis of the microbial community members [26, 27]. Alignment of known protein sequences onto metagenomic assembly graph provides a complementary approach capable of identifying genes fragmented over several contigs.

The capacity of applying SPAligner in this context is demonstrated below in two experiments, considering alignments of:

coding sequences from entire UniProt (The UniProt Consortium, 2019) database onto an assembly graph of a well-defined synthetic metagenomic community;
a comprehensive set of known antibiotic resistance genes onto an assembly graph for a wastewater metagenomic dataset.

While alignment of a protein sequence with SPAligner might result in several alignment paths (initiated by different anchors), in this work we will only consider single best scoring alignment for each query.

Annotating genes in synthetic metagenomic community

Shakya et al. [28] performed sequencing of synthetic DNA mixture for 64 diverse bacterial and archaeal organisms (SRA acc. no. SRX200676). This dataset, containing 109 million 2×100 bp Illumina reads (with mean insert size of 206 bp) has been extensively used for benchmarking of tools for metagenomic analysis [29, 30]. Reference genome annotations available for all organisms forming the community contain a total of 196150 protein coding sequences, further referred as reference protein sequences.

We model the situation in which we aim to identify potential close homologs of known protein sequences within the metagenomic assembly graph. In particular, we searched for semi-global alignments of 350 thousand sequences comprising “reviewed” UniProt database of bacterial proteins [31]. In order to identify UniProt sequences that are expected to align to the assembly graph (assuming that the graph represents all necessary sequences), we aligned UniProt sequences to the set of reference protein sequences. We used BLASTp [32] with default settings in order to quickly find most similar sequences to the reference one. As a result, 26209 UniProt sequences were mapped to some reference protein with 90% amino acid identity threshold. We consider assembly graph, built by metaSPAdes v3.12 [29] with default settings, and use SPAligner to identify alignments of UniProt sequences at 90% amino-acid identity threshold. SPAligner was able to align 93% (24324 out of 26209) of UniProt sequences with matching reference proteins (see above), as well as 1299 extra sequences. Further analysis showed that the majority of falsely identified sequences are similar to substrings of some reference protein sequences.

Identification of antibiotic resistance genes in a hospital wastewater sample

In this section we demonstrate how SPAligner can be applied to identify genes of particular interest within real metagenomic dataset.

Spread of antimicrobial resistance is an escalating problem and a threat to public health. Metagenomic sequencing provides an efficient methodology for detection and tracking of the antibiotic resistance genes (ARGs) in the environment samples. Ng et al. [33] explored wastewater and urban surface water metagenomic datasets for the presence of various antimicrobial resistance (AMR) proteins, in particular beta-lactamases. To illustrate the ability of SPAligner to identify additional AMR gene families as compared to conventional approaches based on analysis of assembled contigs, we focus on a hospital wastewater discharge dataset, which, according to the original study, had the highest total coverage of beta-lactamase genes. The dataset consisting of 3.3 million 2×250 bp Illumina reads with mean insert size of 350 bp (SRA acc. no. SRR5997548) was assembled by metaSPAdes v3.12 with default settings.

First, we used state-of-the-art AMRFinder tool [34] to identify AMR genes in assembled contigs. AMRFinder performs the BLASTx [32] searches of the protein sequences from Bacterial Antimicrobial Resistance Reference Gene Database [34], containing protein sequences for over 800 gene families representing 34 classes of antimicrobials and disinfectants. Since we are aiming at the identification of the (almost) complete protein sequences, partial predictions covering no more than 75% of the corresponding database protein were discarded. All other parameters were left to defaults, including the 90% alignment identity threshold.

We then used SPAligner to identify best alignments (with at least 90% amino-acid identity threshold) for all 4810 ARG sequences from the same database onto the graph.

To check whether our sequence-to-graph alignments were able to capture additional AMR protein families as compared to the baseline contig analysis approach, protein sequences identified by AMRFinder and SPAligner were clustered by single-linkage clustering with a 90% similarity cut-off. Clusters where only SPAligner alignments are presented tend to be really far by identity from alignments found by AMRFinder and vice versa. Out of the resulting 89 clusters a single cluster consisted of only AMRFinder prediction. Further analysis showed that SPAligner also successfully identified corresponding protein sequence which spanned 89.7% of the query length, while AMRFinder prediction spans 90.7%. At the same time, 4 clusters contained only SPAligner’s alignments. tBLASTx [32] alignments of the corresponding database protein sequences onto the assembled contigs either covered less than 75% of the query or had low identity (<70%).

Figure 2 shows a region of the assembly graph surrounding one of the putative proteins predicted exclusively from SPAligner results. Corresponding alignment of the database entry for IMP-4 beta-lactamase covered the entire query sequence with >98% identity.

We further relate our analysis to the original study [33] with respect to the identified beta-lactamase families. Alignments were obtained for instances of seven out of nine beta-lactamase protein families from the original study (bla_KPC,bla_CTX−M,bla_SHV,bla_TEM,bla_IMP,bla_VIM and bla_OXA). While SPAligner was not able to produce alignments for two remaining gene families (bla_NDM and bla_CMY), in the original study their instances were estimated to be poorly covered compared to other families^{Footnote 6}.

Note that in [33] authors tried to identify multiple beta-lactamase sequences within the same protein family, which we are not aiming to do here, since minor differences within highly similar gene sequences are likely to be lost in the metagenomic assembly process. Such high-resolution analysis should be possible within the sequence-to-graph alignment framework via configuring the assembly graph construction stage to preserve necessary genomic variation. This promising direction is left for future research.

Future improvements

In particular, our results suggest that SPAligner is an efficient solution for mapping third generation sequencing data and can also facilitate the identification of known genes in complex metagenomic datasets.

Besides general improvements of the codebase, we are planning to migrate to more efficient libraries for searching anchor alignments [22, 24] and extension algorithms [6, 13].

An important direction with respect to the nucleotide sequence alignment is improving the alignment of reads structurally different from the genome, represented by the assembly graph [36]. This includes a more reliable strategies for ignoring poorly sequenced read extremities and producing split-read alignments.

With respect to the amino acid sequence alignment our future goals include improving the sensitivity of the anchor alignment to allow for searching of the more diverged protein sequences, as well as development of strategies for identification of multiple closely-related gene sequences within the custom metagenomic assembly graphs with preserved variation.

Conclusions

In this work we present SPAligner tool for aligning long diverged molecular sequences to assembly graphs and demonstrated its effectiveness for aligning both reads produced by third-generation sequencing technologies and protein sequences.

SPAligner was benchmarked against the two sequence-to-graph aligners supporting long reads – vg and GraphAligner, and showed highly competitive results on various Pacbio and Oxford Nanopore datasets. At the same time being the only tool that produce alignment of amino acid sequences, SPAligner was successfully used to identify gene sequences, which were not fully recovered by metagenomic assembly.

Availability and requirements

Project name: SPAlignerProject home page: http://cab.spbu.ru/software/spalignerOperating system(s): Platform independentProgramming language: C++Other requirements: g++ (version 5.3.1 or higher), cmake (version 2.8.12 or higher), zlib, libbz2License: GNU GPLv2Any restrictions to use by non-academics: None

Availability of data and materials

Source code and manual of SPAligner tool is available on https://figshare.com/s/3d2badbf1a9afb167023. The datasets analysed during the current study are available in https://figshare.com/s/b0dc3f715e0224e3e962.

Notes

While common definition of assembly graph associates sequences with the vertices, assembly graphs produced by de Bruijn graph based assemblers can be readily represented as edge-labeled graphs, considered in this work. We also note that our methods can be straightforwardly re-implemented in application to other assembly graphs.
We use “weights” rather than “lengths” notation here to simplify introduction of negative weights in later sections.
Note that here we are not interested in paths spelling the “frame-shifted” version of the protein coding sequence (in particular, |Label(P)| is always divided by 3). Here we implicitly rely on the assumption that the assembly graph was constructed from highly accurate reads and thus rare “frame shift” events need not be considered.
Though method from [19] achieves better worst case space complexity, our approach is more straightforward.
Identity of aligning sequence S onto path P is defined as 1−ED(S,Label(P))/|S|.
Total abundance of the identified families exceeds 150X, while the total coverage of NDM and CMY families was estimated as 20X. The exact formula for abundance calculation is presented in the original study [33].

Abbreviations

ED:: Edit distance
ARG:: Antibiotic resistance gene
AMR:: Antimicrobial resistance

References

Nurk S, Bankevich A, Antipov D, Gurevich A, Korobeynikov A, Lapidus A, et al. Assembling Genomes and Mini-metagenomes from Highly Chimeric Reads In In: Deng M, Jiang R, Sun F, Zhang X, editors. Research in Computational Molecular Biology, vol. 7821. Berlin Heidelberg: Springer. p. 158–170. Available from: http://link.springer.com/10.1007/978-3-642-37195-0_13.
Chikhi R, Rizk G. Space-Efficient and Exact de Bruijn Graph Representation Based on a Bloom Filter. In: WABI. vol. 7534 of Lecture Notes in Computer Science. Springer. p. 236–248.
Li D, Liu CM, Luo R, Sadakane K, Lam TW. MEGAHIT: an ultra-fast single-node solution for large and complex metagenomics assembly via succinct de Bruijn graph. Bioinformatics; 31(10):1674–1676. Available from: http://dx.doi.org/10.1093/bioinformatics/btv033.
Garrison E, Sirén J, Novak AM, Hickey G, Eizenga JM, Dawson ET, et al. Variation graph toolkit improves read mapping by representing genetic variation in the reference. Nat Biotechnol; 36(875). Available from: http://dx.doi.org/10.1038/nbt.4227.
Heydari M, Miclotte G, Van de Peer Y, Fostier J. BrownieAligner: accurate alignment of Illumina sequencing data to de Bruijn graphs. BMC Bioinformatics; 19(1). https://doi.org/10.1186/s12859-018-2319-7.
Jain C, Zhang H, Gao Y, Aluru S. On the Complexity of Sequence to Graph Alignment. Available from: http://biorxiv.org/lookup/doi/10.1101/522912.
Kavya VNS, Tayal K, Srinivasan R, Sivadasan N. Sequence Alignment on Directed Graphs. https://doi.org/10.1089/cmb.2017.0264.
Limasset A, Cazaux B, Rivals E, Peterlongo P. Read mapping on de Bruijn graphs. BMC Bioinformatics; 17(1). http://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-016-1103-9.
Antipov D, Korobeynikov A, McLean JS, Pevzner PA. hybridSPAdes: an algorithm for hybrid assembly of short and long reads. Bioinformatics. 2016; 32(7):1009–15. doi:10.1093/bioinformatics/btv688.
Article CAS PubMed Google Scholar
Wick RR, Judd LM, Gorrie CL, Holt KE. Unicycler: Resolving bacterial genome assemblies from short and long sequencing reads. PLoS Comput Biol. 2017; 13(6):e1005595. https://doi.org/10.1371/journal.pcbi.1005595.
Article PubMed PubMed Central Google Scholar
Salmela L, Rivals E. LoRDEC: accurate and efficient long read error correction. Bioinformatics. 2014; 30(24):3506–14. doi:10.1093/bioinformatics/btu538.
Article CAS PubMed PubMed Central Google Scholar
Garg S, Rautiainen M, Novak AM, Garrison E, Durbin R, Marschall T. A graph-based approach to diploid genome assembly. Bioinformatics. 2018; 34(13):i105–14. doi:10.1093/bioinformatics/bty279.
Article CAS PubMed PubMed Central Google Scholar
Rautiainen M, Mäkinen V, Marschall T. Bit-parallel sequence-to-graph alignment. Bioinformatics. 2019. https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btz162/5372677.
Li H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. 2013. https://arxiv.org/abs/1303.3997.
Amir A, Lewenstein M, Lewenstein N. Pattern Matching in Hypertext. J Algorithms. 2000; 35(1):82–99. https://linkinghub.elsevier.com/retrieve/pii/S0196677499910635.
Article Google Scholar
Myers EW. AnO(ND) difference algorithm and its variations. 1986; 1(1):251–66. http://link.springer.com/10.1007/BF01840446.
Gotoh O. An improved algorithm for matching biological sequences. J Mol Biol. 1982; 162(3):705–8. https://linkinghub.elsevier.com/retrieve/pii/0022283682903989.
Article CAS PubMed Google Scholar
Navarro G. A guided tour to approximate string matching. ACM Comput Surv (CSUR). 2001; 33(1):31–88. http://portal.acm.org/citation.cfm?doid=375360.375365.
Article Google Scholar
Rautiainen M, Marschall T. Aligning sequences to general graphs in (+) time. http://biorxiv.org/lookup/doi/10.1101/216127.
Pearson WR. Selecting the Right Similarity-Scoring Matrix: Selecting the Right Similarity-Scoring Matrix In In: Bateman A, Pearson WR, Stein LD, Stormo GD, Yates JR, editors. Current Protocols in Bioinformatics. Wiley. p. 3.5.1–9. http://doi.wiley.com/10.1002/0471250953.bi0305s43.
Daily J. Parasail: SIMD C library for global, semi-global, and local pairwise sequence alignments. BMC Bioinformatics. 2016; 17(1):81. https://doi.org/10.1186/s12859-016-0930-z.
Article PubMed PubMed Central Google Scholar
Sirén J. Indexing Variation Graphs:13–27. http://arxiv.org/abs/1604.06605.
Rautiainen M, Marschall T. GraphAligner: Rapid and Versatile Sequence-to-Graph Alignment. http://biorxiv.org/lookup/doi/10.1101/810812.
Marçais G, Delcher AL, Phillippy AM, Coston R, Salzberg SL, Zimin A. MUMmer4: A fast and versatile genome alignment system. 2018; 14(1):e1005944. https://dx.plos.org/10.1371/journal.pcbi.1005944.
Nagarajan N, Pop M. Sequence assembly demystified. Nat Rev Genet. 2013; 14(3):157–67. http://www.nature.com/articles/nrg3367.
Article CAS PubMed Google Scholar
Barnum TP, Figueroa IA, Carlström CI, Lucas LN, Engelbrektson AL, Coates JD. Genome-resolved metagenomics identifies genetic mobility, metabolic interactions, and unexpected diversity in perchlorate-reducing communities; 12(6):1568–81. http://www.nature.com/articles/s41396-018-0081-5.
Sharon I, Kertesz M, Hug LA, Pushkarev D, Blauwkamp TA, Castelle CJ, et al. Accurate, multi-kb reads resolve complex populations and detect rare microorganisms. Genome Res. 2015; 25(4):534–43. http://genome.cshlp.org/lookup/doi/10.1101/gr.183012.114.
Article CAS PubMed PubMed Central Google Scholar
Shakya M, Quince C, Campbell JH, Yang ZK, Schadt CW, Podar M. Comparative metagenomic and rRNA microbial diversity characterization using archaeal and bacterial synthetic communities: Metagenomic and rRNA diversity characterization. Environ Microbiol. 2013; 15(6):1882–99. http://doi.wiley.com/10.1111/1462-2920.12086.
Article CAS PubMed PubMed Central Google Scholar
Nurk S, Meleshko D, Korobeynikov A, Pevzner PA. metaSPAdes: a new versatile metagenomic assembler. Genome Res. 2017; 27(5):824–34. http://genome.cshlp.org/lookup/doi/10.1101/gr.213959.116.
Article CAS PubMed PubMed Central Google Scholar
Awad S, Irber L, Brown CT. Evaluating Metagenome Assembly on a Simple Defined Community with Many Strain Variants. http://biorxiv.org/lookup/doi/10.1101/155358.
Bairoch A. The SWISS-PROT protein sequence database and its supplement TrEMBL in. Nucleic Acids Res. 2000; 28(1):45–48. https://academic.oup.com/nar/article-lookup/doi/10.1093/nar/28.1.45.
Article CAS PubMed PubMed Central Google Scholar
Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. 1990; 215(3):403–410. https://linkinghub.elsevier.com/retrieve/pii/S0022283605803602.
Ng C, Tay M, Tan B, Le TH, Haller L, Chen H, et al. Characterization of Metagenomes in Urban Aquatic Compartments Reveals High Prevalence of Clinically Relevant Antibiotic Resistance Genes in Wastewaters. Front Microbiol. 2017; 8. http://journal.frontiersin.org/article/10.3389/fmicb.2017.02200/full.
Feldgarden M, Brover V, Haft DH, Prasad AB, Slotta DJ, Tolstoy I, et al. Using the NCBI AMRFinder Tool to Determine Antimicrobial Resistance Genotype-Phenotype Correlations Within a Collection of NARMS Isolates. http://biorxiv.org/lookup/doi/10.1101/550707.
Wick RR, Schultz MB, Zobel J, Holt KE. Bandage: interactive visualization of de novo genome assemblies: Fig. 1. Bioinformatics. 2015; 31(20):3350–3352. https://academic.oup.com/bioinformatics/article-lookup/doi/10.1093/bioinformatics/btv383.
Article CAS PubMed PubMed Central Google Scholar
Sedlazeck FJ, Rescheneder P, Smolka M, Fang H, Nattestad M, von Haeseler A, et al. Accurate detection of complex structural variations using single-molecule sequencing. Nat Methods. 2018; 15(6):461–68. https://doi.org/10.1038/s41592-018-0001-7.
Article CAS PubMed PubMed Central Google Scholar

Download references

Acknowledgments

We wish to thank Alexander Shlemov for great ideas and all the members of Center for Algorithmic Biotechnology for the helpful discussion.

About this supplement

This article has been published as part of [BMC Bioinformatics, Volume 21 Supplement 12, 2020: Selected abstracts and papers of Bioinformatics: from Algorithms to Applications 2019 conference. The full contents of the supplement are available at https://bmcbioinformatics.biomedcentral.com/articles/supplements/volume-21-supplement-12].

Funding

Publication of this supplement is funded by the Russian Science Foundation (grant 19-14-00172). Research was carried out in part by computational resources provided by Resource Center “Computer Center of SPbU”. The authors are grateful to Saint Petersburg State University for the overall support of this work (project id: 51555639). Funders had no role in the design of the study and collection, analysis, and interpretation of data and in writing the manuscript.

Author information

Authors and Affiliations

Center for Algorithmic Biotechnology, Institute of Translational Biomedicine, St. Petersburg State University, St. Petersburg, Russia
Tatiana Dvorkina, Dmitry Antipov & Anton Korobeynikov
Department of Statistical Modelling, St. Petersburg State University, St. Petersburg, Russia
Anton Korobeynikov
Genome Informatics Section, NHGRI, National Institutes of Health, Bethesda MD, USA
Sergey Nurk

Authors

Tatiana Dvorkina
View author publications
You can also search for this author in PubMed Google Scholar
Dmitry Antipov
View author publications
You can also search for this author in PubMed Google Scholar
Anton Korobeynikov
View author publications
You can also search for this author in PubMed Google Scholar
Sergey Nurk
View author publications
You can also search for this author in PubMed Google Scholar

Contributions

SN suggested the idea. TD is the main author of SPAligner implementation. DA took part in implementation of anchor chaining part. AK integrated BWA-MEM module. TD conducted SPAligner comparison with other tools and interpreted the data. SN, DA and AK revised the tool algorithms. SN, TD and DA prepared the manuscript. All authors read and approved the final version of the manuscript.

Corresponding author

Correspondence to Tatiana Dvorkina.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Additional file 1

Supplementary file contains additional algorithms and datasets description (bmc_supplement.pdf).

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

Reprints and permissions

About this article

Cite this article

Dvorkina, T., Antipov, D., Korobeynikov, A. et al. SPAligner: alignment of long diverged molecular sequences to assembly graphs. BMC Bioinformatics 21 (Suppl 12), 306 (2020). https://doi.org/10.1186/s12859-020-03590-7

Download citation

Received: 02 June 2020
Accepted: 08 June 2020
Published: 24 July 2020
DOI: https://doi.org/10.1186/s12859-020-03590-7

Selected abstracts and papers of Bioinformatics: from Algorithms to Applications 2019 conference

SPAligner: alignment of long diverged molecular sequences to assembly graphs

Abstract

Background

Results

Conclusions

Background

Implementation

Alignment of long nucleotide sequences

Sequence to graph alignment via alignment graphs

Alignment of amino-acid sequences

Alignment graph for amino acid sequence alignment

Amino acid sequence alignment pipeline

Results and discussion

Aligning long error-prone reads

Aligning protein sequences

Annotating genes in synthetic metagenomic community

Identification of antibiotic resistance genes in a hospital wastewater sample

Future improvements

Conclusions

Availability and requirements

Availability of data and materials

Notes

Abbreviations

References

Acknowledgments

About this supplement

Funding

Author information

Authors and Affiliations

Contributions

Corresponding author

Ethics declarations

Ethics approval and consent to participate

Consent for publication

Competing interests

Additional information

Publisher’s Note

Supplementary information

Additional file 1

Rights and permissions

About this article

Cite this article

Share this article

Keywords

BMC Bioinformatics

Contact us