The SPAdes assembly pipeline [14] consists of four major steps: (i) de Bruijn graph construction from short reads, (ii) graph simplification, which removes erroneous edges from the graph and produces a so-called assembly graph, (iii) alignment of paired reads to the assembly graph, and (iv) repeat resolution and scaffolding in the exSPAnder module [15, 16].
HybridSPAdes [12] additionally maps long error-prone reads using the BWA-MEM algorithm [17] and exploits these alignments during the repeat resolution stage. Since hybridSPAdes is designed for genomic data, it relies heavily on unique (non-repetitive) edges in the assembly graph, which are selected using coverage and length criteria. An edge is considered to be unique if it has coverage close to the average coverage of the dataset and its length exceeds a certain threshold [12]. However, such a heuristic is not applicable to transcriptomic data, where the majority of edges are short and coverage is non-uniform.
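The uniqueness criterion can be illustrated with a minimal sketch. The names `GraphEdge` and `is_unique`, and the values of `min_length` and `rel_tol`, are hypothetical and chosen only for illustration; the actual thresholds used by hybridSPAdes are described in [12] and are not reproduced here.

```python
from dataclasses import dataclass


@dataclass
class GraphEdge:
    length: int      # edge length in bp
    coverage: float  # mean coverage of the edge


def is_unique(edge: GraphEdge, avg_coverage: float,
              min_length: int = 2000, rel_tol: float = 0.5) -> bool:
    """Return True if the edge looks non-repetitive under a genomic model.

    min_length and rel_tol are illustrative assumptions, not SPAdes defaults.
    """
    coverage_ok = abs(edge.coverage - avg_coverage) <= rel_tol * avg_coverage
    return coverage_ok and edge.length > min_length


# A long edge with near-average coverage passes the test.
genomic_edge = GraphEdge(length=5000, coverage=48.0)
# A short, highly expressed transcript edge fails both criteria.
transcript_edge = GraphEdge(length=300, coverage=900.0)

print(is_unique(genomic_edge, avg_coverage=50.0))     # True
print(is_unique(transcript_edge, avg_coverage=50.0))  # False
```

Under such a rule, short exons and highly or lowly expressed transcripts would rarely qualify as unique, which is why the criterion does not carry over to RNA-Seq data.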
In rnaSPAdes, graph simplification is modified specifically for RNA-Seq data and the repeat resolution step is substituted with an isoform reconstruction procedure [11]. However, the current version of rnaSPAdes is capable of using only short paired-end and single reads. To extend its functionality for hybrid transcriptome assembly, we combine it with procedures implemented in hybridSPAdes (see Fig. 1). While the read mapping step for transcriptomic data remains unmodified (with the exception of some alignment parameters), alterations were introduced to the isoform reconstruction procedure.
Similarly to genomic SPAdes, isoform reconstruction in rnaSPAdes is based on the concept of path extension implemented in the exSPAnder module. During path prolongation exSPAnder uses all available information simultaneously. In the case of hybrid assembly, at every step exSPAnder first tries to find the correct extension edge using paired-end reads, and applies long-read path extension only if paired-end reads do not help (see [12, 15] for details).
Since alternatively spliced isoforms may form very similar paths, e.g. differing only by a single alternative exon, the key modification introduced to the path-extension procedure compared to the genomic pipeline is the possibility to select more than a single extension edge at each step. The same idea can be used for exploiting long-read alignments during the isoform reconstruction stage.
To extend a path P = (p1, …, pn), the algorithm considers all long-read paths that match P. A path R obtained from a long-read alignment is defined as matching P if there exists a suffix of P that is a prefix of R, or if P is contained inside R (Fig. 2a). Formally, either (i) R = (pi, …, pn, x1, …, xk) with i ≥ 1, or (ii) R = (r1, …, rl, p1, …, pn, x1, …, xk), where r1, …, rl and x1, …, xk are arbitrary edges in the graph. Further, from the set of all matching long-read paths the algorithm selects only those for which the longest common subpath with P (i) is at least Lmin long and (ii) contains at least Nmin edges (the default parameters are Lmin = 200 bp and Nmin = 2). The final set of matching long-read paths is denoted as RP. Then, among the set of all possible extension edges {e1, …, em}, the algorithm selects all ei such that at least one path from RP matches (p1, …, pn, ei) (Fig. 2b). Using only paths from RP instead of all matching long-read paths prevents the algorithm from selecting all possible extensions for path P.
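The selection of extension edges can be sketched as follows. This is an illustration under stated assumptions rather than the exSPAnder implementation: paths are lists of edge identifiers, the function names and the `edge_len` dictionary are hypothetical, the length of the common subpath is approximated by the sum of its edge lengths, and "matches (p1, …, pn, ei)" is read as "the alignment of R that ends at pn continues with ei".

```python
from typing import Dict, Iterator, Sequence, Set, Tuple

Path = Sequence[str]  # a path is a sequence of edge identifiers


def alignments(P: Path, R: Path) -> Iterator[Tuple[int, int]]:
    """Yield (m, t) such that the last m edges of P equal R[t+1-m : t+1].

    t + 1 <  len(P): a proper suffix of P is a prefix of R (case i).
    t + 1 >= len(P): all of P lies inside R (case ii, or case i with i = 1).
    """
    n = len(P)
    for t in range(len(R)):
        m = min(t + 1, n)
        if list(R[t + 1 - m:t + 1]) == list(P[n - m:]):
            yield m, t


def select_extensions(P: Path, long_read_paths: Sequence[Path],
                      candidates: Set[str], edge_len: Dict[str, int],
                      l_min: int = 200, n_min: int = 2) -> Set[str]:
    """Return candidate edges supported by at least one filtered long-read path."""
    selected: Set[str] = set()
    n = len(P)
    for R in long_read_paths:
        for m, t in alignments(P, R):
            common = P[n - m:]
            # Filter that defines RP: at least n_min edges and l_min bp in common.
            if m < n_min or sum(edge_len[e] for e in common) < l_min:
                continue
            # The edge following the overlap in R is the supported extension.
            if t + 1 < len(R) and R[t + 1] in candidates:
                selected.add(R[t + 1])
    return selected


# Toy data (hypothetical edge names and lengths in bp).
edge_len = {"a": 150, "b": 120, "c": 90, "d": 300, "e": 50, "f": 80, "x": 60}
P = ["a", "b", "c"]
reads = [
    ["b", "c", "d"],            # case (i): suffix (b, c) of P, then d
    ["x", "a", "b", "c", "e"],  # case (ii): P inside R, then e
    ["c", "f"],                 # overlap of one edge only: filtered out
]
print(sorted(select_extensions(P, reads, {"d", "e", "f"}, edge_len)))  # ['d', 'e']
```

In this toy example both d and e are selected, so the path is extended along two alternatives, while f is rejected because its only supporting read shares a single 90 bp edge with P.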
Paths in the graph are iteratively extended using paired-end and long reads until every edge is included in at least one path. Finally, to exploit reads capturing full-length transcripts, rnaSPAdes aligns them to the graph and produces FL paths, which are added directly to the set of resulting paths. Identical paths and exact subpaths are removed to avoid duplicates, and the resulting set of paths is output in FASTA format.
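The final deduplication step can be sketched as below. The function names are hypothetical, paths are again sequences of edge identifiers, and strand handling (reverse-complement paths) is deliberately ignored; the quadratic comparison is chosen for clarity, not efficiency.

```python
from typing import List, Sequence, Tuple


def is_subpath(short: Sequence[str], long: Sequence[str]) -> bool:
    """Return True if `short` occurs as a contiguous run of edges in `long`."""
    n, m = len(short), len(long)
    return any(tuple(long[i:i + n]) == tuple(short) for i in range(m - n + 1))


def remove_redundant(paths: Sequence[Sequence[str]]) -> List[Tuple[str, ...]]:
    """Drop identical paths and paths that are exact subpaths of a kept path."""
    # The set removes identical paths; longest-first order guarantees that any
    # containing path is examined before the paths it contains.
    unique = sorted(set(map(tuple, paths)), key=lambda p: (-len(p), p))
    kept: List[Tuple[str, ...]] = []
    for p in unique:
        if not any(is_subpath(p, q) for q in kept):
            kept.append(p)
    return kept


paths = [["a", "b", "c"], ["b", "c"], ["a", "b", "c"], ["c", "d"], ["a", "c"]]
print(remove_redundant(paths))
# [('a', 'b', 'c'), ('a', 'c'), ('c', 'd')]
```

Note that (a, c) is kept even though both of its edges occur in (a, b, c): it skips edge b and therefore represents a different isoform rather than an exact subpath.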