Introducing difference recurrence relations for faster semi-global alignment of long sequences

Background The read length of single-molecule DNA sequencers is reaching 1 Mb. Popular alignment software tools widely used for analyzing such long reads often take advantage of single-instruction multiple-data (SIMD) operations to accelerate calculation of dynamic programming (DP) matrices in the Smith–Waterman–Gotoh (SWG) algorithm with a fixed alignment start position at the origin. Nonetheless, 16-bit or 32-bit integers are necessary for storing the values in a DP matrix when sequences to be aligned are long; this situation hampers the use of the full SIMD width of modern processors. Results We proposed a faster semi-global alignment algorithm, “difference recurrence relations,” that runs more rapidly than the state-of-the-art algorithm by a factor of 2.1. Instead of calculating and storing all the values in a DP matrix directly, our algorithm computes and stores mainly the differences between the values of adjacent cells in the matrix. Although the SWG algorithm and our algorithm can output exactly the same result, our algorithm mainly involves 8-bit integer operations, enabling us to exploit the full width of SIMD operations (e.g., 32) on modern processors. We also developed a library, libgaba, so that developers can easily integrate our algorithm into alignment programs. Conclusions Our novel algorithm and optimized library implementation will facilitate accelerating nucleotide long-read analysis algorithms that use pairwise alignment stages. The library is implemented in the C programming language and available at https://github.com/ocxtal/libgaba. Electronic supplementary material The online version of this article (10.1186/s12859-018-2014-8) contains supplementary material, which is available to authorized users.

(4) The initial conditions (where i = 0 or j = 0) are as follows:

S1.1 Derivation of the difference recurrence relations
Difference values ∆H [i, j] are defined for i ≥ 1, and ∆V [i, j] for j ≥ 1. ∆E [i, j] and ∆F [i, j] are defined across the cells in the whole DP matrices (i ≥ 0 and j ≥ 0).
S1.1.1 ∆H and ∆V First, we derive the update formulae of ∆H by substituting the right-hand sides of the update formula of S (Eq 4) into S [i, j] in the definition of ∆H (Eq 10).
A is defined the same as in the main text: The update formula for ∆V is derived similarly: The update formula for ∆F is obtained in a similar fashion:

S1.1.3 Initial conditions
These are defined across the cells in the left edge column i = 0 and the top edge row j = 0. From Equation 4, the initial conditions for ∆H are derived as follows: For the initial conditions for ∆V : The initial conditions for ∆E and ∆F are defined as shown below. The differences from the − inf value are clipped to −Go H and −Go V , which are the minimum values of ∆E and ∆F , as we prove in Section S1.2.3. This modification does not cause a gap penalization error at the edges as pointed out by Flouri et al. (2015) because this setting guarantees for the first updates of ∆E (where i = 1) and ∆F (j = 1), respectively.

S1.2 Proof of the bounding formulae
We use the following lemma in the proof: Lemma 1 If p = max{q, r}, then p ≥ q and p ≥ r always hold.
A similar process is applicable to ∆F : A new variable, B H [i, j], is introduced to replace the ternary minimum block in the formula. The variable is evaluated by means of Lemma 1: Thus, variable ∆E[i, j] is bounded from above: Similar derivation is applicable to ∆F : Thus: Similar reasoning applies to ∆V :

S1.2.4 Upper bounds of ∆H and ∆V
We first postulate the following lemma: Proof: The update formula for S is defined as in Equation 4 for i ≥ 1 and j ≥ 1. Then the following transformation logically follows: In the last transformation, we used Lemma 1. Hence, Similarly, for the vertical: Using Lemma 2, we next prove the following lemma, which postulates that the diagonal difference is always bounded by the maximum value in the substitution matrix: Proof: Let us assume that the following inequality holds for any 1 ≤ s ≤ i and 1 ≤ t ≤ j except for the s = i and t = j pair: Then, we evaluate the upper bound of the S[i, j] value using the following transformation and three resulting terms (i)-(iii): Case (i): When the first term (i) reaches the maximum of the ternary maximum block, the equation below holds for S[i, j] and S[i − 1, j − 1]: Thus, Equation 35 is also true at (i, j) in this case.
Case (ii): When the second term (ii) reaches the maximum, the case is further subdivided into the following two subcases (a) and (b): The first term (a), where k = i, is evaluated as follows by means of Equation 35, Lemma 2(horizontal), and the maximum of the initial conditions for ∆V (Eq 20): The second term (b), where 1 ≤ k ≤ i − 1, is evaluated as follows: Thus, Equation 35 also holds at (i, j) in this case: Case (iii): Similar to Case (ii), with Lemma 2 (vertical).

S1.2.5
Next, we evaluate another antidiagonal interlayer difference (i.e., the difference between a pair of cells at the same coordinates in the DP matrices) to state another lemma: Proof: Case (i): The first term (i) is evaluated from above by means of the maximum value of the initial conditions for ∆V and Lemma 2 as shown below: Case (ii): The second term (ii) is evaluated using Lemma 2 and 3: Hence, The proof is similar for F :

S1.3 Derivation of the offsetted difference recurrence relation
The definitions of the difference DP matrices with an offset and of the substitution matrix with an offset are the same as in the main text: A G [i, j] is defined as in the main text: Similarly, for ∆V G [i, j]: Likewise, for ∆F G [i, j]: S1.4 Bounding formulae for difference DP matrices with an offset S1.4.1 ∆H G and ∆V G The bounds of ∆H G instantly follow from Inequality 30 and 48 after addition of the gap penalty offsets: Similarly, for ∆V G , from Inequality 31 and 49, we get S1.4.2 ∆E G and ∆F G From Lemmas 2 and 4, the upper bound of ∆E G is derived in the following way: The lower bound is instantly obtained from Inequality 23 via addition of the gap penalty offsets. The complete bounding formulae are shown below: In a similar fashion for ∆F G :

S2 Library design and implementation details
The library is implemented in the pure C programming language in an object-oriented manner. Given that the C language does not explicitly support classes or object-specialized functions (class methods), we designed APIs to take a pointer to an object instance as the first argument to treat a C function call as a class method call. The return object and other reference arguments are also handled by pointers, and thus we consider an object instance and the pointer to an instance equivalent in the context of argument passing in the description below.

S2.1 Target architectures
The library implies 64-bit little-endian architectures with 128-or 256-bit-wide SIMD instruction and unaligned load/store capability. The possible targets are x86 64 with SSE4.1 or AVX2 (Intel Corporation (2016) (2017)). The library currently supports only the x86 64 architecture, whereas the architecturedependent operations are separated from the algorithm implementation into headers, which are specified to provide several abstract vector types (e.g., v32i8 t, v16i8 t, and v2i64 t), operations on them, and several bit manipulation operations like popcnt and trailing zero count. We provided two variants, SSE4.1 and AVX2, in the current implementation.

S2.2 API design
The library is supposed to be a component of seed-and-extend-style alignment algorithms, that is, the library supports only the semi-global extension alignment (not local or global alignment). We also regard the library as thread-safe, with a global immutable configuration context and the thread-local DP matrix context. The global context is initialized with a set of substitution matrix, gap penalties, and several other parameters like the X-drop threshold. The local context is generated from the global object, inheriting the configuration and initializing its own DP matrix and memory arena. The memory management in the local context is not designed to be interthread-portable; thus, the users must not pass any derived object of a local context (an object that is returned from a function that takes a local context as the first argument) to another local context. The following code snippet shows the signature of the two context initialization functions: global and local. 1 The two boundary arguments, alim and blim, are added to inform the library at the tail address of the user space, which is utilized to index reverse-complemented sequences. A sequence pointer that leads to an address larger than the boundary is treated as a "phantom sequence," and the reverse-complemented one at the mirrored address is used. This behavior enables library users to save memory for sequences, where only forward ones are kept in memory and reverse-complemented ones are distinguished by mirrored pointers.
/* global context i n i t i a l i z a t i o n f u n c t i o n */ gaba_t * gaba_init ( gaba_params_t const * params ); /* local context i n i t i a l i z a t i o n f u n c t i o n */ gaba_dp_t * gaba_dp_init ( gaba_t const * global_context , uint8_t const * alim , uint8_t const * blim ); The alignment function takes three arguments-"tail object" of the previous band and reference side and query side sequences-and tries to extend alignment after the tail object with two input sequences. The function returns a new tail object with at least one of the following three states: X-drop termination, reference-side sequence depletion, or query side sequence depletion. This behavior enables us to handle the input sequence as a linear concatenation (or list) of subsequences. This feature is introduced to make the API compatible with circular or graphically structured genomic sequences like string graphs. The fill root function is provided to start the banded alignment, which internally creates an "empty" tail object and pass it to the normal matrix fill-in function. The following code snippet is a simple linear-to-linear alignment calculation with a 32-base-long "margin" sequence, which is generally an array of zeros, to ensure that the ends of the input sequences are covered by the band, where r and q are pointers to the reference side and the query side sequence segments, and m is a pointer to the margin, whereas rsp and qsp are respectively start positions in the reference and query. Note that each subsequence is distinguished by a "sequence ID," which is a 32-bit unsigned number uniquely assigned to the subsequences.

G A B A _ S T A T U S _ T E R M ; do { if (f -> status & G A B A _ S T A T U S _ U P D A T E _ A ) { rp = & msec ; } if (f -> status & G A B A _ S T A T U S _ U P D A T E _ B ) { qp = & msec ; } flag |= f -> status & ( G A B A _ S T A T U S _ U P D A T E _ A | G A B A _ S T A T U S _ U P D A T E _ B )
; f = gaba_dp_fill ( dp , f , rp , qp ); m = (f -> max > m -> max )? f : m ; } while (!( flag & f -> status )); /* fill -in stage is done , m holds block with the maximum */ The tail object also keeps information on the maximum-scoring cell in the band. The search max function returns a set of reference side and query side sequence IDs and local positions within the subsequences. ga b a_ po s _p ai r_ t g a b a _ d p _ s e a r c h _ m a x ( gaba_dp_t * local_context , gaba_fill_t const * section ); The traceback function is designed to handle seed-and-extend-style alignment efficiently accepting two tail objects: forward and reverse. The resulting paths are concatenated at their root in opposite directions to generate the complete (full-length) alignment path. It is also possible to insert short matches between the two roots as the seed sequence of the alignment. Passing tail objects to the function results in an alignment object, which contains the score of the alignment, alignment path, and a list of the corresponding sections and breakpoint coordinates for them. g a b a _ a l i g n m e n t _ t * gaba_dp_trace ( gaba_dp_t * local_context , gaba_fill_t const * fw_tail , gaba_fill_t const * rv_tail , g a b a _ t r a c e _ p a r a m s _ t const * params );

S3 Results for SSE 4.1 variants
Our libgaba and the other SIMD implementations (non-diff and diff-raw) support both SSE4.1 128-bit-wide and AVX2 256bit-wide vectorizations. We compared the performance of the two vectorization variants on several combinations of CPU microarchitectures and compilers. The detailed configurations of the four machines are listed in Table S1, and the results are shown in Table S2. We tested two additional variants in instruction encoding, REX-and VEX-prefix encoded, for the SSE4.1 ones to investigate the effect of the instruction encoding on performance.
The results showed that the libgaba implementation was generally as fast as or slightly slower than editdist in all the other tested environments when the AVX2 instruction was enabled, regardless of the compiler and its version, indicating that the design of the algorithm and data structures and tuning applied to the library were generally effective for x86 64 processors. It is also noteworthy that the acceleration ratio trends were roughly consistent with the results on the Skylake system, which are presented in the main text. The systems are distinguished by their CPU microarchitectures and the other components (DRAM, OS, and available compilers). The Haswell system is a single node of the Shirokane3 cluster at the Human Genome Center, the University of Tokyo. All the others are personal. The benchmark setting and the definitions of the Fill, Trace, Conv, and Total columns are the same as in the main text. The detailed specifications of the systems are listed in Table S1. The boldfaced numbers show the fastest variant for each pair system-stage (column). The REX-encoded binaries were generated using architecture flag -msse4.2 to enable both SSE4.1 instructions and the popcnt instruction (included in SSE 4.2). The VEX-encoded binaries were generated with the -mavx flag, where SSE 4.1 instructions were used in the explicit vectorizations and several AVX instructions (e.g., vmovaps) were employed in the automatic loop vectorization and some libc functions like memset and memcpy. The -march=native flag was applied to enable all the available instructions and optimizations as the fastest baselines on each system.