CacheOblivious parallel SIMD Viterbi decoding for sequence search in HMMER
 Miguel Ferreira^{1, 2},
 Nuno Roma^{1, 2} and
 Luis MS Russo^{1, 2}Email author
https://doi.org/10.1186/1471210515165
© Ferreira et al.; licensee BioMed Central Ltd. 2014
Received: 22 October 2013
Accepted: 4 April 2014
Published: 30 May 2014
Abstract
Background
HMMER is a commonly used bioinformatics tool based on Hidden Markov Models (HMMs) to analyze and process biological sequences. One of its main homology engines is based on the Viterbi decoding algorithm, which was already highly parallelized and optimized using Farrar’s striped processing pattern with Intel SSE2 instruction set extension.
Results
A new SIMD vectorization of the Viterbi decoding algorithm is proposed, based on an SSE2 intertask parallelization approach similar to the DNA alignment algorithm proposed by Rognes. Besides this alternative vectorization scheme, the proposed implementation also introduces a new partitioning of the Markov model that allows a significantly more efficient exploitation of the cache locality. Such optimization, together with an improved loading of the emission scores, allows the achievement of a constant processing throughput, regardless of the innermostcache size and of the dimension of the considered model.
Conclusions
The proposed optimized vectorization of the Viterbi decoding algorithm was extensively evaluated and compared with the HMMER3 decoder to process DNA and protein datasets, proving to be a rather competitive alternative implementation. Being always faster than the already highly optimized ViterbiFilter implementation of HMMER3, the proposed CacheOblivious Parallel SIMD Viterbi (COPS) implementation provides a constant throughput and offers a processing speedup as high as two times faster, depending on the model’s size.
Keywords
Background
Sequence alignment algorithms
One of the most used alignment algorithms for sequence homology search is the SmithWaterman algorithm [1]. It computes the optimal local alignment and the respective similarity score between the most conserved regions of two sequences, with a complexity proportional to $\mathcal{O}\left({N}^{2}\right)$. The algorithm is based on a Dynamic Programming (DP) approach that considers three possible mismatches: insertions, deletions, and substitutions. To ensure that a local alignment is found, the computed scores are constrained to a minimum value of 0, corresponding to a restart in the alignment. To circumvent the computational complexity of the SmithWaterman and similar alignment algorithms, alternative heuristic methods (like BLAST [2]) were developed. However, their lower complexity is obtained at the cost of sacrificing the resulting sensibility and accuracy.
Other authors have even focused on the use of more specialized hardware architectures, such as GPUs [5], ASICs [6], or on parallelizing the algorithm onto a multinode grid, usually by dividing the sequence database in blocks and independently searching on each block.
Markov models and Viterbi decoding
Instead of searching with a single query sequence, several applications have adopted a previously built consensus, conveniently defined from a family of similar sequences. This consensus structure is usually known as a consensus profile and it provides a more flexible way to identify homologs of a certain family, by highlighting the family’s common features and by downplaying the divergences between the family’s sequences.
A common method to perform a profile homology search rests on a wellknown machine learning technique: Hidden Markov Modelss (HMMs). As an example, an HMM may be constructed to model the probabilistic structure of a group of sequences, such as a family of proteins. Such resulting HMM is then used to search within a sequence database, by computing the probability of that sequence being generated by the model. HMMs may also be used to find distant homologs, by iteratively building and refining a model that describes them (such as in the SAM tool [7]).
The most important algorithms to process HMMs are the Forward algorithm, which gives the full probability for all possible model state paths; and the Viterbi’s algorithm, used to compute the most likely sequence of model states for the generation of the considered sequence. The complete path of states that is extracted by the application of Viterbi’s procedure thus corresponds to an optimal alignment of the considered sequence against the profiled model.
In this equation, P(x_{ i }V_{ j }) represents the probability of observing x_{ i } in state V_{ j }. The ${t}_{{j}^{\prime}j}$ term represents the transition probability from state ${V}_{{j}^{\prime}}$ to state V_{ j }.
The terms e_{ I j }(x_{ i })/q_{ x i }, relating the emission probabilities (e_{ I j }(x_{ i })) and the background probability distribution (q_{ x i }) of belonging to the standard random model, represent a normalized probability of observing the event x_{ i } at state Ij. The remaining variables in these equations represent the following: ${V}_{j}^{M}\left(i\right)$ represents the logarithm of the probability of the most likely path ending at state M_{ j } in column j, after processing i letters from a given sequence. Likewise, ${V}_{j}^{I}\left(i\right)$ and ${V}_{j}^{D}\left(i\right)$ represent the logarithm of the probability of an insertion and deletion, respectively. t_{ X Y } represents the probability of transitioning from one state to another (for example, ${t}_{{M}_{2}{D}_{3}}$ represents the probability of transitioning from M_{2} to D_{3}).
HMMER
This latest HMMER version also introduced a processing pipeline, comprehending a combination of several incremental filters. Each incremental filter is more accurate, restrictive and expensive than the previous one. All of these filters have already been parallelized by SingleInstruction MultipleData (SIMD) vectorization using Farrar’s striped processing pattern [3]. The ViterbiFilter, in particular, has been parallelized with 16bit integer scores. Accordingly, the present work proposes a new parallelization approach of this filter based on Rognes’ processing pattern [4], with a novel strategy to improve the cache efficiency.
CacheOblivious SIMD Viterbi with intersequence parallelism
The proposed CacheOblivious Parallel SIMD Viterbi (COPS) algorithm represents an optimization of the Viterbi filter implementation in local unihit mode (i.e., the mode corresponding to the original SmithWaterman local alignment algorithm). Global alignment is not currently supported by the latest version of HMMER.
The presented implementation was developed on top of the HMMER suite, as a standalone tool. A full integration into the HMMER pipeline was deemed unsuitable, since the pipeline was designed to execute a single sequence search at a time, while the proposed approach exploits intersequence parallelism, i.e., it concurrently processes several sequences at a time in the SIMD SSE2 vector elements.
Rognes’ strategy applied to Viterbi decoding
Although HMMER extensively adopts the Farrar’s intrasequence vectorization approach, the presented research demonstrates that the intersequence parallel alignment strategy that was proposed by Rognes [4] can be equally applied to implement the Viterbi decoding algorithm. The proposed vectorization comprehends the computation of the recursive Viterbi relations, by using three auxiliary arrays to hold the previous values of the Match (M), Insert (I) and Delete (D) states (see Figure 4). After each loop over the normal states, the special states (E and C) are updated. Since the proposed implementation does not support multhit alignments, the J transitions were removed from the original model.
Just like Farrar’s and Rognes’ vectorizations, the implementation that is now proposed uses 128bit SIMD registers, composed by eight 16bit integer scores, to simultaneously process eight different sequences. Furthermore, similarly to the HMMER implementation, the scores are discretized by using a simple scaling operation, with an additional bias and saturated arithmetic. Hence, just like the ‘2.0 nat approximation’ that is used by HMMER, the N → N and C → C transitions were set to zero, and a 2.0 score offset was added at the end. This value approximates the cumulative contribution of N → N and C → C transitions which, for a large L, is given by $log\frac{L}{L+2}$. As a result, the B contributions become constant, since they only depend on the N values (which are constant) and on the J values (which are zero in unihit modes).
Inline preprocessing of the scores
Rognes’ method to preload and preprocess the emission scores before each inner loop iteration (i.e., iteration over the model states) suffers from a considerable handicap: it needs an additional rewrite of the scores to memory, before the actual Viterbi decoding can start. To circumvent this problem, an alternative approach is herein proposed. Instead of transposing all the emission scores for each tuple of residues in the outer loop of the algorithm (Loop B in Listing 1 (a), over the sequence residues), the transposition was moved inwards to the inner loop (Loop A) and subsequently unrolled for 8 iterations. Hence, each iteration starts by preloading 8 emission values: one from each of the 8 continuous arrays. These emission values are then transposed and striped into 8 temporary SSE2 vectors and used in the computation of the next model state for each of the 8 sequences under processing. Hence, the inner loop is unrolled into the 8 statetriplets that are processed by each loop iteration. With this approach, the emission scores can be kept in close memory, thus improving the memory and cache efficiency. Furthermore, the rewriting in memory during this preloading phase is also avoided.
To take full advantage of this vectorization approach, the number of considered model states should be always a multiple of 8 (in order to occupy the 8 available SSE channels). Nevertheless, this restriction is easily fulfilled, by padding the model with dummy states up to the next multipleof8 state barrier. These dummy states should carry dummy scores (set to ∞), so that they have a null influence on the final results, representing a negligible effect on the overall performance. According to the conducted evaluations (further detailed in the latest sections of this manuscript), this optimization of the inlined scores loading procedure leads to an execution time roughly 30% faster than the preloading method used by Rognes’ tool.
Model partitioning
One common problem that is often observed in these algorithms is concerned with the degradation of the cache efficiency when the score arrays exceed the capacity of the innermostlevel data caches, leading to an abrupt increase of the number of cache misses and causing a substantial reduction of the overall performance. This type of performance penalties is also present in HMMER Farrarbased ViterbiFilter implementation, whenever larger models are considered.
To circumvent this cache efficiency problem, a looptiling (a.k.a. stripmining) strategy based on a partitioning of the model states was devised in the proposed implementation, in order to limit the amount of memory required by the core loop. The required code transformations are illustrated in Listing 1(b). Accordingly, the M, I and D model states are split in blocks (or partitions), whose optimal dimension (Maximum Partition (MP) length) is parameterized according to the size and organization of the L1 data (L1D) cache. With this approach, all the defined partitions are now iterated in a new outermostloop (Loop C, in Listing 1(b)). As a result, the inner loop (Loop A) is substantially shorter and it is now possible to obtain an optimal cache use in loops A and B — the middle loop (Loop B) iterates over the 8 database sequences, while the inner loop (Loop A) iterates over a single partition of model states.
Memory footprint (in Bytes) required by the proposed COPS implementation, when compared with the original HMMER ViterbyFilter
Data structure  COPS (proposed)  ViterbiFilter 

(HMMER)  
Mmx, Dmx, Imx  3×M×16  3×M×2 
Transition scores  8×M×16  8×M×2 
Emission match (E.M.) scores  M×16  M×2 
Auxiliary emission array  24×16  – 
∼20 aux. variables  20×16  20×16 
Total  192×M+700  24×M+320 
Total minus E.M. scores  176×M+700  22×M+320 
Max. M to fill a 32 KB cache  $\frac{32768700}{192}\approx 167$  $\frac{32768320}{24}\approx 1350$ 
Nevertheless, a conservative tolerance should be considered when approaching this maximum estimate, justified by the sharing of the L1D cache with other variables not correlated with this processing loop, process or thread. In fact, the conducted experimental procedures demonstrated that the actual MP values are very close to the best values that were theoretically estimated:

112 to 120 states, for 32 KB L1D CPUs (e.g. Intel Core, Core2, Nehalem, Sandy Bridge, Ivy Bridge and Haswell);

around 48 states, for 16 KB L1D CPUs (e.g. AMD Opteron Bulldozer and Piledriver);

216 to 224 states, for 64 KB L1D CPUs (e.g. AMD Opteron K8, K10, Phenom I and II).
There are, however, two memory blocks that cannot be stripmined:

Emission scores, which must be refreshed (recomputed) for each new round of sequence tokens. These values are accessed only once, so it is counterproductive to consider their cacheability.

Dependencies that must be exchanged between adjacent partitions. The last Match (M), Insert (I) and Delete (D) contributions from each partition have to be carried on in the next partition, and so they have to be saved at the end of each partition. Hence, each partition receives as input one line of previous states, with one statetriplet for each 8fold round of sequences, and produces as output another line of values to be used by the next partition. These dependencies can be minimized to 3 values per sequence round (xmxE, Mnext and Dcv) after refactoring the core code and moving the computation of Mnext with the 3 state dependencies to the end. The refactored inner loop code can be seen in Listing 4.
Methods
To conduct a comparative and comprehensive evaluation of the proposed approach, the COPS algorithm was ran against the ViterbiFilter implementation of HMMER 3.1b1, based on Farrar’s striped vectorization. For such purpose, a benchmark dataset comprehending both DNA and protein data was adopted, covering a wide spectrum of model lengths, ranging from 50 to 3000 model states, with a step of about 100.
In particular, the DNA data consisted on HMMs sampled from Dfam 1.2 database of Human DNA HMMs [11], and the human genome retrieved from the NCBI archive.
As of March 2013, Dfam uses HMMER3.1b1 to create the models. The complete list of HMMs is the following (the length is prefixed to the model name): M0063U7, M0804LTR1E, M1597Tigger6b, M2500L1M4c_5end, M0101HY3, M0900MER4D1, M1683FordPrefect, M2596L1P4a_5end, M0200MER107, M1000L1MEg2_5end, M1795L1MB4_5end, M2706Charlie3, M0301Eulor9A, M1106L1MD2_3end, M1961Charlie4, M2789L1MC4_3end, M0401MER121, M1204Charlie17b, M2101L1MEg_5end, M2904L1M2_5end, M0500LTR72B, M1302HSMAR2, M2204CR1_Mam, M2991HAL1M8, M0600MER4A1, M1409MLT1Hint, M2275L1P2_5end, M0700MER77B, M1509LTR104_Mam, M2406Tigger5.
The protein data consisted on a mix of 13 small and mediumsized HMMs from Pfam 27.0 [12] and 17 large HMMs created with hmmerbuild tool from Protein Isoforms sampled from Uniprot, and the NRDB90 [13] nonredundant protein database. The short protein models, from Pfam, were the following: M0063ACT_5, M0400Alginate_exp, M0800Patched, M1201DUF3584, M0101Bactofilin, M0500Lant_dehyd_C, M0900PolC_DP2, M1301Orbi_VP1, M0201Adeno_52K, M0600Mpp10, M1002SrfB, M0300Aldose_epim, M0700Pox_VERT_large, M1098CobNMg_che.
The longer models used were generated from the following Uniprot Isoforms: M1400Q8CGB6, M1800Q9BYP7, M2203P27732, M2602O75369, M1500Q9V4C8, M1901Q64487, M2295Q3UHQ6, M2703Q8BTI8, M1600Q6NZJ6, M2000Q9NY46, M2403Q9UGM3, M2802Q9DER5, M1700Q3UH06, M2099Q8NF50, M2505O00555, M2898Q868Z9, M3003A2AWL7.
The benchmarks were run on two different machines:

Intel Core i7 3770 K, with an Ivy Bridge architecture, running at 3.50 GHz with a 32 KB L1D cache;

AMD Opteron 6276, with a Bulldozer architecture, running at 2.3 GHz with a 16 KB L1D cache.
All the timings were measured as total walltime, by using the Linux ftime function.
Results and discussion
Cache misses
To evaluate the cache usage efficiency of the considered algorithms, the number of L1D cache misses for the COPS tool and for the HMMER ViterbiFilter implementations were measured with PAPI performance instrumentation framework [14]. To ensure a broader and more comprehensive coverage of measures, a wider and random dataset of DNA models was considered in this specific evaluation.
Performance
For short models (< 100 bps), the penalizing overhead of Farrar’s LazyF loop is clearly evident. As a result, the HMMER ViterbiFilter has a very poor performance on these models. In contrast, the proposed COPS solution does not suffer from this problem and presents a much smaller performance penalty in these small models (mainly from the initialization costs between each innerloop execution). As a result, with these short models, COPS achieved a considerable 1.7fold speedup, when compared with HMMER.
For mediumlength models (between 100 and 500 bps on 16 KBL1D machines, and up to ≈1000 bps on 32 KBL1D machines), the proposed COPS implementation is about as good as HMMER, reducing the observed speedup to about 1.2fold. These performance values correspond to model lengths where the striped version does not exceed the size of the innermost data cache.
For longer models, from 500 bps or 1000 bps (depending on the L1D size), it can be observed that the performance of HMMER quickly deteriorates as the length of the model increases and the memory requirements of HMMER Farrarbased approach reach the maximum that the innermost L1D caches can provide (usually, 32 KB on Intel and 16 KB on AMD CPUs). In contrast, the proposed intersequence COPS is able to consistently maintain the same performance level with increasingly long models, thus achieving a 2fold speedup on AMD and a 1.5fold on Intel, against the HMMER version for longer models.
Conclusions
The main insight of the presented approach is based on the observation that current parallel HMM implementations may suffer severe cache penalties when processing long models. To circumvent this limitation, a new vectorization of the Viterbi decoding algorithm is proposed to process arbitrarily sized HMM models. The presented algorithm is based on a SSE2 intersequence parallelization approach, similar to the DNA alignment algorithm proposed by Rognes [4]. Besides the adopted alternative vectorization approach, the proposed algorithm introduces a new partitioning of the Markov model that allows a significantly more efficient exploitation of the cache locality. Such optimization, together with an improved loading of the emission scores, allows the achievement of a constant processing throughput, regardless of the innermostcache size and of the dimension of the considered model.
In what concerns the partitioning, the proposed implementation was based on the observation that the poor cache performance of HMMER is related to the size of the model and to the fact that it is necessary to update all the states in the model for every letter of a query sequence. As a result, large models will force recently computed values out of cache. When this phenomena occurs for every letter in a query, it naturally results in a significant bottleneck.
We speculate that a similar phenomena occurs for the striped pattern of Farrar, in which case our partitioning technique could prove useful. Still, Farrar’s algorithm processes one single query at a time, instead of 8. Therefore, the slowdown should only occurs for models 8 times larger, i.e., models of size larger than 10800.
According to the extensive set of assessments and evaluations that were conducted, the proposed vectorized optimization of the Viterbi decoding algorithm proved to be a rather competitive alternative implementation, when compared with the state of the art HMMER3 decoder. Being always faster than the already highly optimized HMMER ViterbiFilter implementation, the proposed implementation provides a constant throughput and proved to offer a processing speedup as high as 2, depending on the considered HMM model size and L1D cache size.
Future work may also extend this approach to Intel’s recent instructionset extension AVX2, allowing the processing of twice more vector elements at a time.
Availability and requirements
Project name: COPS (CacheOblivious SIMD Viterbi with InterSequence Parallelism)Project home page:https://kdbio.inescid.pt/~lsr/COPSOperating system(s): LinuxPlatform independent Programming language: Crequirements: gcc, makeLicense: a variation of the Internet Systems Consortium (ISC) license.Restrictions to use by nonacademics: Referencing this work.
Declarations
Acknowledgements
This work was partially supported by national funds through Fundação para a Ciência e a Tecnologia (FCT), under project “HELIX: Heterogeneous MultiCore Architecture for Biological Sequence Analysis” (reference number PTDC/EEAELC/113999/2009), project “DataStorm  Large scale data management in cloud environments” (reference number EXCL/EEIESS/0257/2012) and project PEstOE/EEI/LA0021/2013.
Authors’ Affiliations
References
 Smith TF, Waterman MS: Identification of common molecular subsequences. J Mol Biol. 1981, 147: 195197. 10.1016/00222836(81)900875.View ArticlePubMedGoogle Scholar
 Altschul S, Gish W, Miller W, Myers E, Lipman D: Basic local alignment search tool. J Mol Biol. 1990, 215 (3): 403410. 10.1016/S00222836(05)803602.View ArticlePubMedGoogle Scholar
 Farrar M: Striped SmithWaterman speeds database searches six times over other SIMD implementations. Bioinformatics. 2007, 23 (2): 156161. 10.1093/bioinformatics/btl582.View ArticlePubMedGoogle Scholar
 Rognes T: Faster SmithWaterman database searches with intersequence SIMD parallelisation. BMC Bioinformatics. 2011, 12: 22110.1186/1471210512221.View ArticlePubMed CentralPubMedGoogle Scholar
 Ganesan N, Chamberlain RD, Buhler J, Taufer M: Accelerating HMMER on GPUs by implementing hybrid data and task parallelism. Proceedings of the First ACM International Conference on Bioinformatics and Computational Biology. 2010, New York: ACM, 418421.View ArticleGoogle Scholar
 Derrien S, Quinton P: Hardware acceleration of HMMER on FPGAs. Sci J Circ Syst Signal Process. 2010, 58: 5367. 10.1007/s112650080262y.View ArticleGoogle Scholar
 Karplus K, Barrett C, Hughey R: Hidden Markov models for detecting remote protein homologies. Bioinformatics. 1998, 14 (10): 846856. 10.1093/bioinformatics/14.10.846.View ArticlePubMedGoogle Scholar
 Krogh A, Brown M, Mian IS, Sjolander K, Haussler D: Hidden Markov models in computational biology: Applications to protein modeling. J Mol Biol. 1994, 235 (5): 15011531. 10.1006/jmbi.1994.1104.View ArticlePubMedGoogle Scholar
 Eddy SR: Profile Hidden Markov models. Bioinformatics. 1998, 14 (9): 755763. 10.1093/bioinformatics/14.9.755.View ArticlePubMedGoogle Scholar
 Eddy SR: Accelerated profile HMM searches. PLoS Comput Biol. 2011, 7 (10): e100219510.1371/journal.pcbi.1002195.View ArticlePubMed CentralPubMedGoogle Scholar
 Wheeler TJ, Clements J, Eddy SR, Hubley R, Jones TA, Jurka J, Smit AF, Finn RD: Dfam: a database of repetitive DNA based on profile hidden Markov models. Nucleic Acids Res. 2013, 41 (D1): D70D82. 10.1093/nar/gks1265.View ArticlePubMed CentralPubMedGoogle Scholar
 Bateman A, Coin L, Durbin R, Finn RD, Hollich V, Jones SG, Khanna A, Marshall M, Moxon S, Sonnhammer ELL, Studholme DJ, Yeats C, Eddy SR: The Pfam protein families database. Nucleic Acids Res. 2004, 32 (suppl 1): D138D141.View ArticlePubMed CentralPubMedGoogle Scholar
 Holm L, Sander C: Removing nearneighbour redundancy from large protein sequence collections. Bioinformatics. 1998, 14 (5): 423429. 10.1093/bioinformatics/14.5.423.View ArticlePubMedGoogle Scholar
 Browne S, Dongarra J, Garner N, Ho G, Mucci P: A portable programming interface for performance evaluation on modern processors. Int J High Perform Comput Appl. 2000, 14 (3): 189204. 10.1177/109434200001400303.View ArticleGoogle Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.