zipHMMlib: a highly optimised HMM library exploiting repetitions in the input to speed up the forward algorithm

Background
Hidden Markov models are widely used for genome analysis as they combine ease of modelling with efficient analysis algorithms. Calculating the likelihood of a model using the forward algorithm has worst-case time complexity linear in the length of the sequence and quadratic in the number of states in the model. For genome analysis, however, the length runs to millions or billions of observations, and when maximising the likelihood hundreds of evaluations are often needed. A time-efficient forward algorithm is therefore a key ingredient in an efficient hidden Markov model library.

Results
We have built a software library for efficiently computing the likelihood of a hidden Markov model. The library exploits commonly occurring substrings in the input to reuse computations in the forward algorithm. In a preprocessing step our library identifies common substrings and builds a structure over the computations in the forward algorithm which can be reused. This analysis can be saved between uses of the library and is independent of concrete hidden Markov models, so one preprocessing can be used to run a number of different models. Using this library, we achieve up to 78 times shorter wall-clock time for realistic whole-genome analyses with a real and reasonably complex hidden Markov model. In one particular case the analysis was performed in less than 8 minutes, compared to 9.6 hours for the previously fastest library.

Conclusions
We have implemented the preprocessing procedure and forward algorithm as a C++ library, zipHMM, with Python bindings for use in scripts. The library is available at http://birc.au.dk/software/ziphmm/.

http://www.biomedcentral.com/1471-2105/14/339
Written as a series of matrix operations, the forward algorithm can be evaluated in any order, since different parts of the computation can be handled independently. This makes it possible to reuse computations whenever the input contains repeated substrings [10] or to parallelise the algorithm across a number of independent threads [11].
The main contribution of this paper is a software library that uses both of these ideas to greatly speed up the forward algorithm. We present a preprocessing of the observed sequence that finds common substrings and constructs a data structure that makes the evaluation of the likelihood close to two orders of magnitude faster (not including the preprocessing time). The preprocessing of a specific sequence can be saved and later reused in the analysis of a different HMM topology. The algorithms have been implemented in a C++ library, zipHMMlib, available at http://birc.au.dk/software/ziphmm/. The library also provides an interface to the Python programming language.
Much of the theory used in zipHMMlib was also developed by Lifshits et al. [10], but while they developed the theory in the context of the Viterbi algorithm, where the preprocessing cannot be reused, we concentrate on the forward algorithm and introduce a data structure to save the preprocessing for later reuse. We furthermore extend the theory to make the computations numerically stable and introduce practical measures to make the algorithm run fast in practice and make the library accessible.
Our implementation is tested on simulated data and on alignments of chromosomes from humans with chimpanzees, gorillas and orangutans, analysed with the CoalHMM framework [7,8,12], a framework which uses changes in coalescence trees along a sequence alignment to make inference in population genetics and phylogenetics and which has been used in a number of whole-genome analyses [13-16]. Using an "isolation-with-migration" CoalHMM [17], we train the model using the Nelder-Mead simplex algorithm and measure the preprocessing time and total optimisation time. Looking at the time required to perform the entire training procedure, we achieve up to 78 times shorter wall-clock time compared to the previously fastest implementation of the forward algorithm. Even for data of high complexity and with few repetitions we achieve a speedup of a factor of 4.4.

Hidden Markov models
A Hidden Markov Model (HMM) describes a joint probability distribution over an observed sequence Y_{1:T} = y_1 y_2 . . . y_T ∈ O* and a hidden sequence X_{1:T} = x_1 x_2 . . . x_T ∈ H*, where O and H are finite alphabets of observables and hidden states, respectively. The hidden sequence is a realisation of a Markov process which explains hidden properties of the observed data. We can formally define an HMM [18] as consisting of:
• H = {h_1, h_2, . . . , h_N}, a finite alphabet of hidden states;
• O = {o_1, o_2, . . . , o_M}, a finite alphabet of observables;
• π, a vector of length N, where π(i) is the probability of the model starting in hidden state h_i;
• A, an N × N matrix, where A(i, j) is the probability of a transition from hidden state h_i to hidden state h_j; and
• B, an N × M matrix, where B(i, j) is the probability of emitting observable o_j from hidden state h_i.
An HMM is parameterised by π, A and B, which we will denote by λ = (π, A, B).

The classical forward algorithm
The forward algorithm [18] finds the probability of observing a sequence Y_{1:T} in a model λ by summing the joint probability of the observed and hidden sequences over all possible hidden sequences: Pr(Y_{1:T} | λ) = Σ_{X_{1:T}} Pr(Y_{1:T}, X_{1:T} | λ). Rather than evaluating this sum directly, the algorithm fills out an N × T table α, where entry α_t(x_t) = Pr(Y_{1:t}, x_t | λ), column by column from left to right, using the recursion

α_1(x_1) = π(x_1) B(x_1, y_1),
α_t(x_t) = B(x_t, y_t) Σ_{x_{t−1}} α_{t−1}(x_{t−1}) A(x_{t−1}, x_t).   (1)

After filling out α, Pr(Y_{1:T} | λ) can be computed as

Pr(Y_{1:T} | λ) = Σ_{x_T} α_T(x_T).
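The recursion above can be sketched in a few lines of Python. This is an illustrative reimplementation, not zipHMM code, and the model parameters are made-up toy values:

```python
import numpy as np

def forward_likelihood(pi, A, B, obs):
    """Classical forward algorithm: Pr(Y | lambda) in O(T N^2) time.
    pi  : (N,)   initial state probabilities
    A   : (N, N) transition matrix, A[i, j] = Pr(h_j | h_i)
    B   : (N, M) emission matrix,  B[i, o] = Pr(o | h_i)
    obs : observed symbols as ints in 0..M-1
    """
    alpha = pi * B[:, obs[0]]            # alpha_1(x) = pi(x) B(x, y_1)
    for y in obs[1:]:                    # fill the columns left to right
        alpha = B[:, y] * (A.T @ alpha)  # recursion (1)
    return alpha.sum()                   # Pr(Y | lambda) = sum over alpha_T

# Toy 2-state, 2-symbol model with made-up parameters:
pi = np.array([0.5, 0.5])
A  = np.array([[0.9, 0.1], [0.2, 0.8]])
B  = np.array([[0.7, 0.3], [0.4, 0.6]])
lik = forward_likelihood(pi, A, B, [0, 1, 0])   # 0.125175
```

Each iteration touches all N² transition entries once, which gives the O(TN²) running time stated in the Background section.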

The algorithm as linear algebra
In the classical forward algorithm, we compute the columns of α from left to right by the recursion in equation (1). If we can compute the last of these columns, α_T, efficiently, we can compute Pr(Y_{1:T} | λ) = Σ_{x_T} α_T(x_T). Now let α_t be the column vector containing the α_t(x_t)'s, let B_{o_i} be the diagonal matrix having the emission probabilities of o_i on the diagonal, and let

C_{o_i} = B_{o_i} A*,   (2)

where A* is the transpose of A. Then α_t can be computed using only C_{y_t} and the previous column vector α_{t−1}: α_t = C_{y_t} α_{t−1}. Thus the classical forward algorithm can be described as a series of T − 1 matrix-vector multiplications,

α_T = C_{y_T} C_{y_{T−1}} · · · C_{y_2} α_1,   (3)

as illustrated in Figure 1. The classical forward algorithm corresponds to computing this product from right to left, but since matrix multiplication is associative the product can be computed in any order. Since repeated substrings correspond to repeated matrix products, the running time can be improved by reusing shared expressions [10,11]. In the following sections we will show how we precompute a grouping of the terms based on Y_{1:T} in order to minimise the workload in the actual computation of the likelihood. To mimic this in the computation described above, we introduce, for a pair of symbols o_i o_j occurring n_{o_i o_j} times in equation (3), a new C matrix, C_{o_i o_j} = C_{o_i} C_{o_j}.
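A minimal Python sketch of this matrix formulation follows; it is illustrative only, the parameter values are invented, and real code would use BLAS-backed routines for the products:

```python
import numpy as np

def forward_matrices(A, B):
    """C_o = B_o A*, with B_o the diagonal matrix of emission
    probabilities for symbol o and A* the transpose of A."""
    return [np.diag(B[:, o]) @ A.T for o in range(B.shape[1])]

pi = np.array([0.5, 0.5])
A  = np.array([[0.9, 0.1], [0.2, 0.8]])
B  = np.array([[0.7, 0.3], [0.4, 0.6]])
C  = forward_matrices(A, B)

obs = [0, 1, 1, 0]
alpha = pi * B[:, obs[0]]     # alpha_1
for y in obs[1:]:
    alpha = C[y] @ alpha      # one matrix-vector multiplication per symbol
likelihood = alpha.sum()
```

Since C_o depends only on the symbol o and not on the position t, repeated substrings give repeated runs of identical factors in the product, which is exactly what the compression scheme below exploits.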

Exploiting repetitions in the observed sequence
Now notice that we only need to compute this matrix once and substitute it for all occurrences of C_{o_i} C_{o_j} in equation (3). Hence we can save n_{o_i o_j} matrix-vector multiplications by introducing one matrix-matrix multiplication, potentially saving us a large amount of work. These observations suggest that we can split the computation of the likelihood of a given observed sequence into a preprocessing of the sequence and the actual computation of the likelihood. In the preprocessing phase we compress the observed sequence by repeatedly finding the most frequent pair of symbols o_i o_j in the current sequence and replacing all occurrences of this pair by a new symbol. This is repeated until n_{o_i o_j} becomes too small to gain a speedup (see next section). The result is a compressed sequence Y'_{1:T'} over an extended alphabet of M' symbols, where each new symbol o_i (for i > M) represents a pair (l_i, r_i) of previously introduced symbols. This compression is identical regardless of the HMM, meaning that we can save it along with the observed sequence and reuse it for any HMM.
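The pair-replacement compression can be sketched as follows. This is a simplified illustration: the fixed min_count threshold stands in for the cost-based stopping criterion described in the next section, and the function and variable names are hypothetical, not zipHMM's API:

```python
from collections import Counter

def compress(seq, alphabet_size, min_count=2):
    """Repeatedly replace the most frequent symbol pair with a fresh symbol.
    Returns the compressed sequence and symbol2pair, mapping each new
    symbol to its two constituents (in sequence order)."""
    symbol2pair = {}
    next_symbol = alphabet_size
    while True:
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        pair, count = pairs.most_common(1)[0]
        if count < min_count:
            break                          # no pair frequent enough to pay off
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(next_symbol)    # replace the pair...
                i += 2                     # ...consuming both constituents
            else:
                out.append(seq[i])
                i += 1
        symbol2pair[next_symbol] = pair
        seq = out
        next_symbol += 1
    return seq, symbol2pair

# The example sequence 10100010011001 from Figure 2:
seq, symbol2pair = compress([1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1], 2)
```

Note that this sketch counts overlapping pairs (a run "000" counts the pair (0, 0) twice) but replaces left to right without overlap, which is good enough for illustration.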
The actual computation of the likelihood is then split into two stages. In the first stage we compute α_1 and C_{o_i} for i = 1, . . . , M using (2). We then compute C_{o_i} for increasing i = M + 1, . . . , M' by C_{o_i} = C_{l_i} C_{r_i}. In the second stage, we compute α_T by

α_T = C_{y'_{T'}} C_{y'_{T'−1}} · · · C_{y'_2} α_1,   (4)

where y'_t is the t'th symbol of the compressed sequence. This is illustrated in Figure 2, where the actual computation is drawn in solid black, while the work saved due to redundancy is shown in grey.
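The two stages can be sketched in Python. This is illustrative only: expanding the first compressed symbol is one simple way to obtain the starting vector (the very first observation has no transition step), not necessarily how the library handles it, and all names are hypothetical:

```python
import numpy as np

def expand(sym, symbol2pair):
    """Recursively unfold a composite symbol into original symbols."""
    if sym not in symbol2pair:
        return [sym]
    l, r = symbol2pair[sym]
    return expand(l, symbol2pair) + expand(r, symbol2pair)

def zip_forward(pi, A, B, comp_seq, symbol2pair):
    """Forward likelihood over a pair-compressed sequence (sketch)."""
    M = B.shape[1]
    # Stage 1: C_o = B_o A^T for original symbols, then C_k = C_r C_l for
    # each composite k -> (l, r) with constituents in sequence order
    # (applying l first and then r composes to the product C_r C_l).
    C = {o: np.diag(B[:, o]) @ A.T for o in range(M)}
    for k in sorted(symbol2pair):          # composites in creation order
        l, r = symbol2pair[k]
        C[k] = C[r] @ C[l]
    # Stage 2: run the plain recursion over the expansion of the first
    # compressed symbol, then one matrix-vector product per remaining symbol.
    head = expand(comp_seq[0], symbol2pair)
    alpha = pi * B[:, head[0]]
    for y in head[1:]:
        alpha = C[y] @ alpha
    for k in comp_seq[1:]:
        alpha = C[k] @ alpha
    return alpha.sum()

pi = np.array([0.5, 0.5])
A  = np.array([[0.9, 0.1], [0.2, 0.8]])
B  = np.array([[0.7, 0.3], [0.4, 0.6]])
# The sequence 10100010011001 from Figure 2, compressed with
# 2 -> (1, 0), 3 -> (2, 0), 4 -> (3, 1):
lik = zip_forward(pi, A, B, [2, 3, 0, 4, 4],
                  {2: (1, 0), 3: (2, 0), 4: (3, 1)})
```

By associativity of matrix multiplication, the result is identical to running the classical recursion over the uncompressed 14-symbol sequence, but stage 2 performs only 4 matrix-vector products instead of 13.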

Compression stopping criterion
While the first iterations of the preprocessing procedure compress the sequence very effectively, the last iterations do not decrease the sequence length by much, since most pairs are uncommon once more characters have been introduced. This is illustrated in Figure 3, where we see that the number of occurrences of the most frequent pair of symbols decreases superexponentially as a function of the number of iterations performed on an alignment of the human and chimpanzee chromosome 1. This means that we potentially save a lot of time on the likelihood computation by performing the first iterations, but as the slope of the curve approaches 0 we risk spending a long time on the preprocessing while saving very little time on the actual likelihood computation. To overcome this problem, we do not compress the input sequence all the way down to a single character. Assume we know that the preprocessing will not be reused for an HMM with fewer than N_min states, and let t_mv be the time required for an (N_min × N_min) × N_min matrix-vector multiplication and t_mm be the time required for an (N_min × N_min) × (N_min × N_min) matrix-matrix multiplication. In iteration i of the preprocessing we replace the most frequent pair of two symbols in the current sequence and find the most frequent pair of two symbols in the resulting sequence. Thus if p_i is the number of occurrences of the pair found in iteration i, pre_i is the time required for iteration i, and e is an estimate (given by the user) of the number of times the preprocessing is going to be reused (for example in a number of training procedures each calling forward several times), then, assuming that the matrix-vector multiplications and matrix-matrix multiplications dominate the runtime of the actual likelihood computations, the amount of time saved by running iteration i is e(t_mv p_{i−1} − t_mm) − pre_i, as we save p_{i−1} matrix-vector multiplications in each likelihood computation at the cost of introducing one new matrix-matrix multiplication. This means that the optimal time to stop the preprocessing is before iteration j, where j is the minimal value of i making e(t_mv p_{i−1} − t_mm) − pre_i less than or equal to 0. However, we do not know pre_i before iteration i has been completed, but we can estimate it by pre_{i−1}. Thus we stop the preprocessing just before iteration j, where j is the minimal value of i making e(t_mv p_{i−1} − t_mm) − pre_{i−1} less than or equal to 0. The values t_mv and t_mm are measured prior to the preprocessing, whereas the user has to supply the estimate, e, of the number of reuses of the preprocessing, along with N_min. If a single value of N_min cannot be determined, we allow the user to specify a list of state space sizes (N_min^1, N_min^2, ...) for which the preprocessing should be saved. If no N_min values are provided, the compression is stopped whenever p_i = p_{i−1} for the first time.

Figure 1. Classical approach to the forward algorithm. The classical forward algorithm, as described by Rabiner [18]. The rectangles represent matrices and vectors, and the black lines denote matrix-vector multiplications. The top row is the C_{o_i} matrices; α_i is obtained from α_{i−1} and C_{o_i}. Note that the input sequence is inverted to illustrate that the series of matrix-vector multiplications should be carried out from right to left.

Figure 2. Reusing common expressions to speed up the forward algorithm. The actual computation of the likelihood of the observed sequence 10100010011001 given a specific HMM. The rectangles represent matrices and vectors, and lines between them represent dependencies. In stage 1 the C_{o_i} matrices are computed; the solid black lines show the work performed, while the grey lines show the work saved due to redundancy. C_2, for example, is computed as the product C_1 C_0, and this multiplication is saved three times. In the second stage α_T is computed from the compressed sequence and the C_{o_i} matrices. As in Figure 1, α_i is computed from α_{i−1} and C_{o_i}.
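The stopping rule amounts to a one-line test, sketched here with invented timing numbers:

```python
def worth_running(e, t_mv, t_mm, p_prev, pre_prev):
    """Estimated net time saved by one more compression iteration: each of
    the e expected likelihood computations saves p_prev matrix-vector
    products but gains one matrix-matrix product, while the iteration
    itself is estimated (from the previous one) to cost pre_prev seconds."""
    return e * (t_mv * p_prev - t_mm) - pre_prev > 0

# Invented timings: a matrix-vector product costs 1 us, a matrix-matrix
# product 5 us, the previous iteration took 2 s, and the preprocessing is
# expected to be reused e = 500 times.
assert worth_running(e=500, t_mv=1e-6, t_mm=5e-6, p_prev=5000, pre_prev=2.0)
# With only 3 occurrences of the most frequent pair left, the next
# iteration no longer pays off, so compression stops:
assert not worth_running(e=500, t_mv=1e-6, t_mm=5e-6, p_prev=3, pre_prev=2.0)
```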

Numerical stability
All our matrices contain probabilities, so all entries are between 0 and 1. This means that their products tend towards 0 exponentially fast. The values of these products will normally be stored in one of the IEEE 754 floating-point formats. These formats have limited precision, and if the above were implemented naïvely the results would quickly underflow.
If we can make do with log(Pr(Y_{1:T} | λ)), we can prevent this underflow by repeatedly rescaling the matrices, much in the same way as the columns are rescaled in the numerically stable version of the classical forward algorithm [18]. To make this work in our case, we will normalise the result of each matrix-matrix multiplication or matrix-vector multiplication we do throughout the algorithm and work with the normalised matrices instead. We first take care of the rounding errors that can propagate through the first stage of the likelihood computations (depicted in the top part of Figure 2) if the dependency graph between the new symbols is deep. For each symbol o_i in the original alphabet, let s_{o_i} be the sum of all entries in C_{o_i}, and let

C̃_{o_i} = C_{o_i} / s_{o_i}.

For each new symbol o_i with constituents (l_i, r_i), let s_{o_i} be the sum of all entries in C̃_{l_i} C̃_{r_i}, and let

C̃_{o_i} = C̃_{l_i} C̃_{r_i} / s_{o_i}.

Finally let S_{o_i} be the product of s_{o_i} and the scaling factors of all the symbols o_i is recursively composed of. Then

C_{o_i} = S_{o_i} C̃_{o_i}.

Thus to handle the underflow in the first stage, we compute s_{o_i} along with C̃_{o_i} for i = 1, . . . , M' (see Figure 4) and account for the accumulated scaling factors in the second stage of the likelihood computation. However, the C̃_{o_i} matrices still only contain values between 0 and 1, and their product will therefore still tend towards 0 exponentially fast, causing underflow. To prevent this we introduce a scaling factor d_i for each of the T' − 1 matrix-vector multiplications in (4), set to be the sum of the entries in the resulting vector. Each d_i is used two times: first we normalise the corresponding resulting vector by dividing each entry by d_i, and then we use it to restore the correct result at the end of the computations. Assume that ᾱ_T is the result of the T' − 1 normalised matrix-vector multiplications. Then

Pr(Y_{1:T} | λ) = (∏_i d_i) (∏_{t=2}^{T'} S_{y'_t}) Σ_x ᾱ_T(x).

Notice, however, that we now risk getting an underflow when computing these products if T is big. We handle this by working in log-space:

log Pr(Y_{1:T} | λ) = Σ_i log d_i + Σ_{t=2}^{T'} log S_{y'_t} + log Σ_x ᾱ_T(x),

where the log S_{o_i} values are computed recursively as log S_{o_i} = log s_{o_i} + log S_{l_i} + log S_{r_i}.
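The stage-2 rescaling can be sketched as follows. This is a simplified illustration of the d_i scaling on an uncompressed sequence only; zipHMM additionally normalises the composite matrices in stage 1 and carries their log-scales along:

```python
import numpy as np

def log_forward(pi, A, B, obs):
    """Numerically stable forward algorithm returning log Pr(Y | lambda).
    After every matrix-vector multiplication the vector is divided by the
    sum of its entries (the scaling factor d_i), and the scale is restored
    in log-space at the end."""
    C = [np.diag(B[:, o]) @ A.T for o in range(B.shape[1])]
    alpha = pi * B[:, obs[0]]
    log_lik = 0.0
    for y in obs[1:]:
        alpha = C[y] @ alpha
        d = alpha.sum()              # scaling factor d_i
        alpha = alpha / d            # keep the entries in a safe range
        log_lik += np.log(d)         # ...and remember the scale
    return log_lik + np.log(alpha.sum())

pi = np.array([0.5, 0.5])
A  = np.array([[0.9, 0.1], [0.2, 0.8]])
B  = np.array([[0.7, 0.3], [0.4, 0.6]])
# 4000 observations: the unscaled product would underflow double precision,
# but the rescaled computation returns a finite log-likelihood.
ll = log_forward(pi, A, B, [0, 1] * 2000)
```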

Practical implementation details
In our implementation of the preprocessing phase described above, we simply build a map symbol2pair, mapping each new alphabet symbol o i to its two constituents (l i , r i ). In each scan every pair of symbols is counted, and the most frequent pair in the previous round is replaced by a new symbol. The data structure being saved in the end is symbol2pair along with two other maps: nstates2alphabetsize and nstates2seq.
When the likelihood is later computed for an HMM with N states, the compressed sequence and extended alphabet matching N are looked up using the two maps created in the first stage. To obtain maximal performance, we use a BLAS implementation for C++ to perform the series of matrix multiplications. Our implementation uses O(Tk) space in the preprocessing phase and O(N²(T + M')) space in the actual computation, where k is the number of N_min values supplied by the user, N is the number of states in the HMM used in the actual computation, and M' is the number of symbols in the extended alphabet corresponding to N in nstates2alphabetsize. If the preprocessed data structure is saved to disk, it takes up O(Tk) space.
We have also implemented the algorithm in a parallelised version. In this version, stage 2 is parallelised much like the implementation in parredHMMlib [11], where the series of matrix multiplications in (4) is split into a number of blocks which are then processed in parallel. Stage 1 can clearly also be parallelised by computing independent C̃_{o_i} matrices in parallel. However, we found that this does not work well in practice, as the workload in stage 1 is not big enough to justify the parallelisation. Stage 1 is therefore not parallelised in the library. The parallelisation of stage 2 gives the greatest speedup for long sequences that are not very compressible. This is because the parallelisation in general works best for long sequences [11], and if the input sequence is very compressible then the compressed sequence will be short and more work will be done in the non-parallelised stage 1. The experiments presented in the next section have all been run single-threaded to get a clearer picture of how the runtime of the basic algorithm is influenced by the characteristics of the input sequence and model. In general, however, a slightly faster running time can be expected if parallelisation is enabled, especially for long sequences of high complexity.

Results and discussion
We have implemented the above algorithms in a C++ library named zipHMM. The code provides both a C++ and a Python interface to the functionality of reading and writing HMMs to files, preprocessing input sequences and saving the results, and computing the likelihood of a model using the forward algorithm described in the previous section. The library uses BLAS for linear algebra operations and pthreads for multi-threaded parallelisation.

Using the library
The library can be used directly in C++ programs or through Python wrappers in scripts.

Using zipHMM from C++
When using the library from C++, the most important class is Forwarder, which is responsible for preprocessing sequences, reading and writing the preprocessed data structure, and computing the likelihood of a hidden Markov model. The code snippet in Figure 5(a) shows a complete C++ program that reads in an input sequence, f.read_seq(...), preprocesses it (as part of reading in the sequence), stores the preprocessed structure to disk, f.write_to_directory(...), reads in an HMM from disk, read_HMM(...), and computes the likelihood of the HMM, f.forward(...).
The sequence reader takes the alphabet size as a parameter. This is because we cannot necessarily assume that the observed symbols in the input sequence exhaust the symbols the HMM can emit, so we need to know the alphabet size explicitly. It furthermore takes an optional parameter in which the user can specify an estimate, e, of the number of times the preprocessing will be reused. The default value of this parameter is 1. If the preprocessed sequence is already stored on disk, we can simply read that instead like this:

    Forwarder f;
    size_t number_of_states = 4;
    f.read_from_directory("example_preprocessed", number_of_states);

This will cause the saved sequence matching N_min ≤ 4 to be read from the directory example_preprocessed, together with additional information on the extended alphabet used in this sequence.
In the library, HMMs are implicitly represented simply by a vector and two matrices, the π vector of initial state probabilities and the transition, A, and emission, B, matrices as described in the Implementation section. These are all represented in a Matrix class, and in the program in Figure 5(a) these are read in from disk. They can also be directly constructed and manipulated in a program. In our own programs we use this, together with a numerical optimisation library, to fit parameters by maximising the likelihood.
The f.forward(...) method computes the likelihood sequentially using the preprocessed structure. To use the multi-threaded parallelisation instead, one simply uses the f.pthread_forward(...) function, with the same parameters, instead.
For completeness the library also offers implementations of the Viterbi and posterior decoding algorithms. To use these from C++, the headers viterbi.hpp and posterior_decoding.hpp should be included and the functions viterbi(...) and posterior_decoding(...) should be called as described in the README file in the library.

Using zipHMM from Python
All the C++ classes in the library are wrapped in a Python module, so the full functionality of zipHMM is available for Python scripting using essentially the same API, except with a more Pythonic flavour where appropriate: e.g. reading in data is handled by returning multiple values from function calls instead of pass-by-reference function arguments, and a more typical Python naming convention is used. Figure 5(b) shows the Python equivalent of the C++ code in Figure 5(a).

Performance
To evaluate the performance of zipHMM we performed a number of experiments using a hidden Markov model previously developed to infer population genetics parameters of a speciation model. All experiments were run on a machine with two Intel Sandy Bridge E5-2670 CPUs, each with 8 cores running at 2.67 GHz and with access to 64 GB of main memory. We compare the performance of our forward algorithm to the performance of the implementations of the forward algorithm in HMMlib [19] and in parredHMMlib [11] and to a simple implementation of equation (3) using BLAS to perform the series of matrix-vector multiplications. HMMlib is an implementation that takes advantage of the features of a modern computer, such as SSE instructions and multiple cores. The individual features of HMMlib can be turned on or off by the user, and we recommend only enabling these features for HMMs with large state spaces. In all our experiments we enabled the SSE parallelisation but used only a single thread. The parredHMM library implements equation (3) as a parallel reduction, splitting the series of matrix multiplications into a number of blocks and processing the blocks in parallel. The parredForward algorithm was calibrated to use the optimal number of threads.
For performance evaluation we wanted to evaluate how well the new algorithm compares to other optimised forward implementations, evaluate the trade-off between preprocessing and computing the likelihood, and explore how the complexity of the input string affects the running time.
Our new implementation of the forward algorithm is expected to perform best on strings of low complexity, because they are more compressible. To investigate this we measured the per-iteration running time of the forward algorithm for parredHMMlib, HMMlib and the simple implementation of equation (3) on random binary sequences (over the alphabet {0, 1}) of length L = 10^7, with the frequency of 1s varying from 0.0001 to 0.05, and divided it by the per-iteration running time for zipHMMlib (excluding the preprocessing time) to obtain the speedup factor. This experiment is summarised in Figure 6, where we note that the speedup factor decreases linearly with the complexity of the input sequence; however, speedup factors of more than two orders of magnitude are obtained for the less complex sequences, and even for sequences of high complexity a (modest) speedup is obtained.
In the rest of our experiments, we used a coalescent hidden Markov model (CoalHMM) from [17] together with real genomic data. A CoalHMM [7,8,12] exploits changing gene-trees along a genomic alignment to infer population genetics parameters. The "Isolation-with-Migration" CoalHMM from [17] considers a pairwise alignment as its observed sequence and a discretisation of the time to the most recent common ancestor, or "coalescence time", of the aligned sequences as its hidden states. The coalescence time can change from any point to another, so the transition matrix of the CoalHMM is fully connected, and the number of hidden states can be varied depending on how fine-grained we want to model time. Varying the number of states lets us explore the performance as a function of the number of states. The performance as a function of the length of the input was explored by using alignments of varying length. Finally, to explore how the complexity of the string affects the performance, we used alignments of sequences at varying evolutionary distance, since more closely related genomes have fewer variable sites and thus the alignments have lower complexity. The CoalHMM model uses a Jukes-Cantor model in its emission probabilities and thus only distinguishes between whether a specific site has two identical nucleotides or two different nucleotides in the alignment. We therefore also varied the complexity of the strings by compressing either the actual sequence alignment or a summary of it over an alphabet of size three. For all experiments we trained the model using the Nelder-Mead simplex algorithm and measured the preprocessing time and the total optimisation time; the expected number of likelihood computations was set to e = 500. Figure 7(a) shows how the performance of zipForward changes when the size of the model is increased.
We note that the total time, as expected, depends very heavily on the number of states (the time complexities of stages 1 and 2 are cubic and quadratic in the number of states, respectively), while the time used on preprocessing, also as expected, varies very little and shows no clear pattern as the number of states is increased. Figure 7(b) shows how the runtime for training the CoalHMM with zipForward changes when the sequence length is varied. We expected the runtime to increase with the sequence length; however, this is not what the results show for the shorter sequences. This is due to the optimisation procedure, which required more iterations of the likelihood computation for the shorter sequences than for the longer sequences. For the longest sequences the runtime grows sublinearly, which was expected, since longer sequences often compress relatively more than shorter sequences.
We expected alignments of sequences at short evolutionary distance to be more compressible than alignments of sequences at longer evolutionary distance, and therefore expected the training procedure to be faster for alignments of sequences at short evolutionary distance.
We see this in Figure 7(c), except for the sequences with M = 16, where the human-orangutan alignment was processed faster than the human-gorilla alignment. However, this was caused by the trade-off between the time for the preprocessing and the time for the actual training procedure: the preprocessing procedure took significantly longer for the human-gorilla alignment (because it was more compressible) than for the human-orangutan alignment, but this extra time was not all gained back in the training procedure, although the compressed sequence was indeed shorter (the per-iteration running time was 7.913 s for the human-gorilla alignment and 8.488 s for the human-orangutan alignment).
We also expected the total time of the training procedure to increase as the number of symbols in the initial alphabet was increased, because sequences with small initial alphabets are expected to be more compressible than sequences with larger initial alphabets. But as Figure 7(c) shows, the sequences with an initial alphabet of size M = 25 were processed faster than the sequences with an initial alphabet of size M = 16. This is again caused by the optimisation procedure, which converges faster for the sequences with M = 25 than for the sequences with M = 16 (e.g. for the human-gorilla alignments, the number of evaluations of the likelihood was 860 for M = 25 and 1160 for M = 16). This may be a result of the sequences with M = 25 containing more information than the sequences with M = 16. The lengths of the compressed sequences and the per-iteration running times match our expectations, and for the sequences with M = 3 and M = 25, which contain the same data, the algorithm behaves as expected. Figure 7(d) shows the performance of the four different implementations of the forward algorithm. Each of the four algorithms was used to train the CoalHMM on an alignment of the entire human and chimpanzee chromosome 1, using an HMM with 16 states and an initial alphabet of size 3. The training procedure finished in 7.4 minutes (446 seconds) using zipForward, including the preprocessing time. This gives a speedup factor of 77.7 compared to the previously fastest implementation, parredHMMlib, which used 9.6 hours (34,657 seconds). It is therefore evident that zipForward is clearly superior to the three other implementations on this kind of input. The time used per iteration of the likelihood computation was 0.5042 seconds for zipHMMlib, while it was 46.772 seconds for parredHMMlib, giving a speedup factor of 92.8 on the actual optimisation procedure (excluding preprocessing time).
Repeating the same experiment on full alignments over alphabets of size 16 and 25 (not shown here), where zipForward clearly performs worse than for sequences with alphabets of size 3 (see Figure 7(c)), we still obtained total speedup factors of 4.4 for both experiments.

Conclusions
We have engineered a variant of the HMM forward algorithm that exploits repetitions in strings to reduce the total amount of computation, by reusing shared subexpressions. We have implemented this in an easy-to-use C++ library, with a Python interface for use in scripting, and we have demonstrated that our library can be used to achieve speedups by factors of 4 to 78 for realistic whole-genome analyses with a reasonably complex hidden Markov model.