Skip to main content

Implementing EM and Viterbi algorithms for Hidden Markov Model in linear memory



The Baum-Welch learning procedure for Hidden Markov Models (HMMs) provides a powerful tool for tailoring HMM topologies to data for use in knowledge discovery and clustering. A linear memory procedure recently proposed by Miklós, I. and Meyer, I.M. describes a memory sparse version of the Baum-Welch algorithm with modifications to the original probabilistic table topologies to make memory use independent of sequence length (and linearly dependent on state number). The original description of the technique has some errors that we amend. We then compare the corrected implementation on a variety of data sets with conventional and checkpointing implementations.


We provide a correct recurrence relation for the emission parameter estimate and extend it to parameter estimates of the Normal distribution. To accelerate estimation of the prior state probabilities, and decrease memory use, we reverse the originally proposed forward sweep. We describe different scaling strategies necessary in all real implementations of the algorithm to prevent underflow. In this paper we also describe our approach to a linear memory implementation of the Viterbi decoding algorithm (with linearity in the sequence length, while memory use is approximately independent of state number). We demonstrate the use of the linear memory implementation on an extended Duration Hidden Markov Model (DHMM) and on an HMM with a spike detection topology. Comparing the various implementations of the Baum-Welch procedure we find that the checkpointing algorithm produces the best overall tradeoff between memory use and speed. In cases where sequence length is very large (for Baum-Welch), or state number is very large (for Viterbi), the linear memory methods outlined may offer some utility.


Our performance-optimized Java implementations of Baum-Welch algorithm are available at The described method and implementations will aid sequence alignment, gene structure prediction, HMM profile training, nanopore ionic flow blockades analysis and many other domains that require efficient HMM training with EM.


Hidden Markov Models (HMMs) are a widely accepted modeling tool [1] used in various domains, such as speech recognition [2] and bioinformatics [3]. An HMM can be described as a stochastic finite state machine where each transition between hidden states ends with a symbol emission. The HMM can be represented as a directed graph with N states where each state can emit either a discrete character or a continuous value drawn from a Probability Density Function (PDF).

We are interested in a distributed HMM analysis of the channel current blockade signal caused by a single DNA hairpin molecule held in a nanopore detector [4, 5]. The molecules examined frequently produce toggles with stationary statistical profiles for thousands of milliseconds. With a sampling rate of 20 μ s, processing even a modest blockade signal of 200 ms duration (10,000 sample points) becomes problematic, mostly because of the size of the dynamic programming tables required in the conventional implementations of the HMM's Baum-Welch and Viterbi decoding algorithms. Since we are also trying to model durations [6] and spike phenomena [7], by increasing the number of HMM states, conventional HMM implementations are found to be prohibitively expensive in terms of memory use.

The Baum-Welch algorithm is an Expectation Maximization (EM) algorithm invented by Leonard E. Baum and Lloyd R. Welch, and first appears in [8]. A later refinement, Hirschberg's algorithm for an HMM [9], reduces the memory footprint by recursively halving the pairwise alignment dynamic programming table for sequences of comparable size. In our application domain, the length of the observed emission sequence (in the case of nanopore ionic flow blockade analysis or gene structure prediction) is prohibitively long compared to the number of HMM states. Further, Baum-Welch requires multiple paths, instead of the most likely one, making this strategy less than optimal.

The checkpointing algorithm [1012] implements the Baum-Welch algorithm in O ( T N ) MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGaem4ta8KaeiikaGYaaOaaaeaacqWGubavaSqabaGccqWGobGtcqGGPaqkaaa@3129@ memory and in O(TNQ max + T(Q + E)) processor time, where T is the length of the observed sequence, Q max is the maximum HMM node out-degree, E is the number of free emission parameters and Q is the number of free transition parameters. It divides the input sequence into T MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaWaaOaaaeaacqWGubavaSqabaaaaa@2D21@ blocks of T MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaWaaOaaaeaacqWGubavaSqabaaaaa@2D21@ symbols each, and, during the forward pass, retains the first column of the forward probability table for each block. When the reverse sweep starts, the forward values for each block are sequentially re-evaluated, beginning with their corresponding checkpoints, to update the parameter estimates.

Further refinement to the algorithm, as described in [13] and amended here, has rendered the memory demands independent of the observed sequence length, with O(N(Q + ED)) memory usage and O(TNQ max (Q + ED)) running time, where D is the dimensionality of a vector that stores statistics on the emission PDF parameter estimates. Performance of the various algorithms is summarized in Table 1. In this work, we also present a modification of one of the key HMM algorithms, the Viterbi algorithm, improving the memory profile without affecting the execution time.

Table 1 The computational expense of different algorithm implementations running on HMM.

Methods and Results

HMM definition, EM learning and Viterbi decoding

The following parameters describe the conventional HMM implementation according to [14]:

  • A set of states S = {S1,..., S N } with q t being the state visited at time t,

  • A set of PDFs B = {b1(o),..., b N (o)}, describing the emission probabilities b j (o t ) = p(o t |q t = S j ) for 1 ≤ jN, where o t is the observation at time-point t from the sequence of observations O = {o1,..., o T },

  • The state-transition probability matrix A = {ai, j} for 1 ≤ i, jN, where ai, j= p(qt + 1= S j |q t = S i ),

  • The initial state distribution vector Π = {π1,..., π N }.

A set of parameters λ = (Π, A, B) completely specifies an HMM. Here we describe the HMM parameter update rules for the EM learning algorithm rigorously derived in [15]. The Viterbi algorithm, as shown in Table 2, is a dynamic programming algorithm that runs an HMM to find the most likely sequence of hidden states, called the Viterbi path, that result in an observed sequence. When training the HMM using the Baum-Welch algorithm (an Expectation Maximization procedure), first we need to find the expected probabilities of being at a certain state at a certain time-point using the forward-backward procedure as shown in Table 2. The forward, backward, and Viterbi algorithms take O(TNQ max ) time to execute.

Table 2 The Viterbi decoding, forward and backward procedures.

Let us define ξ t (i, j) as the probability of being in state i at time t, and state j at time t + 1, given the model and the observation sequence

ξ t ( i , j ) = p ( q t = S i , q t + 1 = S j | O , λ ) = α t ( i ) a i , j b j ( o t + 1 ) β t + 1 ( j ) p ( O | λ ) , MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGaeqOVdG3aaSbaaSqaaiabdsha0bqabaGccqGGOaakcqWGPbqAcqGGSaalcqWGQbGAcqGGPaqkcqGH9aqpcqWGWbaCcqGGOaakcqWGXbqCdaWgaaWcbaGaemiDaqhabeaakiabg2da9iabdofatnaaBaaaleaacqWGPbqAaeqaaOGaeiilaWIaemyCae3aaSbaaSqaaiabdsha0jabgUcaRiabigdaXaqabaGccqGH9aqpcqWGtbWudaWgaaWcbaGaemOAaOgabeaakiabcYha8jabd+eapjabcYcaSiabeU7aSjabcMcaPiabg2da9KqbaoaalaaabaGaeqySde2aaSbaaeaacqWG0baDaeqaaiabcIcaOiabdMgaPjabcMcaPiabdggaHnaaBaaabaGaemyAaKMaeiilaWIaemOAaOgabeaacqWGIbGydaWgaaqaaiabdQgaQbqabaGaeiikaGIaem4Ba82aaSbaaeaacqWG0baDcqGHRaWkcqaIXaqmaeqaaiabcMcaPiabek7aInaaBaaabaGaemiDaqNaey4kaSIaeGymaedabeaacqGGOaakcqWGQbGAcqGGPaqkaeaacqWGWbaCcqGGOaakcqWGpbWtcqGG8baFcqaH7oaBcqGGPaqkaaGccqGGSaalaaa@7539@

and γ t (i) as the probability of being in state i at time t, given the observation sequence and the model

γ t ( i ) = p ( q t = S i | O , λ ) = α t ( i ) β t ( i ) i = 1 N α t ( i ) β t ( i ) = j = 1 N ξ t ( i , j ) . MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGaeq4SdC2aaSbaaSqaaiabdsha0bqabaGccqGGOaakcqWGPbqAcqGGPaqkcqGH9aqpcqWGWbaCcqGGOaakcqWGXbqCdaWgaaWcbaGaemiDaqhabeaakiabg2da9iabdofatnaaBaaaleaacqWGPbqAaeqaaOGaeiiFaWNaem4ta8KaeiilaWIaeq4UdWMaeiykaKIaeyypa0tcfa4aaSaaaeaacqaHXoqydaWgaaqaaiabdsha0bqabaGaeiikaGIaemyAaKMaeiykaKIaeqOSdi2aaSbaaeaacqWG0baDaeqaaiabcIcaOiabdMgaPjabcMcaPaqaamaaqadabaGaeqySde2aaSbaaeaacqWG0baDaeqaaiabcIcaOiabdMgaPjabcMcaPiabek7aInaaBaaabaGaemiDaqhabeaacqGGOaakcqWGPbqAcqGGPaqkaeaacqWGPbqAcqGH9aqpcqaIXaqmaeaacqWGobGtaiabggHiLdaaaOGaeyypa0ZaaabCaeaacqaH+oaEdaWgaaWcbaGaemiDaqhabeaakiabcIcaOiabdMgaPjabcYcaSiabdQgaQjabcMcaPaWcbaGaemOAaOMaeyypa0JaeGymaedabaGaemOta4eaniabggHiLdGccqGGUaGlaaa@751E@

The HMM maximization step using these probabilities is shown in Table 3. The conventional EM procedure for HMM [14] takes O(TN) memory and O(TNQ max + T (Q + E)) processor time. An HMM containing empty internal states (see for example [3]) and Hierarchical HMM (HHMM) could be converted into canonical HMM form through stack transformation as discussed in [16].

Table 3 The maximization step in HMM learning. states.

Forward sweep strategy explained

Figure 1 outlines initial, simple transition probability calculations for all possible paths through a "toy" HMM. In Figure 1, to estimate the probability of transition from state 1 to state 2 (1 → 2), we calculate the probability of transition utilization at time intervals 1–2 and 2–3 as:

Figure 1
figure 1

Time trellis for simple model where possible emissions of 0 and 1 are shown above and below trellis. Probabilities of emissions that happen after each transition are shown in bold and transitions of interest taken at certain time-point are underlined.

p(Making transition 1 → 2 at time 1–2) = α1(1) × a1,2 × b2(o2) × β2(2), p(Making transition 1 → 2 at time 2–3) = α2(1) × a1,2 × b2(o3) × β3(2).

In particular, to estimate the probability of transition 1 → 2 at time interval 1–2 we calculate the sum of probabilities of all possible paths that lead to state 1 at time-point 1 (forward probability α1(1)). Then we multiply this probability of being in state 1 by the transition a1,2 and emission b2(o2) probabilities.

Further multiplication by the sum of probabilities of all possible paths from state 2 at time 2 until the end of the emission sequence (backward probability β2(2)), is the expected probability of transition use. The sum of these estimates at time-points 1–2 and 2–3 is equivalent to the transition probability estimate in Table 3 (prior to normalization).

According to [13] ti, j(t, m) is the weighted sum of probabilities of all possible state paths that emit subsequence o1,..., o t and finish in state m, taking an ij transition at least once where the weight of each state path is the number of ij transitions taken. Processing of the entire ti, j(t, m) recurrence takes memory proportional to O(NQ) and processor time O(TNQQ max ). Initially we have ti, j(1, m) = 0 since no transitions have been made. After initialization, we perform the following recurrence operations:

t i , j ( t , m ) = α t 1 ( i ) a i , m b m ( o t ) δ ( m = j ) MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGaemiDaq3aaSbaaSqaaiabdMgaPjabcYcaSiabdQgaQbqabaGccqGGOaakcqWG0baDcqGGSaalcqWGTbqBcqGGPaqkcqGH9aqpcqaHXoqydaWgaaWcbaGaemiDaqNaeyOeI0IaeGymaedabeaakiabcIcaOiabdMgaPjabcMcaPiabdggaHnaaBaaaleaacqWGPbqAcqGGSaalcqWGTbqBaeqaaOGaemOyai2aaSbaaSqaaiabd2gaTbqabaGccqGGOaakcqWGVbWBdaWgaaWcbaGaemiDaqhabeaakiabcMcaPiabes7aKjabcIcaOiabd2gaTjabg2da9iabdQgaQjabcMcaPaaa@53E0@
+ n = 1 N t i , j ( t 1 , n ) a n , m b m ( o t ) , MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGaey4kaSYaaabCaeaacqWG0baDdaWgaaWcbaGaemyAaKMaeiilaWIaemOAaOgabeaakiabcIcaOiabdsha0jabgkHiTiabigdaXiabcYcaSiabd6gaUjabcMcaPiabdggaHnaaBaaaleaacqWGUbGBcqGGSaalcqWGTbqBaeqaaOGaemOyai2aaSbaaSqaaiabd2gaTbqabaGccqGGOaakcqWGVbWBdaWgaaWcbaGaemiDaqhabeaakiabcMcaPaWcbaGaemOBa4Maeyypa0JaeGymaedabaGaemOta4eaniabggHiLdGccqGGSaalaaa@4E04@

where δ ( m = j ) = { 1 , if m = j 0 , otherwise MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGaeqiTdqMaeiikaGIaemyBa0Maeyypa0JaemOAaOMaeiykaKIaeyypa0ZaaiqaaeaafaqaaeGacaaabaGaeGymaeJaeiilaWcabaqbaeqabeGaaaqaaiabbMgaPjabbAgaMbqaaiabd2gaTjabg2da9iabdQgaQbaaaeaacqaIWaamcqGGSaalaeaacqqGVbWBcqqG0baDcqqGObaAcqqGLbqzcqqGYbGCcqqG3bWDcqqGPbqAcqqGZbWCcqqGLbqzaaaacaGL7baaaaa@4BFF@ . Following equation (1), at a certain time-point t we need to score the evidence supporting transition between nodes i and j, which is the sum of probabilities of all possible state paths that emit subsequence o1,..., ot-1and finish in state i (forward probability αt-1(i)), multiplied by transition ai, jand emission b j (o t ) probabilities upon arrival to o t . We extend weighted paths containing evidence of ij transitions made at previous time-points 1,..., t - 1 further down the trellis in subequation (4). Finally, by the end of the recurrence we marginalize the final state m out of probability ti, j(T, m) to get a weighted sum of state paths taking transition ij at various time-points that is equivalent to the numerator in the transition probability estimate shown in Table 3. Thus, we estimate transition utilization using:

t i , j E N D = m = 1 N t i , j ( T , m ) , a i , j = t i , j E N D j o u t ( S i ) t i , j E N D , MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaqbaeqabeGaaaqaaiabdsha0naaDaaaleaacqWGPbqAcqGGSaalcqWGQbGAaeaacqWGfbqrcqWGobGtcqWGebaraaGccqGH9aqpdaaeWbqaaiabdsha0naaBaaaleaacqWGPbqAcqGGSaalcqWGQbGAaeqaaOGaeiikaGIaemivaqLaeiilaWIaemyBa0MaeiykaKIaeiilaWcaleaacqWGTbqBcqGH9aqpcqaIXaqmaeaacqWGobGta0GaeyyeIuoaaOqaaiabdggaHnaaBaaaleaacqWGPbqAcqGGSaalcqWGQbGAaeqaaOGaeyypa0tcfa4aaSaaaeaacqWG0baDdaqhaaqaaiabdMgaPjabcYcaSiabdQgaQbqaaiabdweafjabd6eaojabdseaebaaaeaadaaeqaqaaiabdsha0naaDaaabaGaemyAaKMaeiilaWIaemOAaOgabaGaemyrauKaemOta4KaemiraqeaaaqaaiabdQgaQjabgIGiolabd+gaVjabdwha1jabdsha0jabcIcaOiabdofatnaaBaaabaGaemyAaKgabeaacqGGPaqkaeqacqGHris5aaaakiabcYcaSaaaaaa@6DB1@

where out(S i ) is the set of nodes connected by edges from S i .

According to [13] e i (γ, t, m) is the weighted sum of probabilities of all possible state paths that emit subsequence o1,..., o t and finish in state m, for which state i emits observation γ at least once where the weight of each state path is the number of γ emissions that it makes from state i.

The following algorithm updates parameters for the set of discrete symbol probability distributions.

Initialization step e i (γ, 1, m) = α1(m) δ (i = m) δ (γ = o1). After initialization we make the recurrence steps, where we correct the emission recurrence presented in [13] [see Additional File 1]:

e i ( γ , t , m ) = α t ( m ) δ ( i = m ) δ ( γ = o t ) MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGaemyzau2aaSbaaSqaaiabdMgaPbqabaGccqGGOaakcqaHZoWzcqGGSaalcqWG0baDcqGGSaalcqWGTbqBcqGGPaqkcqGH9aqpcqaHXoqydaWgaaWcbaGaemiDaqhabeaakiabcIcaOiabd2gaTjabcMcaPiabes7aKjabcIcaOiabdMgaPjabg2da9iabd2gaTjabcMcaPiabes7aKjabcIcaOiabeo7aNjabg2da9iabd+gaVnaaBaaaleaacqWG0baDaeqaaOGaeiykaKcaaa@4E82@
= n = 1 N e i ( γ , t 1 , n ) a n , m b m ( o t ) . MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGaeyypa0ZaaabCaeaacqWGLbqzdaWgaaWcbaGaemyAaKgabeaakiabcIcaOiabeo7aNjabcYcaSiabdsha0jabgkHiTiabigdaXiabcYcaSiabd6gaUjabcMcaPiabdggaHnaaBaaaleaacqWGUbGBcqGGSaalcqWGTbqBaeqaaOGaemOyai2aaSbaaSqaaiabd2gaTbqabaGccqGGOaakcqWGVbWBdaWgaaWcbaGaemiDaqhabeaakiabcMcaPaWcbaGaemOBa4Maeyypa0JaeGymaedabaGaemOta4eaniabggHiLdGccqGGUaGlaaa@4E58@

Finally, by the end of the recurrence we marginalize the final state m out of e i (γ, T, m) and estimate the emission parameters through normalization

e i E N D ( γ ) = m = 1 N e i ( γ , T , m ) , b ^ j ( γ ) = e j E N D ( γ ) γ = 1 D e j E N D ( γ ) . MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaqbaeqabeGaaaqaaiabdwgaLnaaDaaaleaacqWGPbqAaeaacqWGfbqrcqWGobGtcqWGebaraaGccqGGOaakcqaHZoWzcqGGPaqkcqGH9aqpdaaeWbqaaiabdwgaLnaaBaaaleaacqWGPbqAaeqaaOGaeiikaGIaeq4SdCMaeiilaWIaemivaqLaeiilaWIaemyBa0MaeiykaKIaeiilaWcaleaacqWGTbqBcqGH9aqpcqaIXaqmaeaacqWGobGta0GaeyyeIuoaaOqaaiqbdkgaIzaajaWaaSbaaSqaaiabdQgaQbqabaGccqGGOaakcqaHZoWzcqGGPaqkcqGH9aqpjuaGdaWcaaqaaiabdwgaLnaaDaaabaGaemOAaOgabaGaemyrauKaemOta4KaemiraqeaaiabcIcaOiabeo7aNjabcMcaPaqaamaaqadabaGaemyzau2aa0baaeaacqWGQbGAaeaacqWGfbqrcqWGobGtcqWGebaraaGaeiikaGIaeq4SdCMaeiykaKcabaGaeq4SdCMaeyypa0JaeGymaedabaGaemiraqeacqGHris5aaaakiabc6caUaaaaaa@6B53@

The algorithm for discrete emission parameters estimate B = {b1(o),..., b N (o)} takes in O(NED) memory and O(TNEDQ max ) time. As discussed [see Subsection HMM definition, EM learning and Viterbi decoding] the forward sweep takes O(TNQ max ) time, where only the values of αt-1(i) for 1 ≤ iN are needed to evaluate αt(i), thus reducing the memory requirement to O(N) for the forward algorithm. Computing e i (γ, t, m) takes O(NED) previous probabilities of e i (γ, t - 1, m) for 1 ≤ mN, 1 ≤ iE, 1≤ γD. Recurrent updating of each e i (γ, t, m) probability elements takes O(Q max ) summations, totalling O(TNEDQ max ).

Theorem 1 e i (γ, t, m) is the weighted sum of probabilities of all possible state paths that emit subsequence o1,..., o t and finish in state m, for which state i emits observation γ at least once.


Initialization The only chance for a path at a time-point 1 to emit symbol γ at least once from state i and finish in state m is α1(m) in case i = m and γ = o1. Therefore, initialization conditions satisfy definition of e i (γ, t, m).

Induction Suppose e i (γ, t - 1, m) represents correct weighted sum of probabilities of all possible state paths that emit subsequence o1,..., ot-1and finish in state m, for which state i emits observation γ at least once. We need to prove the above holds for time-point t. Following equation (1) in recurrence part (5) we score the evidence of symbol o t emission from state i at time-point t, which will be further propagated down the trellis in recurrence part (6). Chances of such event is α t (m), i.e. sum of probabilities of all possible state paths finishing in state m at time-point t under conditions i = m and γ = o t . The second part of the recurrence (6) extends the weighted paths containing evidence of γ symbol emission from state i at previous time-points 1,..., t - 1 and finishing in state n further down the trellis through available transitions an, m. Thus the definition of e i (γ, t, m) holds true for the time-point t.

At the end of recurrence we marginalize the final state m out of probability e i (γ, T, m) to get the weighted sum of all state paths making observation γ in state i at various time-points equivalent to the numerator of the discrete emission parameter estimate in Table 3, which is a weighted sum of all possible paths that score emissions evidence at certain time-points. By normalizing these scores we estimate the emission parameters.

The forward sweep strategy was originally formulated in [13] for HMMs with silent Start/End states, and automatically handles the prior probabilities estimates for the states as standard transitions connecting Start with other non-silent states. The prior transition estimates aStart, iare naturally involved within recurrent updates of ti, j(t, m), which takes an additional O(N2) memory if all N non-silent states have non-zero priors with time cost O(TN2Q max ). In order to compute the prior estimates in the conventional HMM formulation we need to know the backward probability at time-point 1, which requires calculation of the entire backward table. Therefore, in the next section we present a linear memory Baum-Welch algorithm modification built around a backward sweep with scaling, which only involves calculation of α1(i) for 1 ≤ iN to estimate priors in O(N) time and O(N) memory.

Linear memory Baum-Welch using a backward sweep with scaling

The objective of the algorithm presented in this section is equivalent to that discussed previously [see Section Forward sweep strategy explained] based on forward probabilities of state occupation. However, by using the backward probabilities of state occupation we are able to estimate initial HMM state probabilities much more quickly. In the description that follows we introduce a new set of probabilities:

E i (γ, t, m) – the weighted sum of probabilities of all possible state paths that emit subsequence o t ,..., o T and finish in state m, for which state i emits observation γ at least once, where the weight of each state path is the number of γ emissions that it makes from state i.

Ti, j(t, m) – the weighted sum of probabilities of all possible state paths that emit subsequence o t ,..., o T and finish in state m, taking ij transition at least once, where the weight of each state path is the number of ij transitions that it takes.

All calculations are based on backward probability β t (i), but there is inevitably insufficient precision to directly represent these values for significantly long emission sequences. Therefore we scale the backward probability during the recursion.

The linear memory Baum-Welch implementation is shown in Figure 2, where E MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaWenfgDOvwBHrxAJfwnHbqeg0uy0HwzTfgDPnwy1aaceaGae8hmHueaaa@36AD@ is a set of nodes with free emission parameters and T MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaWenfgDOvwBHrxAJfwnHbqeg0uy0HwzTfgDPnwy1aaceaGae83eXtfaaa@376F@ is a set of nodes with free emanating transitions. Scaling relationships used at every iteration are rigorously proven [see Appendix A]. An alternative to scaling is relation (7) presented in [17] showing how to efficiently add log probabilities

Figure 2
figure 2

The linear memory implementation of Baum-Welch learning algorithm for HMM. This algorithm takes set of HMM parameters λ and sequence of symbols O. Expected HMM parameters are calculated according to formulas [see Subsection Parameters update].

log ( t = 0 N 1 p i ) = log p 0 + log ( 1 + i = 1 N 1 e log p i log p 0 ) . MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGagiiBaWMaei4Ba8Maei4zaC2aaeWaaeaadaaeWbqaaiabdchaWnaaBaaaleaacqWGPbqAaeqaaaqaaiabdsha0jabg2da9iabicdaWaqaaiabd6eaojabgkHiTiabigdaXaqdcqGHris5aaGccaGLOaGaayzkaaGaeyypa0JagiiBaWMaei4Ba8Maei4zaCMaemiCaa3aaSbaaSqaaiabicdaWaqabaGccqGHRaWkcyGGSbaBcqGGVbWBcqGGNbWzdaqadaqaaiabigdaXiabgUcaRmaaqahabaGaemyzau2aaWbaaSqabeaacyGGSbaBcqGGVbWBcqGGNbWzcqWGWbaCdaWgaaadbaGaemyAaKgabeaaliabgkHiTiGbcYgaSjabc+gaVjabcEgaNjabdchaWnaaBaaameaacqaIWaamaeqaaaaaaSqaaiabdMgaPjabg2da9iabigdaXaqaaiabd6eaojabgkHiTiabigdaXaqdcqGHris5aaGccaGLOaGaayzkaaGaeiOla4caaa@671A@

The scoring functions used for the emissions updates are shown in Table 4. With discrete emission we sum all the backward probabilities of state occupation given discrete symbol emission at certain time-points and later we normalize these counts in (8). In the case of a normally distributed continuous PDF we sum emissions and emission deviation from state i mean squared. These sums are scaled by probability of state occupation. We use these counts to estimate the emission mean (9) and variance (10) for a given state.

Table 4 The scoring functions for discrete and continuous emissions.

Parameters update

We estimate the initial probability according to equations presented in Table 3, where D1 is defined in Appendix A

π ^ i = α 1 ( i ) β ˜ i ( 1 ) i = 1 N α 1 ( i ) β ˜ i ( 1 ) = α 1 ( i ) D 1 β 1 ( i ) i = 1 N α 1 ( i ) D 1 β 1 ( i ) = α 1 ( i ) β 1 ( i ) i = 1 N α 1 ( i ) β 1 ( i ) . MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGafqiWdaNbaKaadaWgaaWcbaGaemyAaKgabeaakiabg2da9KqbaoaalaaabaGaeqySde2aaSbaaeaacqaIXaqmaeqaaiabcIcaOiabdMgaPjabcMcaPiqbek7aIzaaiaWaaSbaaeaacqWGPbqAaeqaaiabcIcaOiabigdaXiabcMcaPaqaamaaqadabaGaeqySde2aaSbaaeaacqaIXaqmaeqaaiabcIcaOiabdMgaPjabcMcaPaqaaiabdMgaPjabg2da9iabigdaXaqaaiabd6eaobGaeyyeIuoacuaHYoGygaacamaaBaaabaGaemyAaKgabeaacqGGOaakcqaIXaqmcqGGPaqkaaGccqGH9aqpjuaGdaWcaaqaaiabeg7aHnaaBaaabaGaeGymaedabeaacqGGOaakcqWGPbqAcqGGPaqkcqWGebardaWgaaqaaiabigdaXaqabaGaeqOSdi2aaSbaaeaacqaIXaqmaeqaaiabcIcaOiabdMgaPjabcMcaPaqaamaaqadabaGaeqySde2aaSbaaeaacqaIXaqmaeqaaiabcIcaOiabdMgaPjabcMcaPiabdseaenaaBaaabaGaeGymaedabeaaaeaacqWGPbqAcqGH9aqpcqaIXaqmaeaacqWGobGtaiabggHiLdGaeqOSdi2aaSbaaeaacqaIXaqmaeqaaiabcIcaOiabdMgaPjabcMcaPaaakiabg2da9KqbaoaalaaabaGaeqySde2aaSbaaeaacqaIXaqmaeqaaiabcIcaOiabdMgaPjabcMcaPiabek7aInaaBaaabaGaeGymaedabeaacqGGOaakcqWGPbqAcqGGPaqkaeaadaaeWaqaaiabeg7aHnaaBaaabaGaeGymaedabeaacqGGOaakcqWGPbqAcqGGPaqkaeaacqWGPbqAcqGH9aqpcqaIXaqmaeaacqWGobGtaiabggHiLdGaeqOSdi2aaSbaaeaacqaIXaqmaeqaaiabcIcaOiabdMgaPjabcMcaPaaakiabc6caUaaa@91D9@

The emissions estimate for the discrete case are

b ^ j ( γ ) = E ˜ j E N D ( γ ) γ = 1 D E ˜ j E N D ( γ ) = D 1 E j E N D ( γ ) D 1 γ = 1 D E j E N D ( γ ) = E j E N D ( γ ) γ = 1 D E j E N D ( γ ) . MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGafmOyaiMbaKaadaWgaaWcbaGaemOAaOgabeaakiabcIcaOiabeo7aNjabcMcaPiabg2da9KqbaoaalaaabaGafmyrauKbaGaadaqhaaqaaiabdQgaQbqaaiabdweafjabd6eaojabdseaebaacqGGOaakcqaHZoWzcqGGPaqkaeaadaaeWaqaaiqbdweafzaaiaWaa0baaeaacqWGQbGAaeaacqWGfbqrcqWGobGtcqWGebaraaGaeiikaGIaeq4SdCMaeiykaKcabaGaeq4SdCMaeyypa0JaeGymaedabaGaemiraqeacqGHris5aaaakiabg2da9KqbaoaalaaabaGaemiraq0aaSbaaeaacqaIXaqmaeqaaiabdweafnaaDaaabaGaemOAaOgabaGaemyrauKaemOta4KaemiraqeaaiabcIcaOiabeo7aNjabcMcaPaqaaiabdseaenaaBaaabaGaeGymaedabeaadaaeWaqaaiabdweafnaaDaaabaGaemOAaOgabaGaemyrauKaemOta4KaemiraqeaaiabcIcaOiabeo7aNjabcMcaPaqaaiabeo7aNjabg2da9iabigdaXaqaaiabdseaebGaeyyeIuoaaaGccqGH9aqpjuaGdaWcaaqaaiabdweafnaaDaaabaGaemOAaOgabaGaemyrauKaemOta4KaemiraqeaaiabcIcaOiabeo7aNjabcMcaPaqaamaaqadabaGaemyrau0aa0baaeaacqWGQbGAaeaacqWGfbqrcqWGobGtcqWGebaraaGaeiikaGIaeq4SdCMaeiykaKcabaGaeq4SdCMaeyypa0JaeGymaedabaGaemiraqeacqGHris5aaaakiabc6caUaaa@87BC@

For normally distributed continuous observation PDF

b ^ j ( o ) μ = E ˜ j E N D ( 1 ) E ˜ j E N D ( 3 ) = D 1 E j E N D ( 1 ) D 1 E j E N D ( 3 ) = E j E N D ( 1 ) E j E N D ( 3 ) , MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGafmOyaiMbaKaadaWgaaWcbaGaemOAaOgabeaakiabcIcaOiabd+gaVjabcMcaPiabgkziUkabeY7aTjabg2da9KqbaoaalaaabaGafmyrauKbaGaadaqhaaqaaiabdQgaQbqaaiabdweafjabd6eaojabdseaebaacqGGOaakcqaIXaqmcqGGPaqkaeaacuWGfbqrgaacamaaDaaabaGaemOAaOgabaGaemyrauKaemOta4KaemiraqeaaiabcIcaOiabiodaZiabcMcaPaaakiabg2da9KqbaoaalaaabaGaemiraq0aaSbaaeaacqaIXaqmaeqaaiabdweafnaaDaaabaGaemOAaOgabaGaemyrauKaemOta4KaemiraqeaaiabcIcaOiabigdaXiabcMcaPaqaaiabdseaenaaBaaabaGaeGymaedabeaacqWGfbqrdaqhaaqaaiabdQgaQbqaaiabdweafjabd6eaojabdseaebaacqGGOaakcqaIZaWmcqGGPaqkaaGccqGH9aqpjuaGdaWcaaqaaiabdweafnaaDaaabaGaemOAaOgabaGaemyrauKaemOta4KaemiraqeaaiabcIcaOiabigdaXiabcMcaPaqaaiabdweafnaaDaaabaGaemOAaOgabaGaemyrauKaemOta4KaemiraqeaaiabcIcaOiabiodaZiabcMcaPaaacqGGSaalaaa@730B@
b ^ j ( o ) σ 2 = E ˜ j E N D ( 2 ) E ˜ j E N D ( 3 ) = D 1 E j E N D ( 2 ) D 1 E j E N D ( 3 ) = E j E N D ( 2 ) E j E N D ( 3 ) . MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGafmOyaiMbaKaadaWgaaWcbaGaemOAaOgabeaakiabcIcaOiabd+gaVjabcMcaPiabgkziUkabeo8aZnaaCaaaleqabaGaeGOmaidaaOGaeyypa0tcfa4aaSaaaeaacuWGfbqrgaacamaaDaaabaGaemOAaOgabaGaemyrauKaemOta4KaemiraqeaaiabcIcaOiabikdaYiabcMcaPaqaaiqbdweafzaaiaWaa0baaeaacqWGQbGAaeaacqWGfbqrcqWGobGtcqWGebaraaGaeiikaGIaeG4mamJaeiykaKcaaOGaeyypa0tcfa4aaSaaaeaacqWGebardaWgaaqaaiabigdaXaqabaGaemyrau0aa0baaeaacqWGQbGAaeaacqWGfbqrcqWGobGtcqWGebaraaGaeiikaGIaeGOmaiJaeiykaKcabaGaemiraq0aaSbaaeaacqaIXaqmaeqaaiabdweafnaaDaaabaGaemOAaOgabaGaemyrauKaemOta4KaemiraqeaaiabcIcaOiabiodaZiabcMcaPaaakiabg2da9KqbaoaalaaabaGaemyrau0aa0baaeaacqWGQbGAaeaacqWGfbqrcqWGobGtcqWGebaraaGaeiikaGIaeGOmaiJaeiykaKcabaGaemyrau0aa0baaeaacqWGQbGAaeaacqWGfbqrcqWGobGtcqWGebaraaGaeiikaGIaeG4mamJaeiykaKcaaiabc6caUaaa@744B@

The transition estimate is calculated as following

a i , j = T ˜ i , j E N D j o u t ( S i ) T ˜ i , j E N D = D 1 T i , j E N D D 1 j o u t ( S i ) T i , j E N D = T i , j E N D j o u t ( S i ) T i , j E N D  for  i T . MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGaemyyae2aaSbaaSqaaiabdMgaPjabcYcaSiabdQgaQbqabaGccqGH9aqpjuaGdaWcaaqaaiqbdsfauzaaiaWaa0baaeaacqWGPbqAcqGGSaalcqWGQbGAaeaacqWGfbqrcqWGobGtcqWGebaraaaabaWaaabeaeaacuWGubavgaacamaaDaaabaGaemyAaKMaeiilaWIaemOAaOgabaGaemyrauKaemOta4KaemiraqeaaaqaaiabdQgaQjabgIGiolabd+gaVjabdwha1jabdsha0jabcIcaOiabdofatnaaBaaabaGaemyAaKgabeaacqGGPaqkaeqacqGHris5aaaakiabg2da9KqbaoaalaaabaGaemiraq0aaSbaaeaacqaIXaqmaeqaaiabdsfaunaaDaaabaGaemyAaKMaeiilaWIaemOAaOgabaGaemyrauKaemOta4KaemiraqeaaaqaaiabdseaenaaBaaabaGaeGymaedabeaadaaeqaqaaiabdsfaunaaDaaabaGaemyAaKMaeiilaWIaemOAaOgabaGaemyrauKaemOta4KaemiraqeaaaqaaiabdQgaQjabgIGiolabd+gaVjabdwha1jabdsha0jabcIcaOiabdofatnaaBaaabaGaemyAaKgabeaacqGGPaqkaeqacqGHris5aaaacqGH9aqpdaWcaaqaaiabdsfaunaaDaaabaGaemyAaKMaeiilaWIaemOAaOgabaGaemyrauKaemOta4KaemiraqeaaaqaamaaqababaGaemivaq1aa0baaeaacqWGPbqAcqGGSaalcqWGQbGAaeaacqWGfbqrcqWGobGtcqWGebaraaaabaGaemOAaOMaeyicI4Saem4Ba8MaemyDauNaemiDaqNaeiikaGIaem4uam1aaSbaaeaacqWGPbqAaeqaaiabcMcaPaqabiabggHiLdaaaiabbccaGiabbAgaMjabb+gaVjabbkhaYjabbccaGiabdMgaPjabgIGioprtHrhAL1wy0L2yHvtyaeHbnfgDOvwBHrxAJfwnaGabaiab=nr8ujabc6caUaaa@A848@

Viterbi decoding in linear memory

In this section we describe results when using a "linear memory" modification of the original Viterbi algorithm that was introduced in [18] by Andrew J. Viterbi. As mentioned previously, the Viterbi algorithm is a dynamic programming algorithm for finding the most likely sequence of hidden states, called the "Viterbi path", corresponding to the sequence of observed events in the context of an HMM. Viterbi checkpointing implementation introduced in [11] divides input sequence into T MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaWaaOaaaeaacqWGubavaSqabaaaaa@2D21@ blocks of T MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaWaaOaaaeaacqWGubavaSqabaaaaa@2D21@ symbols each and during the first Viterbi pass keeps only the first column of the d table for each block. The reconstruction of the most probable state path starts with the last block, where we use the last checkpointing column to initialize recovery of the last T MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaWaaOaaaeaacqWGubavaSqabaaaaa@2D21@ states of the most likely state path and so on, until the entire sequence decoding is reconstructed. The algorithm requires memory space proportional to O ( N T + T ) MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGaem4ta8KaeiikaGIaemOta40aaOaaaeaacqWGubavaSqabaGccqGHRaWkcqWGubavcqGGPaqkaaa@333C@ and takes computational time O(TNQ max ) twice as much as conventional implementation would take because of multiple sweeps. Additional computations could be easily justified by the lower memory use, which in practice leads to substantial speedup.

Only two columns are needed for the δ array that stores maximum probability scores for a state at a given time-point for the algorithm to run (referring to the relationship shown in Table 2). We replace the array, needed to store the dynamic programming backtrack pointers, by a linked list. Our approximately linear memory implementation follows the observation that the backtrack paths typically converge to the most likely state path and travel all together to the beginning of the decoding table. Although the approximate linearity depends on model structure and emission sequence decoded, and is not guaranteed, this behavior is typical for the HMM topologies we use. The possibility of using O(N log(T)) space (in case we write to disk the most likely path before the coalescence point, i.e. the first state on the backtrack path where only a single candidate is left for the initial segment of the most probable state path) has only been rigorously proven for simple symmetric two-state HMM [19].

The modified Viterbi algorithm is shown in Figure 3. It runs in the same O(TNQ max ) time as a conventional Viterbi decoder, but takes the amount of memory O(T) as has been demonstrated by our simulations [see Section Computational performance].

Figure 3
figure 3

Viterbi algorithm implementation with linked list. Here ψ t + 1 ( q t + 1 ) p r e v MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGaeqiYdK3aaSbaaSqaaiabdsha0jabgUcaRiabigdaXaqabaGccqGGOaakcqWGXbqCdaqhaaWcbaGaemiDaqNaey4kaSIaeGymaedabaGaey4fIOcaaOGaeiykaKIaeyOKH4QaemiCaaNaemOCaiNaemyzauMaemODayhaaa@402D@ is reference to the previous node.

This approach has proven to be useful in decoding the explicit Duration Hidden Markov Model (DHMM) topology introduced in [6], as can be seen in Figure 4. Although this implementation closely follows the originally proposed non-parametric duration density [20] design, the advantage is that we no longer have to use highly modified Forward-Backward and Viterbi algorithms [21]. This topology directly corresponds to the Generalized Hidden Markov Model (GHMM) used in GENSCAN [22], one of the most accurate gene structure prediction tools. The potential for a very large number of nodes in our proposed topology is compensated by introducing the linear memory Viterbi and Baum-Welch implementations with unaltered running time O(STM), where M is the maximum duration of an aggregate state and S is the number of aggregate states. An example of backtracking path compression for such a topology with discrete emissions can be seen in Figure 5, where the most likely trail of states quickly combines with all the alternative trails. Another interesting topology used by our laboratory for "spike" detection is shown in Figure 6, where the spikes are modelled as a mixture of two trajectories interconnected with an underlying set of ionic flow blockade states. The Viterbi decoding trail for this topology, detecting two sequential spikes in samples from the real continuous ionic flow blockade, is shown in Figure 7. Again, the backtracks quickly converge to the most likely state sequence.

Figure 4
figure 4

Explicit DHMM topology. Here the first aggregate state emits 0 with probability 0.75 and 1 with probability 0.25 and the second aggregate state emits 0 with probability 0.25 and 1 with probability 0.75. Duration histograms are shown for each aggregate state.

Figure 5
figure 5

Explicit Duration HMM trellis for the observation string shown below. The most likely sequence of states for the observation string shown below is shaded. The lightly grayed states will be deallocated by garbage collector.

Figure 6
figure 6

Spike detection loop topology.

Figure 7
figure 7

Trellis for loopy topology used for spikes detection where shallow spike (states 1–6) and deep spike (states 7–17) are consequently decoded. The most likely sequence of states for the sequence of observed ionic flow current blockades (in pA) shown below is shaded. The lightly grayed states will be deallocated by garbage collector.

Our particular implementation relies on the Java Garbage Collector (GC), which periodically deletes all the linked list nodes allocated in the heap that are no longer referenced by the active program stack as shaded in light gray color in Figures 5 and 7. On multiple core machines the GC could run concurrently with the main computational thread thus not obstructing execution of the method. In the lower level languages (like C/C++) "smart pointers" could be used to deallocate unused links when their reference count drops to zero, which is in some ways even more efficient than Java's garbage collection.

Computational performance

We conducted experiments on the HMM topologies mentioned above [see Section Viterbi decoding in linear memory] with both artificial and real data sets, and evaluated performance of the various implementations of the Viterbi and EM learning algorithms. We describe the performance of the Java Virtual Machive (JVM) after the HotSpot Java byte code optimizer burns-in, i.e. after it optimizes both memory use and execution time within EM cycles. The linear memory, checkpointing and conventional algorithm implementations are thereby streamlined to avoid an unfair comparison due to obvious performance bottlenecks.

For the DHMM topology shown in Figure 4 we have chosen to systematically alter the size of two aggregate states from 50 to 500 when learning on an artificially generated sequence of 10,000 discrete symbols to see how the number of free learning parameters affects the performance of the EM learning algorithms. In Subfigures 8(a)8(c) we see that the running time of the linear implementation grows dramatically compared to conventional and checkpointing implementations, making it a very slow alternative for such a scenario. Although the linear implementation memory profile is low, as expected, for high values of D, the checkpointing algorithm claims the least memory. This is because the table sizes in the linear memory EM implementation grow quickly as the number of states and transitions in the model increases. Garbage collection for large D is the lowest for the checkpointing EM compared to other implementations.

Figure 8
figure 8

In subfigures 8(a) – 8(c) performance of Baum-Welch used on DHMM topology with two aggregate states of various maximum duration D . In subfigures 8(d) – 8(f) performance of Baum-Welch algorithm used on spike topology for various ionic flow durations is shown.

In experiments on EM learning on a spike detection HMM topology, shown in Figure 6, we have systematically varied the ionic flow duration from 1,000 ms to 64,000 ms. Although in Subfigures 8(d)8(f) duration of the time cycle of the linear memory implementation is not so inflated in this situation, it is still many times higher than for conventional and checkpointing algorithms. Please note that conventional and checkpointing algorithms spend almost identical time per cycle. The conventional implementation still takes the largest amount of memory and once again checkpointing takes less memory compared to the linear memory implementation in the case of long signal duration. Garbage collection in the case of the conventional implementation starts taking a substantial fraction of the CPU time for maximum signal duration, which advocates against using the conventional implementation for the analysis of long signals.

Theoretically, for the linear memory algorithm to run faster than checkpointing alternative the following condition should hold true 2TNQ max + T (Q + E) > TNQ max (Q + ED) which reduces to the condition, unrealistic for any practical model 2 > Q + ED. Thus, the linear memory algorithm will always run slower than checkpointing. The memory condition for the linear memory EM algorithm implementation to run in less space is 2N T MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaWaaOaaaeaacqWGubavaSqabaaaaa@2D21@ > N(Q + ED), which reduces to 2 T MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaWaaOaaaeaacqWGubavaSqabaaaaa@2D21@ > Q + ED condition – quite realistic for sufficiently large values of T. The memory advantage is the key attribute since efforts are underway (Z. Jiang and S. Winters-Hilt) to perform segmented linear HMM processing on a distributed platform, where the speed deficiency is offset by M-fold speedup when using M nodes.

In both test scenarios shown in Figure 8 we see that conventional implementation of Baum-Welch aggressively claims very large amounts of heap space, even for modestly sized problems (in some applications, such as the JAHMM package [23], it allocates the probabilistic table ξ of size N2 × T, although we do it in N × T through progressive summation of forward-backward tables), where both modified EM algorithm implementations have a very compact memory signature. An HMM algorithm implemented based on forward sweep strategy with silent Start/End states runs slower and takes more memory compared to the proposed backward sweep strategy in case of DHMM topology. This is because prior probabilities of the states are estimated as regular transitions from the Start state, thus substantially increasing ti, j(t, m) table size and time required for a recurrent step.

In Tables 5 and 6 we list the ratio of memory used by the linked list nodes referenced from the active program stack to the sequence length T. As could be seen, it quickly becomes proportional to 1.0 in both spike detection and the explicit DHMM topologies as the decoded sequence length grows.

Table 5 Memory use for Viterbi decoding on spike topology with loop sizes 6 and 11.
Table 6 Memory use for Viterbi decoding on explicit DHMM with D = 60 and two aggregate

Discussion and Conclusion

We have discussed implementation of Baum-Welch and Viterbi algorithms in linear memory. We successfully used these implementations in analysis of nanopore blockade signals with very large sample sizes (more than 3,000 ms) where the main limitation becomes processing time rather than memory overflow. We are currently working on efficient distributed implementations of the linear memory algorithms to facilitate quicker, potentially "real-time" applications.

In both of the test scenarios considered, the linear memory modification of the EM algorithm takes significantly more time to execute compared to conventional and checkpointing implementations. Despite being the fastest in many realistic scenarios, the conventional implementation of the EM learning algorithm suffers from substantial memory use. The checkpointing algorithm alleviates both of these extremes, sometimes running even faster than the conventional algorithm due to lower memory management overhead. The checkpointing algorithm seems to provide an excellent tradeoff between memory use and speed. We are trying to understand if the running time of our linear memory EM algorithm implementation can be constrained in a way similar to the checkpointing algorithm. A demo program featuring the canonical, checkpointing and linear memory implementations of the EM learning and the Viterbi decoding algorithms is available on our web site (see Availability & requirements section below).

Availability & requirements

University of New Orleans bioinformatics group:


Appendix A – Proofs of scaling relationships

The scaling steps in Figure 2 need additional rigorous justification. Our proofs are partially inspired by recurrences presented in [24] with further clarifications.

Theorem 2 β ˜ MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGafqOSdiMbaGaaaaa@2D85@ t (m) = d t β ¯ MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGafqOSdiMbaebaaaa@2D8E@ t (m)

Proof Let us define D t = 1 i = 1 N β t ( i ) , d t = 1 i = 1 N β t ( i ) , β ˜ t ( m ) = D t β t ( m ) MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGaemiraq0aaSbaaSqaaiabdsha0bqabaGccqGH9aqpjuaGdaWcaaqaaiabigdaXaqaamaaqadabaGaeqOSdi2aaSbaaeaacqWG0baDaeqaaiabcIcaOiabdMgaPjabcMcaPaqaaiabdMgaPjabg2da9iabigdaXaqaaiabd6eaobGaeyyeIuoaaaGccqGGSaalcqWGKbazdaWgaaWcbaGaemiDaqhabeaakiabg2da9KqbaoaalaaabaGaeGymaedabaWaaabmaeaacqaHYoGydaWgaaqaaiabdsha0bqabaGaeiikaGIaemyAaKMaeiykaKcabaGaemyAaKMaeyypa0JaeGymaedabaGaemOta4eacqGHris5aaaakiabcYcaSiqbek7aIzaaiaWaaSbaaSqaaiabdsha0bqabaGccqGGOaakcqWGTbqBcqGGPaqkcqGH9aqpcqWGebardaWgaaWcbaGaemiDaqhabeaakiabek7aInaaBaaaleaacqWG0baDaeqaaOGaeiikaGIaemyBa0MaeiykaKcaaa@6296@ ,

β ¯ t ( m ) = j = 1 N a m , j b j ( o t + 1 ) β ˜ t + 1 ( j ) = D t + 1 j = 1 N a m , j b j ( o t + 1 ) β t + 1 ( j ) = D t + 1 β t ( m ) , d t = 1 i = 1 N β ¯ t ( i ) = 1 D t + 1 i = 1 N β t ( i ) , β ˜ t ( m ) = d t β ¯ t ( m ) = d t D t + 1 β t ( m ) = 1 D t + 1 i = 1 N β t ( i ) D t + 1 β t ( m ) = D t β t ( m ) . MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaqbaeWabuqaaaaabaGafqOSdiMbaebadaWgaaWcbaGaemiDaqhabeaakiabcIcaOiabd2gaTjabcMcaPiabg2da9maaqahabaGaemyyae2aaSbaaSqaaiabd2gaTjabcYcaSiabdQgaQbqabaGccqWGIbGydaWgaaWcbaGaemOAaOgabeaakiabcIcaOiabd+gaVnaaBaaaleaacqWG0baDcqGHRaWkcqaIXaqmaeqaaOGaeiykaKIafqOSdiMbaGaadaWgaaWcbaGaemiDaqNaey4kaSIaeGymaedabeaakiabcIcaOiabdQgaQjabcMcaPaWcbaGaemOAaOMaeyypa0JaeGymaedabaGaemOta4eaniabggHiLdaakeaacqGH9aqpcqWGebardaWgaaWcbaGaemiDaqNaey4kaSIaeGymaedabeaakmaaqahabaGaemyyae2aaSbaaSqaaiabd2gaTjabcYcaSiabdQgaQbqabaGccqWGIbGydaWgaaWcbaGaemOAaOgabeaakiabcIcaOiabd+gaVnaaBaaaleaacqWG0baDcqGHRaWkcqaIXaqmaeqaaOGaeiykaKIaeqOSdi2aaSbaaSqaaiabdsha0jabgUcaRiabigdaXaqabaGccqGGOaakcqWGQbGAcqGGPaqkaSqaaiabdQgaQjabg2da9iabigdaXaqaaiabd6eaobqdcqGHris5aaGcbaGaeyypa0Jaemiraq0aaSbaaSqaaiabdsha0jabgUcaRiabigdaXaqabaGccqaHYoGydaWgaaWcbaGaemiDaqhabeaakiabcIcaOiabd2gaTjabcMcaPiabcYcaSaqaaiabdsgaKnaaBaaaleaacqWG0baDaeqaaOGaeyypa0tcfa4aaSaaaeaacqaIXaqmaeaadaaeWaqaaiqbek7aIzaaraWaaSbaaeaacqWG0baDaeqaaiabcIcaOiabdMgaPjabcMcaPaqaaiabdMgaPjabg2da9iabigdaXaqaaiabd6eaobGaeyyeIuoaaaGaeyypa0ZaaSaaaeaacqaIXaqmaeaacqWGebardaWgaaqaaiabdsha0jabgUcaRiabigdaXaqabaWaaabmaeaacqaHYoGydaWgaaqaaiabdsha0bqabaGaeiikaGIaemyAaKMaeiykaKcabaGaemyAaKMaeyypa0JaeGymaedabaGaemOta4eacqGHris5aaaacqGGSaalaOqaaiqbek7aIzaaiaWaaSbaaSqaaiabdsha0bqabaGccqGGOaakcqWGTbqBcqGGPaqkcqGH9aqpcqWGKbazdaWgaaWcbaGaemiDaqhabeaakiqbek7aIzaaraWaaSbaaSqaaiabdsha0bqabaGccqGGOaakcqWGTbqBcqGGPaqkcqGH9aqpcqWGKbazdaWgaaWcbaGaemiDaqhabeaakiabdseaenaaBaaaleaacqWG0baDcqGHRaWkcqaIXaqmaeqaaOGaeqOSdi2aaSbaaSqaaiabdsha0bqabaGccqGGOaakcqWGTbqBcqGGPaqkcqGH9aqpjuaGdaWcaaqaaiabigdaXaqaaiabdseaenaaBaaabaGaemiDaqNaey4kaSIaeGymaedabeaadaaeWaqaaiabek7aInaaBaaabaGaemiDaqhabeaacqGGOaakcqWGPbqAcqGGPaqkaeaacqWGPbqAcqGH9aqpcqaIXaqmaeaacqWGobGtaiabggHiLdaaaiabdseaenaaBaaabaGaemiDaqNaey4kaSIaeGymaedabeaacqaHYoGydaWgaaqaaiabdsha0bqabaGaeiikaGIaemyBa0MaeiykaKIaeyypa0Jaemiraq0aaSbaaeaacqWG0baDaeqaaiabek7aInaaBaaabaGaemiDaqhabeaacqGGOaakcqWGTbqBcqGGPaqkcqGGUaGlaaaaaa@F067@

Here we observe useful relationships D t = d t Dt + 1and β ¯ MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGafqOSdiMbaebaaaa@2D8E@ t (m) = Dt + 1β t (m) necessary in follow-up proves.   □

Theorem 3 T ˜ MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGafmivaqLbaGaaaaa@2D15@ i, j(t, m) = d t T ¯ MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGafmivaqLbaebaaaa@2D1E@ i, j(t, m)

Proof Let us define T ˜ MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGafmivaqLbaGaaaaa@2D15@ i, j(t, m) = D t Ti, j(t, m),

T ˜ i , j ( t , m ) = d t T ¯ i , j ( t , m ) = d t [ β ˜ t + 1 ( j ) a m , j b j ( o t + 1 ) δ ( i = m ) + n = 1 N a m , n T ˜ i , j ( t + 1 , n ) b n ( o t + 1 ) ] = d t [ D t + 1 β t + 1 ( j ) a m , j b j ( o t + 1 ) δ ( i = m ) + D t + 1 n = 1 N a m , n T i , j ( t + 1 , n ) b n ( o t + 1 ) ] = d t D t + 1 [ β t + 1 ( j ) a m , j b j ( o t + 1 ) δ ( i = m ) + n = 1 N a m , n T i , j ( t + 1 , n ) b n ( o t + 1 ) ] = d t D t + 1 T i , j ( t , m ) = D t T i , j ( t , m ) . MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaqbaeWabyqaaaaabaGafmivaqLbaGaadaWgaaWcbaGaemyAaKMaeiilaWIaemOAaOgabeaakiabcIcaOiabdsha0jabcYcaSiabd2gaTjabcMcaPiabg2da9iabdsgaKnaaBaaaleaacqWG0baDaeqaaOGafmivaqLbaebadaWgaaWcbaGaemyAaKMaeiilaWIaemOAaOgabeaakiabcIcaOiabdsha0jabcYcaSiabd2gaTjabcMcaPaqaaiabg2da9iabdsgaKnaaBaaaleaacqWG0baDaeqaaOWaamWaaeaacuaHYoGygaacamaaBaaaleaacqWG0baDcqGHRaWkcqaIXaqmaeqaaOGaeiikaGIaemOAaOMaeiykaKIaemyyae2aaSbaaSqaaiabd2gaTjabcYcaSiabdQgaQbqabaGccqWGIbGydaWgaaWcbaGaemOAaOgabeaakiabcIcaOiabd+gaVnaaBaaaleaacqWG0baDcqGHRaWkcqaIXaqmaeqaaOGaeiykaKIaeqiTdqMaeiikaGIaemyAaKMaeyypa0JaemyBa0MaeiykaKIaey4kaSYaaabCaeaacqWGHbqydaWgaaWcbaGaemyBa0MaeiilaWIaemOBa4gabeaakiqbdsfauzaaiaWaaSbaaSqaaiabdMgaPjabcYcaSiabdQgaQbqabaGccqGGOaakcqWG0baDcqGHRaWkcqaIXaqmcqGGSaalcqWGUbGBcqGGPaqkcqWGIbGydaWgaaWcbaGaemOBa4gabeaakiabcIcaOiabd+gaVnaaBaaaleaacqWG0baDcqGHRaWkcqaIXaqmaeqaaOGaeiykaKcaleaacqWGUbGBcqGH9aqpcqaIXaqmaeaacqWGobGta0GaeyyeIuoaaOGaay5waiaaw2faaaqaaiabg2da9iabdsgaKnaaBaaaleaacqWG0baDaeqaaOWaamWaaeaacqWGebardaWgaaWcbaGaemiDaqNaey4kaSIaeGymaedabeaakiabek7aInaaBaaaleaacqWG0baDcqGHRaWkcqaIXaqmaeqaaOGaeiikaGIaemOAaOMaeiykaKIaemyyae2aaSbaaSqaaiabd2gaTjabcYcaSiabdQgaQbqabaGccqWGIbGydaWgaaWcbaGaemOAaOgabeaakiabcIcaOiabd+gaVnaaBaaaleaacqWG0baDcqGHRaWkcqaIXaqmaeqaaOGaeiykaKIaeqiTdqMaeiikaGIaemyAaKMaeyypa0JaemyBa0MaeiykaKIaey4kaSIaemiraq0aaSbaaSqaaiabdsha0jabgUcaRiabigdaXaqabaGcdaaeWbqaaiabdggaHnaaBaaaleaacqWGTbqBcqGGSaalcqWGUbGBaeqaaOGaemivaq1aaSbaaSqaaiabdMgaPjabcYcaSiabdQgaQbqabaGccqGGOaakcqWG0baDcqGHRaWkcqaIXaqmcqGGSaalcqWGUbGBcqGGPaqkcqWGIbGydaWgaaWcbaGaemOBa4gabeaakiabcIcaOiabd+gaVnaaBaaaleaacqWG0baDcqGHRaWkcqaIXaqmaeqaaOGaeiykaKcaleaacqWGUbGBcqGH9aqpcqaIXaqmaeaacqWGobGta0GaeyyeIuoaaOGaay5waiaaw2faaaqaaiabg2da9iabdsgaKnaaBaaaleaacqWG0baDaeqaaOGaemiraq0aaSbaaSqaaiabdsha0jabgUcaRiabigdaXaqabaGcdaWadaqaaiabek7aInaaBaaaleaacqWG0baDcqGHRaWkcqaIXaqmaeqaaOGaeiikaGIaemOAaOMaeiykaKIaemyyae2aaSbaaSqaaiabd2gaTjabcYcaSiabdQgaQbqabaGccqWGIbGydaWgaaWcbaGaemOAaOgabeaakiabcIcaOiabd+gaVnaaBaaaleaacqWG0baDcqGHRaWkcqaIXaqmaeqaaOGaeiykaKIaeqiTdqMaeiikaGIaemyAaKMaeyypa0JaemyBa0MaeiykaKIaey4kaSYaaabCaeaacqWGHbqydaWgaaWcbaGaemyBa0MaeiilaWIaemOBa4gabeaakiabdsfaunaaBaaaleaacqWGPbqAcqGGSaalcqWGQbGAaeqaaOGaeiikaGIaemiDaqNaey4kaSIaeGymaeJaeiilaWIaemOBa4MaeiykaKIaemOyai2aaSbaaSqaaiabd6gaUbqabaGccqGGOaakcqWGVbWBdaWgaaWcbaGaemiDaqNaey4kaSIaeGymaedabeaakiabcMcaPaWcbaGaemOBa4Maeyypa0JaeGymaedabaGaemOta4eaniabggHiLdaakiaawUfacaGLDbaaaeaacqGH9aqpcqWGKbazdaWgaaWcbaGaemiDaqhabeaakiabdseaenaaBaaaleaacqWG0baDcqGHRaWkcqaIXaqmaeqaaOGaemivaq1aaSbaaSqaaiabdMgaPjabcYcaSiabdQgaQbqabaGccqGGOaakcqWG0baDcqGGSaalcqWGTbqBcqGGPaqkaeaacqGH9aqpcqWGebardaWgaaWcbaGaemiDaqhabeaakiabdsfaunaaBaaaleaacqWGPbqAcqGGSaalcqWGQbGAaeqaaOGaeiikaGIaemiDaqNaeiilaWIaemyBa0MaeiykaKIaeiOla4caaaaa@47CC@

Theorem 4 E ˜ MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGafmyrauKbaGaaaaa@2CF7@ i (γ, t, m) = d t E ¯ MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGafmyrauKbaebaaaa@2D00@ i (γ, t, m)

Proof Let us define E ˜ MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGafmyrauKbaGaaaaa@2CF7@ i (γ, t, m) = D t E i (γ, t, m),

E ˜ i ( γ , t , m ) = d t E ¯ i ( γ , t , m ) = d t [ n = 1 N b n ( o t + 1 ) a m , n E ˜ i ( γ , t + 1 , n ) + β ¯ t ( m )  SCORE ( o t , γ ) δ ( m = i ) ] = d t [ D t + 1 n = 1 N b n ( o t + 1 ) a m , n E i ( γ , t + 1 , n ) + D t + 1 β t ( m )  SCORE ( o t , γ ) δ ( m = i ) ] = d t D t + 1 [ n = 1 N b n ( o t + 1 ) a m , n E i ( γ , t + 1 , n ) + β t ( m )  SCORE ( o t , γ ) δ ( m = i ) ] = d t D t + 1 E i ( γ , t , m ) = D t E i ( γ , t , m ) . MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaqbaeWabyqaaaaabaGafmyrauKbaGaadaWgaaWcbaGaemyAaKgabeaakiabcIcaOiabeo7aNjabcYcaSiabdsha0jabcYcaSiabd2gaTjabcMcaPiabg2da9iabdsgaKnaaBaaaleaacqWG0baDaeqaaOGafmyrauKbaebadaWgaaWcbaGaemyAaKgabeaakiabcIcaOiabeo7aNjabcYcaSiabdsha0jabcYcaSiabd2gaTjabcMcaPaqaaiabg2da9iabdsgaKnaaBaaaleaacqWG0baDaeqaaOWaamWaaeaadaaeWbqaaiabdkgaInaaBaaaleaacqWGUbGBaeqaaOGaeiikaGIaem4Ba82aaSbaaSqaaiabdsha0jabgUcaRiabigdaXaqabaGccqGGPaqkcqWGHbqydaWgaaWcbaGaemyBa0MaeiilaWIaemOBa4gabeaakiqbdweafzaaiaWaaSbaaSqaaiabdMgaPbqabaGccqGGOaakcqaHZoWzcqGGSaalcqWG0baDcqGHRaWkcqaIXaqmcqGGSaalcqWGUbGBcqGGPaqkcqGHRaWkcuaHYoGygaqeamaaBaaaleaacqWG0baDaeqaaOGaeiikaGIaemyBa0MaeiykaKIaeeiiaaIaee4uamLaee4qamKaee4ta8KaeeOuaiLaeeyrauKaeiikaGIaem4Ba82aaSbaaSqaaiabdsha0bqabaGccqGGSaalcqaHZoWzcqGGPaqkcqaH0oazcqGGOaakcqWGTbqBcqGH9aqpcqWGPbqAcqGGPaqkaSqaaiabd6gaUjabg2da9iabigdaXaqaaiabd6eaobqdcqGHris5aaGccaGLBbGaayzxaaaabaGaeyypa0Jaemizaq2aaSbaaSqaaiabdsha0bqabaGcdaWadaqaaiabdseaenaaBaaaleaacqWG0baDcqGHRaWkcqaIXaqmaeqaaOWaaabCaeaacqWGIbGydaWgaaWcbaGaemOBa4gabeaakiabcIcaOiabd+gaVnaaBaaaleaacqWG0baDcqGHRaWkcqaIXaqmaeqaaOGaeiykaKIaemyyae2aaSbaaSqaaiabd2gaTjabcYcaSiabd6gaUbqabaGccqWGfbqrdaWgaaWcbaGaemyAaKgabeaakiabcIcaOiabeo7aNjabcYcaSiabdsha0jabgUcaRiabigdaXiabcYcaSiabd6gaUjabcMcaPiabgUcaRiabdseaenaaBaaaleaacqWG0baDcqGHRaWkcqaIXaqmaeqaaOGaeqOSdi2aaSbaaSqaaiabdsha0bqabaGccqGGOaakcqWGTbqBcqGGPaqkcqqGGaaicqqGtbWucqqGdbWqcqqGpbWtcqqGsbGucqqGfbqrcqGGOaakcqWGVbWBdaWgaaWcbaGaemiDaqhabeaakiabcYcaSiabeo7aNjabcMcaPiabes7aKjabcIcaOiabd2gaTjabg2da9iabdMgaPjabcMcaPaWcbaGaemOBa4Maeyypa0JaeGymaedabaGaemOta4eaniabggHiLdaakiaawUfacaGLDbaaaeaacqGH9aqpcqWGKbazdaWgaaWcbaGaemiDaqhabeaakiabdseaenaaBaaaleaacqWG0baDcqGHRaWkcqaIXaqmaeqaaOWaamWaaeaadaaeWbqaaiabdkgaInaaBaaaleaacqWGUbGBaeqaaOGaeiikaGIaem4Ba82aaSbaaSqaaiabdsha0jabgUcaRiabigdaXaqabaGccqGGPaqkcqWGHbqydaWgaaWcbaGaemyBa0MaeiilaWIaemOBa4gabeaakiabdweafnaaBaaaleaacqWGPbqAaeqaaOGaeiikaGIaeq4SdCMaeiilaWIaemiDaqNaey4kaSIaeGymaeJaeiilaWIaemOBa4MaeiykaKIaey4kaSIaeqOSdi2aaSbaaSqaaiabdsha0bqabaGccqGGOaakcqWGTbqBcqGGPaqkcqqGGaaicqqGtbWucqqGdbWqcqqGpbWtcqqGsbGucqqGfbqrcqGGOaakcqWGVbWBdaWgaaWcbaGaemiDaqhabeaakiabcYcaSiabeo7aNjabcMcaPiabes7aKjabcIcaOiabd2gaTjabg2da9iabdMgaPjabcMcaPaWcbaGaemOBa4Maeyypa0JaeGymaedabaGaemOta4eaniabggHiLdaakiaawUfacaGLDbaaaeaacqGH9aqpcqWGKbazdaWgaaWcbaGaemiDaqhabeaakiabdseaenaaBaaaleaacqWG0baDcqGHRaWkcqaIXaqmaeqaaOGaemyrau0aaSbaaSqaaiabdMgaPbqabaGccqGGOaakcqaHZoWzcqGGSaalcqWG0baDcqGGSaalcqWGTbqBcqGGPaqkaeaacqGH9aqpcqWGebardaWgaaWcbaGaemiDaqhabeaakiabdweafnaaBaaaleaacqWGPbqAaeqaaOGaeiikaGIaeq4SdCMaeiilaWIaemiDaqNaeiilaWIaemyBa0MaeiykaKIaeiOla4caaaaa@40E7@


  1. Bilmes J: What HMMs can do. In Tech rep. University of Washington, Seattle; 2002.

    Google Scholar 

  2. Rabiner L, Juang BH: Fundamentals of speech recognition. Printice Hall; 1993.

    Google Scholar 

  3. Durbin R, Eddy S, Krogh A, Mitchison G: Biological sequence analysis. Cambridge University press; 1998.

    Book  Google Scholar 

  4. Vercoutere W, Winters-Hilt S, Olsen H, Deamer D, Haussler D, Akeson M: Rapid discrimination among individual DNA hairpin molecules at single-nucleotide resolution using an ion channel. Nature Biotechnology 2001, 19: 248–252. 10.1038/85696

    Article  CAS  PubMed  Google Scholar 

  5. Vercoutere W, Winters-Hilt S, DeGuzman V, Deamer D, Ridino S, Rodgers J, Olsen H, Marziali A, Akeson M: Discrimination among individual Watson-Crick base pairs at the termini of single DNA hairpin molecules. Nucleic Acids Research 2003, 31(4):1311–1318. 10.1093/nar/gkg218

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  6. Churbanov A, Baribault C, Winters-Hilt S: Duration learning for analysis of nanopore ionic current blockades. BMC Bioinformatics 2007. [MCBIOS IV supplemental proceedings].

    Google Scholar 

  7. Winters-Hilt S, Landry M, Akeson M, Tanase M, Amin I, Coombs A, Morales E, Millet J, Baribault C, Sendamangalam S: Cheminformatics methods for novel nanopore analysis of HIV DNA termini. BMC Bioinformatics 2006, 7(Suppl 2):S22. 10.1186/1471-2105-7-S2-S22

    Article  PubMed Central  PubMed  Google Scholar 

  8. Baum L, Petrie T, Soules G, Weiss N: A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. Ann Math Statist 1970, 41: 164–171. 10.1214/aoms/1177697196

    Article  Google Scholar 

  9. Hirschberg D: A linear-space algorithm for computing maximal common subsequences. Communications of the ACM 1975, 18: 341–343. 10.1145/360825.360861

    Article  Google Scholar 

  10. Grice J, Hughey R, Speck D: Reduced space sequence alignment. CABIOS 1997, 13: 45–53.

    CAS  PubMed  Google Scholar 

  11. Tarnas C, Hughey R: Reduced space hidden Markov model training. Bioinformatics 1998, 14(5):401–406. 10.1093/bioinformatics/14.5.401

    Article  CAS  PubMed  Google Scholar 

  12. Wheeler R, Hughey R: Optimizing reduced-space sequence analysis. Bioinformatics 2000, 16(12):1082–1090. 10.1093/bioinformatics/16.12.1082

    Article  CAS  PubMed  Google Scholar 

  13. Miklós I, Meyer I: A linear memory algorithm for Baum-Welch training. BMC Bioinformatics 2005, 6: 231. 10.1186/1471-2105-6-231

    Article  PubMed Central  PubMed  Google Scholar 

  14. Rabiner L: A tutorial on hidden Markov models and selected applications in speach recognition. Proceedings of IEEE 1989, 77: 257–286. 10.1109/5.18626

    Article  Google Scholar 

  15. Bilmes J: A gentle tutorial of the EM algorithm and its application to parameter estimation for Gaussian mixture and hidden Markov models. In Tech Rep TR-97–021. International Computer Science Institute; 1998.

    Google Scholar 

  16. Wierstra D, Wiering M: Master's Thesis. Volume chap. 5. IDSIA; 2004:36–40. [A New Implementation of Hierarchical Hidden Markov Models].

    Google Scholar 

  17. Kingsbury N, Rayner P: Digital filtering using logarithmic arithmetic. Electronics Letters 1971, 7(2):56–58. 10.1049/el:19710039

    Article  Google Scholar 

  18. Viterbi A: Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Transactions on Information Theory 1967, 13(2):260–269. 10.1109/TIT.1967.1054010

    Article  Google Scholar 

  19. Šrámek R, Brejová B, Vinař T: On-line Viterbi algorithm and its relationship to random walks. Tech rep Comenius and Cornell Universities; 2007. []

    Google Scholar 

  20. Ferguson J: Variable duration models for speech. In Proc Symposium on the application of Hidden Markov Models to text and speech Edited by: Ferguson J. 1980, 143–179.

    Google Scholar 

  21. Mitchell C, Helzerman R, Jamieson L, Harper M: A parallel implementation of a hidden Markov model with duration modeling for speech recognition. Digital Signal Processing, A Review Journal 1995, 5: 298–306. []

    Article  Google Scholar 

  22. Burge C, Karlin S: Predictions of complete gene structures in human genomic DNA. Journal of Molecular Biology 1997, 268: 78–94. 10.1006/jmbi.1997.0951

    Article  CAS  PubMed  Google Scholar 

  23. François JM: Jahmm – Java Hidden Markov Model (HMM).[]

  24. Rahimi A: An erratum for "A tutorial on Hidden Markov Models and selected applications in speech recognition".2000. []

    Google Scholar 

Download references


Authors are grateful for numerous constructive suggestions made by anonymous reviewers.

Author information

Authors and Affiliations


Corresponding author

Correspondence to Alexander Churbanov.

Additional information

Authors' contributions

AC conceptualized the project, implemented and tested the Baum-Welch and Viterbi linear memory procedures. SW–H suggested focus on linear memory algorithms and outlined the idea for the Viterbi linear memory. SW–H helped with writing up the manuscript and provided many valuable suggestions throughout the study. All authors read and approved the final manuscript.

Electronic supplementary material


Additional File 1: Supplementary materials. Contains previously derived recurrences for linear memory HMM implementation with forward sweep and empty start/end states along with corrected recurrences. (PDF 48 KB)

Authors’ original submitted files for images

Rights and permissions

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Reprints and permissions

About this article

Cite this article

Churbanov, A., Winters-Hilt, S. Implementing EM and Viterbi algorithms for Hidden Markov Model in linear memory. BMC Bioinformatics 9, 224 (2008).

Download citation

  • Received:

  • Accepted:

  • Published:

  • DOI: