Baum-Welch training
The Baum-Welch algorithm defines an iterative procedure in which the emission and transition probabilities in iteration n + 1 are set to the number of times each transition and emission is expected to be used when analysing the training sequences with the set of emission and transition probabilities derived in the previous iteration n.
Let t^n_{i,j} denote the transition probability for going from state i to state j in iteration n, e^n_i(y) the emission probability for emitting letter y in state i in iteration n, P(X) the probability of sequence X, and x_k the kth letter in input sequence X, which has length L. We also define X_k as the sequence of letters from the beginning of sequence X up to sequence position k, (x_1, ..., x_k). X^k is defined as the sequence of letters from sequence position k + 1 to the end of the sequence, (x_{k+1}, ..., x_L).
For a given set of training sequences, S, the expectation maximisation update for transition probability t^{n+1}_{i,j} can then be written as

t^{n+1}_{i,j} = [ Σ_{X∈S} 1/P^n(X) Σ_{k=0}^{L-1} f^n(X_k, i) t^n_{i,j} e^n_j(x_{k+1}) b^n(X^{k+1}, j) ] / [ Σ_{X∈S} 1/P^n(X) Σ_{j'} Σ_{k=0}^{L-1} f^n(X_k, i) t^n_{i,j'} e^n_{j'}(x_{k+1}) b^n(X^{k+1}, j') ]   (1)
The superfix n on the quantities on the right-hand side indicates that they are based on the transition probabilities t^n_{i,j} and emission probabilities e^n_i(y) of iteration n. f(X_k, i) := P(x_1, ..., x_k, s(x_k) = i) is the so-called forward probability of the sequence up to and including sequence position k, requiring that sequence letter x_k is read by state i. It is equal to the sum of probabilities of all state paths that finish in state i at sequence position k. The probability of sequence X, P(X), is therefore equal to f(X_L, End). b(X^k, i) := P(x_{k+1}, ..., x_L | s(x_k) = i) is the so-called backward probability of the sequence from sequence position k + 1 to the end, given that the letter at sequence position k, x_k, is read by state i. It is equal to the sum of probabilities of all state paths that start in state i at sequence position k.
For a given set of training sequences, S, the expectation maximisation update for emission probability e^{n+1}_i(y) is

e^{n+1}_i(y) = [ Σ_{X∈S} 1/P^n(X) Σ_{k=1}^{L} δ(x_k = y) f^n(X_k, i) b^n(X^k, i) ] / [ Σ_{X∈S} 1/P^n(X) Σ_{k=1}^{L} f^n(X_k, i) b^n(X^k, i) ]   (2)

δ is the usual delta function with δ(x_k = y) = 1 if x_k = y and δ(x_k = y) = 0 if x_k ≠ y. As before, the superfix n on the quantities on the right-hand side indicates that they are calculated using the transition probabilities t^n_{i,j} and emission probabilities e^n_i(y) of iteration n.
The forward and backward probabilities f^n(X_k, i) and b^n(X^k, i) can be calculated using the forward and backward algorithms [1], which are introduced in the following section.
Baum-Welch training using the forward and backward algorithm
The forward algorithm proposes a procedure for calculating the forward probabilities f(X_k, i) in an iterative way. f(X_k, i) is the sum of probabilities of all state paths that finish in state i at sequence position k.
The recursion starts with the initialisation

f(X_0, Start) = 1 and f(X_0, i) = 0 for i ≠ Start,

where Start is the number of the start state in the HMM. The recursion proceeds towards higher sequence positions via

f(X_{k+1}, i) = e_i(x_{k+1}) Σ_j f(X_k, j) t_{j,i}

and terminates with

P(X) = f(X_L, End) = Σ_j f(X_L, j) t_{j,End},

where End is the number of the end state in the HMM. The recursion can be implemented as a dynamic programming procedure which works its way through a two-dimensional matrix, starting at the start of the sequence in the Start state and finishing at the end of the sequence in the End state of the HMM.
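To make the recursion concrete, here is a minimal Python sketch of the forward algorithm. The model encoding (start, trans, end and emit lists standing in for t_{Start,i}, t_{i,j}, t_{i,End} and e_i(y)) and the toy parameter values are hypothetical and not taken from this paper; the exponential brute-force sum over all state paths is included only to check the recursion on a tiny example.

```python
import itertools

def forward(x, start, trans, end, emit):
    """Forward algorithm: f[k][i] is the sum of probabilities of all
    state paths that finish in state i after reading x[0..k-1].
    Returns (f, P) where P = P(X) is read off in the End state."""
    M, L = len(start), len(x)
    f = [[0.0] * M for _ in range(L + 1)]   # row k = 0 is the empty prefix
    for k in range(L):
        for i in range(M):
            prev = start[i] if k == 0 else sum(f[k][j] * trans[j][i] for j in range(M))
            f[k + 1][i] = prev * emit[i][x[k]]
    P = sum(f[L][i] * end[i] for i in range(M))   # transition into End
    return f, P

def brute_force_P(x, start, trans, end, emit):
    """Exponential check: sum the probability of every state path."""
    M, L = len(start), len(x)
    total = 0.0
    for path in itertools.product(range(M), repeat=L):
        p = start[path[0]] * emit[path[0]][x[0]]
        for k in range(1, L):
            p *= trans[path[k - 1]][path[k]] * emit[path[k]][x[k]]
        total += p * end[path[-1]]
    return total

# hypothetical toy model (each row of trans plus its end entry sums to 1)
start = [0.6, 0.4]                      # t_{Start,i}
trans = [[0.7, 0.2], [0.3, 0.5]]        # t_{i,j} among emitting states
end   = [0.1, 0.2]                      # t_{i,End}
emit  = [[0.9, 0.1], [0.2, 0.8]]        # e_i(y) over alphabet {0, 1}
x = [0, 1, 1, 0]
f, P = forward(x, start, trans, end, emit)
```

The full (L + 1) × M table is kept here only for clarity; the recursion itself ever touches two adjacent columns, which is the observation the memory-sparse variants below exploit.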
The backward algorithm calculates the backward probabilities b(X^k, i) in a similar iterative way. b(X^k, i) is the sum of probabilities of all state paths that start in state i at sequence position k. In contrast to the forward algorithm, the backward algorithm starts at the end of the sequence in the End state and finishes at the start of the sequence in the Start state of the HMM.
The backward algorithm starts with the initialisation

b(X^L, i) = t_{i,End}

and continues towards lower sequence positions with the recursion

b(X^k, i) = Σ_j t_{i,j} e_j(x_{k+1}) b(X^{k+1}, j)

and terminates with

P(X) = b(X^0, Start).
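The backward recursion can be sketched in the same hypothetical toy encoding; as a check, its termination value b(X^0, Start) is compared against an exponential sum over all state paths.

```python
import itertools

def backward(x, start, trans, end, emit):
    """Backward algorithm: b[k][i] plays the role of b(X^k, i), the sum
    of probabilities of all state paths for the suffix x[k:] that start
    in state i at sequence position k. Returns (b, P) with P = b(X^0, Start)."""
    M, L = len(start), len(x)
    b = [[0.0] * M for _ in range(L + 1)]
    for i in range(M):
        b[L][i] = end[i]                       # initialisation: t_{i,End}
    for k in range(L - 1, 0, -1):
        for i in range(M):
            b[k][i] = sum(trans[i][j] * emit[j][x[k]] * b[k + 1][j]
                          for j in range(M))
    # termination: transitions out of the silent Start state
    P = sum(start[i] * emit[i][x[0]] * b[1][i] for i in range(M))
    return b, P

def brute_force_P(x, start, trans, end, emit):
    """Exponential check: sum the probability of every state path."""
    M, L = len(start), len(x)
    total = 0.0
    for path in itertools.product(range(M), repeat=L):
        p = start[path[0]] * emit[path[0]][x[0]]
        for k in range(1, L):
            p *= trans[path[k - 1]][path[k]] * emit[path[k]][x[k]]
        total += p * end[path[-1]]
    return total

# hypothetical toy model
start = [0.6, 0.4]
trans = [[0.7, 0.2], [0.3, 0.5]]
end   = [0.1, 0.2]
emit  = [[0.9, 0.1], [0.2, 0.8]]
x = [0, 1, 1, 0]
b, P = backward(x, start, trans, end, emit)
```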
As can be seen in the recursion steps of the forward and backward algorithms described above, the calculation of f(X_{k+1}, i) requires at most T_max previously calculated elements f(X_k, j) for j ∈ {1, ..., M}. T_max is the maximum number of states that any state of the model is connected to. Likewise, the calculation of b(X^k, i) refers to at most T_max elements b(X^{k+1}, j) for j ∈ {1, ..., M}.
In order to continue the calculation of the forward and backward values f(X_k, i) and b(X^k, i) for all states i ∈ {1, ..., M} along the entire sequence, we thus only have to memorise M elements.
Baum-Welch training using the checkpointing algorithm
Until now, the checkpointing algorithm [11–13] was the most efficient way to perform Baum-Welch training. The basic idea of the checkpointing algorithm is to perform the forward and backward algorithms by memorising the forward and backward values only in √L columns along the sequence dimension of the dynamic programming table. The checkpointing algorithm starts with the forward algorithm, retaining only the forward values in √L columns. These columns partition the dynamic programming table into √L separate fields. The checkpointing algorithm then invokes the backward algorithm, which memorises the backward values in a strip of length √L as it moves along the sequence. When the backward calculation reaches the boundary of one field, the precalculated forward values of the neighbouring checkpointing column are used to calculate the corresponding forward values for that field. The forward and backward values of that field are then available at the same time and are used to calculate the corresponding values for the EM update.
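The basic checkpointing idea can be sketched as follows, again in the hypothetical toy encoding used above: forward columns are retained only at checkpoint positions (the stride parameter playing the role of the field width), and the columns of a single field are later recomputed from the nearest checkpoint, exactly as the backward sweep would trigger them.

```python
def forward_columns(x, start, trans, emit):
    """Yield the forward column f(X_k, .) for k = 1..L, one at a time."""
    M = len(start)
    col = None
    for k, letter in enumerate(x):
        col = [emit[i][letter] *
               (start[i] if k == 0 else sum(col[j] * trans[j][i] for j in range(M)))
               for i in range(M)]
        yield col

def checkpoint_forward(x, start, trans, emit, stride):
    """Retain only every stride-th forward column (plus column 1)."""
    cps = {}
    for k, col in enumerate(forward_columns(x, start, trans, emit), start=1):
        if k == 1 or k % stride == 0:
            cps[k] = col
    return cps

def recompute_field(x, start, trans, emit, cps, k_from, k_to):
    """Recompute the forward columns (k_from, k_to] from the checkpoint
    column at k_from, as needed when the backward sweep enters a field."""
    M = len(start)
    col, cols = cps[k_from], {}
    for k in range(k_from, k_to):
        letter = x[k]                      # letter read at position k + 1
        col = [emit[i][letter] * sum(col[j] * trans[j][i] for j in range(M))
               for i in range(M)]
        cols[k + 1] = col
    return cols

# hypothetical toy model
start = [0.6, 0.4]
trans = [[0.7, 0.2], [0.3, 0.5]]
emit  = [[0.9, 0.1], [0.2, 0.8]]
x = [0, 1, 1, 0, 1]
full = list(forward_columns(x, start, trans, emit))   # full table, for checking only
cps  = checkpoint_forward(x, start, trans, emit, stride=2)
cols = recompute_field(x, start, trans, emit, cps, 2, 4)
```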
The checkpointing algorithm can be further refined by using embedded checkpoints. With an embedding level of k, the forward values in L^{1/k} columns of the initial calculation are memorised, thus defining L^{(k-1)/k}-long fields. When the memory-sparse calculation of the backward values reaches the field in question, the forward algorithm is invoked again to calculate the forward values for L^{1/k} additional columns within that field. This procedure is iterated k times within the thus emerging fields. In the end, for each of the L^{1/k}-long k-subfields, the forward and backward values are simultaneously available and are used to calculate the corresponding values for the EM update. The time complexity of this algorithm for one Baum-Welch iteration and a given training sequence of length L is O(kLMT_max + L(T + E)), since k forward and 1 backward algorithms have to be invoked, and the memory complexity is O(kL^{1/k}M). For k = log(L), this amounts to a time requirement of O(log(L)LMT_max + L(T + E)) and a memory requirement of O(log(L)M), since L^{1/log(L)} = e.
Baum-Welch training using the new algorithm
It is not trivial to see that the quantities t^{n+1}_{i,j} and e^{n+1}_i(y) of Equations 1 and 2 can be calculated in an even more memory-sparse way, as both the forward and the corresponding backward probabilities are needed at the same time in order to calculate the terms f^n(X_k, i) t^n_{i,j} e^n_j(x_{k+1}) b^n(X^{k+1}, j) in t^{n+1}_{i,j} and f^n(X_k, i) b^n(X^k, i) in e^{n+1}_i(y) of Equations 1 and 2. A calculation of these quantities for each sequence position using a memory-sparse implementation (that would memorise only M values at a time) both for the forward and the backward algorithm would require L times more time, i.e. significantly more time. Also, an algorithm along the lines of the Hirschberg algorithm [9, 10] cannot be applied, as we cannot halve the dynamic programming table after the first recursion.
We here propose a new algorithm to calculate the quantities t^{n+1}_{i,j} and e^{n+1}_i(y) which are required for Baum-Welch training. Our algorithm requires O(M) memory and O(LMT_max(T + E)) time rather than O(log(L)M) memory and O(log(L)LMT_max + L(T + E)) time.
The trick for coming up with a memory-efficient algorithm is to realise that the sums over sequence positions in Equations 1 and 2 can be interpreted as weighted sums over state paths. For example, t^n_{i,j}(X) := Σ_{k=0}^{L-1} f^n(X_k, i) t^n_{i,j} e^n_j(x_{k+1}) b^n(X^{k+1}, j) in the numerator in Equation 1 is the weighted sum of probabilities of state paths for sequence X that contain at least one i → j transition, and the weight of each such state path in the sum is the number of times this transition occurs in the state path.
We now show how t^n_{i,j}(X) in Equation 1 can be calculated in O(M) memory and O(LMT_max) time. As the superfix n is only there to remind us that the calculation of t^n_{i,j}(X) is based on the transition and emission probabilities of iteration n, and as this index does not change in the calculation of t^n_{i,j}(X), we discard it in the following for simplicity's sake.
Let t_{i,j}(X_k, l) denote the weighted sum of probabilities of state paths that finish in state l at sequence position k of sequence X and that contain at least one i → j transition, where the weight for each state path is equal to its number of i → j transitions.
Theorem 1: The following algorithm calculates t_{i,j}(X) in O(M) memory and O(LMT_max) time. t_{i,j}(X) is the weighted sum of probabilities of all state paths for sequence X that have at least one i → j transition, where the weight for each state path is equal to its number of i → j transitions.
The algorithm starts with the initialisation

f(X_0, Start) = 1, f(X_0, m) = 0 for m ≠ Start and t_{i,j}(X_0, m) = 0 for all m ∈ {1, ..., M}

and proceeds via the following recursion

f(X_{k+1}, m) = e_m(x_{k+1}) Σ_n f(X_k, n) t_{n,m}
t_{i,j}(X_{k+1}, m) = e_m(x_{k+1}) ( Σ_n t_{i,j}(X_k, n) t_{n,m} + δ(m = j) f(X_k, i) t_{i,j} )

and finishes with

P(X) = Σ_m f(X_L, m) t_{m,End}
t_{i,j}(X) = Σ_m t_{i,j}(X_L, m) t_{m,End} + δ(j = End) f(X_L, i) t_{i,End}   (4)
Proof:
 (1)
It is obvious that the recursion requires only O(M) memory, as the calculation of all f(X_{k+1}, m) values with m ∈ {1, ..., M} requires only access to the M previous f(X_k, n) values with n ∈ {1, ..., M}. Likewise, the calculations of all t_{i,j}(X_{k+1}, m) values with m ∈ {1, ..., M} refer only to the M elements t_{i,j}(X_k, n) with n ∈ {1, ..., M}. We therefore have to remember only a thin "slice" of t_{i,j} and f values at sequence position k in order to be able to calculate the t_{i,j} and f values for the next sequence position k + 1. The time requirement to calculate t_{i,j} is O(LMT_max): for every sequence position and for every state in the HMM, we have to sum at most T_max terms in order to calculate the f and the t_{i,j} values.
 (2)
The f(X_k, m) values are identical to the previously defined forward probabilities and are calculated in the same way as in the forward algorithm.
 (3)
We now prove by induction that t_{i,j}(X_k, l) is equal to the weighted sum of probabilities of state paths that have at least one i → j transition and that finish at sequence position k in state l, the weight of each state path being equal to its number of i → j transitions.
Initialisation step (sequence position k = 0): t_{i,j}(X_0, m) = 0 is true, as the sum of probabilities of state paths that finish in state m at sequence position 0 and that have at least one i → j transition is zero. Induction step k → k + 1: We now show that if Equation 3 is true for sequence position k, it is also true for k + 1. We have to distinguish two cases:
 (i) m = j:

t_{i,j}(X_{k+1}, j) = f(X_k, i) t_{i,j} e_j(x_{k+1})   (5)
                    + Σ_n t_{i,j}(X_k, n) t_{n,j} e_j(x_{k+1})   (6)

The first term, see the right-hand side of 5, is the sum of probabilities of state paths that finish at sequence position k + 1 and whose last transition is i → j. The second term, see 6, is the sum of probabilities of state paths that finish at sequence position k + 1 and that already have at least one i → j transition. Note that the term in 6 also contains a contribution for n = i. This ensures that the weight of those state paths that already have at least one i → j transition is correctly increased by 1. The sum, t_{i,j}(X_{k+1}, m), is therefore the weighted sum of probabilities of state paths that finish at sequence position k + 1 and contain at least one i → j transition. Each state path's weight in the sum is equal to its number of i → j transitions.
 (ii) m ≠ j:

t_{i,j}(X_{k+1}, m) = Σ_n t_{i,j}(X_k, n) t_{n,m} e_m(x_{k+1})

The expression on the right-hand side is the weighted sum of probabilities of state paths that finish at sequence position k + 1 and contain at least one i → j transition.
We have therefore shown that if Equation 3 is true for sequence position k, it is also true for sequence position k + 1. This concludes the proof of theorem 1. □
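The recursion of Theorem 1 translates directly into a linear-memory routine: only the current f-slice and the current t_{i,j}-slice are kept. The sketch below assumes the hypothetical toy encoding used earlier, with i and j both emitting states (so the δ(j = End) case of the termination does not arise); the exponential check recounts i → j transitions path by path.

```python
import itertools

def t_count(x, start, trans, end, emit, i, j):
    """Theorem 1 sketch: compute t_{i,j}(X) keeping only two M-vectors,
    the current forward slice f(X_k, .) and the slice t_{i,j}(X_k, .)."""
    M, L = len(start), len(x)
    f = [start[m] * emit[m][x[0]] for m in range(M)]    # f(X_1, .)
    t = [0.0] * M                                       # t_{i,j}(X_1, .) = 0
    for k in range(1, L):
        y = x[k]                                        # letter at position k + 1
        new_f = [emit[m][y] * sum(f[n] * trans[n][m] for n in range(M))
                 for m in range(M)]
        new_t = [emit[m][y] * sum(t[n] * trans[n][m] for n in range(M))
                 for m in range(M)]
        new_t[j] += emit[j][y] * f[i] * trans[i][j]     # the delta(m = j) term
        f, t = new_f, new_t
    return sum(t[m] * end[m] for m in range(M))         # leave through End

def brute_t(x, start, trans, end, emit, i, j):
    """Exponential check: weight every path by its number of i -> j transitions."""
    M, L = len(start), len(x)
    total = 0.0
    for path in itertools.product(range(M), repeat=L):
        p = start[path[0]] * emit[path[0]][x[0]]
        for k in range(1, L):
            p *= trans[path[k - 1]][path[k]] * emit[path[k]][x[k]]
        w = sum(1 for k in range(1, L) if (path[k - 1], path[k]) == (i, j))
        total += w * p * end[path[-1]]
    return total

# hypothetical toy model
start = [0.6, 0.4]
trans = [[0.7, 0.2], [0.3, 0.5]]
end   = [0.1, 0.2]
emit  = [[0.9, 0.1], [0.2, 0.8]]
x = [0, 1, 1, 0]
```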
It is easy to show that e_i(y, X) in Equation 2 can also be calculated in O(M) memory and O(LMT_max) time in a similar way as t_{i,j}(X). Let e_i(y, X_k, l) denote the weighted sum of probabilities of state paths that finish at sequence position k in state l and for which state i reads letter y at least once, the weight of each state path being equal to the number of times state i reads letter y. As in the calculation of t_{i,j}(X), we again omit the superfix n, as the calculation of e_i(y, X) is again entirely based on the transition and emission probabilities of iteration n.
Theorem 2: e_i(y, X) can be calculated in O(M) memory and O(LMT_max) time using the following algorithm. e_i(y, X) is the weighted sum of probabilities of state paths for sequence X that read letter y in state i at least once, the weight of each state path being equal to the number of times letter y is read by state i.
Initialisation step:

f(X_0, Start) = 1, f(X_0, m) = 0 for m ≠ Start and e_i(y, X_0, m) = 0 for all m ∈ {1, ..., M}

Recursion:

f(X_{k+1}, m) = e_m(x_{k+1}) Σ_n f(X_k, n) t_{n,m}
e_i(y, X_{k+1}, m) = e_m(x_{k+1}) Σ_n e_i(y, X_k, n) t_{n,m} + δ(m = i) δ(x_{k+1} = y) f(X_{k+1}, i)

Termination step:

P(X) = Σ_m f(X_L, m) t_{m,End}
e_i(y, X) = Σ_m e_i(y, X_L, m) t_{m,End}   (7)
Proof: The proof is strictly analogous to the proof of theorem 1.
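Theorem 2 admits the same kind of linear-memory sketch, with the weight now growing by 1 whenever state i reads letter y. As before, the encoding and toy parameters are hypothetical, and an exponential path enumeration serves as the check.

```python
import itertools

def e_count(x, start, trans, end, emit, i, y):
    """Theorem 2 sketch: compute e_i(y, X) keeping only two M-vectors;
    a path's weight grows by 1 each time state i reads letter y."""
    M, L = len(start), len(x)
    f = [start[m] * emit[m][x[0]] for m in range(M)]    # f(X_1, .)
    e = [0.0] * M
    if x[0] == y:
        e[i] = f[i]                                     # i reads y at position 1
    for k in range(1, L):
        c = x[k]
        new_f = [emit[m][c] * sum(f[n] * trans[n][m] for n in range(M))
                 for m in range(M)]
        new_e = [emit[m][c] * sum(e[n] * trans[n][m] for n in range(M))
                 for m in range(M)]
        if c == y:
            new_e[i] += new_f[i]        # delta(m = i) delta(x_{k+1} = y) term
        f, e = new_f, new_e
    return sum(e[m] * end[m] for m in range(M))         # leave through End

def brute_e(x, start, trans, end, emit, i, y):
    """Exponential check: weight every path by how often state i reads y."""
    M, L = len(start), len(x)
    total = 0.0
    for path in itertools.product(range(M), repeat=L):
        p = start[path[0]] * emit[path[0]][x[0]]
        for k in range(1, L):
            p *= trans[path[k - 1]][path[k]] * emit[path[k]][x[k]]
        w = sum(1 for k in range(L) if path[k] == i and x[k] == y)
        total += w * p * end[path[-1]]
    return total

# hypothetical toy model
start = [0.6, 0.4]
trans = [[0.7, 0.2], [0.3, 0.5]]
end   = [0.1, 0.2]
emit  = [[0.9, 0.1], [0.2, 0.8]]
x = [0, 1, 1, 0]
```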
The above theorems have shown that t_{i,j}(X) and e_i(y, X) can each be calculated in O(M) memory and O(LMT_max) time. As there are T transition parameters and E emission parameters to be calculated in each Baum-Welch iteration, and as these T + E values can be calculated independently, the time and memory requirements for each iteration and a set of training sequences whose sum of sequence lengths is L using our new algorithm are

O(M) memory and O(LMT_max(T + E)) time, if all parameter estimates are calculated consecutively

O(M(T + E)) memory and O(LMT_max) time, if all parameter estimates are calculated in parallel

more generally, O(Mc) memory and O(LMT_max(T + E)/c) time for any c ∈ {1, ..., (T + E)}, if c of the T + E parameters are calculated in parallel
Note that the calculation of P(X) is a byproduct of each t_{i,j}(X) and each e_i(y, X) calculation, see Equations 4 and 7, and that T is equal to the number of free transition parameters in the HMM, which is usually smaller than the number of transition probabilities. Likewise, E is the number of free emission parameters in the HMM, which may differ from the number of emission probabilities when the probabilities are parametrised.