Computing the probability of a labelling and of a footprint
We begin with the simple problems of computing the probability of a labelling λ or of a footprint f. For both of these problems, our algorithm is a variant of the traditional forward algorithm for HMMs [3], which computes the probability that an HMM generates a sequence y in O(nm2) time.
To compute the probability of all paths compatible with the labelling, we follow the usual forward algorithm, but at position i in the sequence, we restrict to allowing only transitions in the model between states labelled λi-1 and λi; other transitions are given probability zero. This algorithm takes less time than the forward algorithm to run; if the largest number of edges between any two (possibly identical) labels λi and λj is q, then this algorithm takes O(qn) time, which is O(L2n).
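To make this concrete, here is a minimal sketch of the restricted forward algorithm in Python. The array representation (trans, emit, init, label) and the function name are our own assumptions, not notation from the text; for brevity the sketch multiplies full vectors and matrices, so it runs in O(nm2) rather than the O(qn) obtained by iterating only over the edges between the two active label classes.

```python
import numpy as np

def labelling_probability(trans, emit, init, label, obs, lam):
    """Probability of all paths consistent with the labelling lam.

    trans[j, k]: transition probability from state j to state k
    emit[k, y]:  probability that state k emits symbol y
    init[k]:     probability that the first state is k
    label[k]:    label of state k (integer-coded numpy array)
    obs, lam:    observed symbols y_1..y_n and labels lam_1..lam_n
    """
    phi = np.where(label == lam[0], init * emit[:, obs[0]], 0.0)
    for i in range(1, len(obs)):
        # propagate, then zero out states whose label is not lam_i;
        # together with the previous step, this permits only transitions
        # from states labelled lam_{i-1} to states labelled lam_i
        phi = (phi @ trans) * emit[:, obs[i]]
        phi = np.where(label == lam[i], phi, 0.0)
    return phi.sum()
```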
To compute the probability of all paths compatible with the footprint f = f1, ..., fk, we follow the usual forward algorithm, but on a variant HMM. We will create k groups of states G1, ..., Gk, one for each position fi in the footprint, with group Gi corresponding to the set of states whose label is fi: each such state is represented in Gi. States in the new model have exactly the same emission probabilities as in M, but we may only make transitions from states in Gi to those in Gi or in Gi+1, with the same transition probabilities as in M. (To make a proper HMM, we can create a "dump" state for paths in M that do not respect the footprint we seek, or not bother.) See Figure 2.
We require that the initial transition is from x0 to a state in G1, and we compute only the sum of the forward probabilities ending in states in Gk that have made all of the required label transitions in the footprint f. The new model has O(Lk) states and each state has degree at most 2L, so computing the probability of a footprint can be done using the forward algorithm in O(L2kn) time. Note that this could be quadratic in n, as k, the length of the footprint, is bounded above by n.
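A sketch of this computation follows, using the same hypothetical array representation as above. Rather than materializing the variant HMM explicitly, it maintains one forward vector per group; the implementation is naive (O(km2) per position) rather than tuned to the O(L2kn) bound, which requires touching only the edges within and between consecutive label classes.

```python
import numpy as np

def footprint_probability(trans, emit, init, label, obs, footprint):
    """Probability of all paths whose footprint is `footprint` (a sketch).

    Group i holds the states whose label is footprint[i]; transitions may
    stay within G_i or advance to G_{i+1}, exactly as in the text.
    """
    n, m, k = len(obs), len(init), len(footprint)
    groups = [np.flatnonzero(label == f) for f in footprint]
    # F[i][s]: forward probability of being in state s within group i
    F = [np.zeros(m) for _ in range(k)]
    F[0][groups[0]] = init[groups[0]] * emit[groups[0], obs[0]]
    for pos in range(1, n):
        newF = [np.zeros(m) for _ in range(k)]
        for i in range(k):
            if not F[i].any():
                continue                        # group i not reachable yet
            stay = F[i] @ trans
            newF[i][groups[i]] += stay[groups[i]]          # stay in G_i
            if i + 1 < k:
                newF[i + 1][groups[i + 1]] += stay[groups[i + 1]]  # advance
        for i in range(k):
            newF[i] *= emit[:, obs[pos]]
        F = newF
    # only paths that completed all k label transitions count
    return F[k - 1].sum()
```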
Computing the probability of a ball of labellings
Computing the probability of B(λ, r), the ball of labellings around λ with radius r, can be done efficiently for the border shift distance. As before, we produce a group of states Gi for each position fi in the footprint f = f1 ... fk of the labelling λ. Each state in Gi has nonzero transition probability only to states in Gi or Gi+1. However, if bi(λ) is the position of the i-th border in the labelling λ, then we allow states in Gi to appear only in positions from bi-1 - r to bi + r - 1. For each position j, we can identify the groups that can be active at that position in this way. Computing the ball probability is easily accomplished using the pattern for computing the probability of a labelling: we set to zero the probability of any transition into a new label outside its valid interval, and we forbid transitions within a label after its valid interval ends. We then run the forward algorithm on the updated HMM, which we call M'. M' has exactly the same states as for computing a footprint, meaning that the runtime to compute the probability of the ball B(λ, r) is O(L2kn). However, we can reduce this runtime. If tj is the number of groups of states Gi that could be active at position j, then allowing each boundary to shift by r increases tj by 1 at 2r positions. The asymptotic runtime is q∑tj, which is O(qn + qrk). Again, since q ≤ L2, this is O(nL2 + rkL2), which may be well less than the runtime of the forward algorithm on the original model M, depending on r and k.
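The following sketch combines the grouped construction with the active intervals just described. The borders argument is our own encoding of the bi(λ) values, and the implementation is again naive rather than tuned to the O(qn + qrk) bound.

```python
import numpy as np

def ball_probability_border_shift(trans, emit, init, label, obs,
                                  footprint, borders, r):
    """Pr[B(lambda, r)] under the (maximum) border shift distance (a sketch).

    borders[i] is the first position (0-based) of group i + 1 in the centre
    labelling, so group i may be occupied only at positions in the window
    b_{i-1} - r .. b_i + r - 1, as in the text.
    """
    n, m, k = len(obs), len(init), len(footprint)
    groups = [np.flatnonzero(label == f) for f in footprint]
    b = [0] + list(borders) + [n]
    lo = [0] + [max(0, b[i] - r) for i in range(1, k)]
    hi = [min(n - 1, b[i + 1] + r - 1) for i in range(k - 1)] + [n - 1]
    F = [np.zeros(m) for _ in range(k)]
    F[0][groups[0]] = init[groups[0]] * emit[groups[0], obs[0]]
    for pos in range(1, n):
        newF = [np.zeros(m) for _ in range(k)]
        for i in range(k):
            if not F[i].any():
                continue
            stay = F[i] @ trans
            if lo[i] <= pos <= hi[i]:          # may remain within group i
                newF[i][groups[i]] += stay[groups[i]]
            if i + 1 < k and lo[i + 1] <= pos <= hi[i + 1]:
                newF[i + 1][groups[i + 1]] += stay[groups[i + 1]]
        for i in range(k):
            newF[i] *= emit[:, obs[pos]]
        F = newF
    return F[k - 1].sum()                      # every border was made in its window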
For the border shift sum distance, we must remember the total amount by which we have already shifted the first i - 1 borders before we handle border i. As such, we can imagine the same sort of HMM as for the maximum border shift distance, but with an additional parameter in each state that keeps track of the total mismatch accumulated so far. The model winds up with r times as many states this way, but the set of active groups at each position, and the degree of each node in the new HMM, stay the same. The overall runtime is r times slower than for the maximum border shift distance, or O(qnr + qr2k). We note also that the values of r chosen in practical applications are likely to be much larger for this distance measure than for the maximum border shift measure; as such, we expect that these algorithms are not likely to be as useful in practice.
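A sketch of the augmented construction, extending the previous one: a state of the variant model is now a pair (group, accumulated shift). The window pruning that gives the runtime quoted in the text is omitted here for brevity.

```python
import numpy as np

def ball_probability_shift_sum(trans, emit, init, label, obs,
                               footprint, borders, r):
    """Pr[B(lambda, r)] under the border shift sum distance (a sketch).

    Each copy of a group also records d, the total shift already spent on
    earlier borders, so a state of the variant model is a pair (i, d).
    """
    n, m, k = len(obs), len(init), len(footprint)
    groups = [np.flatnonzero(label == f) for f in footprint]
    b = list(borders)                  # b[i]: centre position of border i + 1
    F = {(0, 0): np.zeros(m)}          # F[(i, d)]: forward vector
    F[(0, 0)][groups[0]] = init[groups[0]] * emit[groups[0], obs[0]]
    for pos in range(1, n):
        newF = {}
        for (i, d), vec in F.items():
            stay = vec @ trans
            tgt = newF.setdefault((i, d), np.zeros(m))
            tgt[groups[i]] += stay[groups[i]]           # remain in group i
            if i + 1 < k:
                cost = d + abs(pos - b[i])              # border i + 1 placed here
                if cost <= r:                           # total shift still within r
                    tgt = newF.setdefault((i + 1, cost), np.zeros(m))
                    tgt[groups[i + 1]] += stay[groups[i + 1]]
        for vec in newF.values():
            vec *= emit[:, obs[pos]]
        F = newF
    return sum(vec.sum() for (i, d), vec in F.items() if i == k - 1)
```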
Finally, for the generalized Hamming distance, with label distance matrix H, we can again hybridize the forward algorithm. Consider again our reference labelling λ, and create a new HMM with (r + 1)m states, where each state's identifier is (j, d). The semantic meaning of being in state (j, d) at position i of the sequence is that the HMM M is in state j and we have accumulated d distance from λ in the first i positions of the sequence; we create such states for all d in the range from 0 to r. If there is a transition from j to k in the HMM M, and if H(λi+1, L(k)) = h, where L(k) is the label of state k, then we draw an edge in our new HMM from (j, d) to (k, d + h) for all d such that d + h ≤ r; if d + h is more than r, then we do not allow the transition. The new HMM has (r + 1)m states, each of which has the same degree as in M. We compute the total forward probabilities of all states (j, d) at position n, and this gives the probability of B(λ, r) for the generalized Hamming distance, in runtime r times that of the forward algorithm on the original sequence, or at most O(m2nr).
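A sketch of this product construction, assuming integer-coded labels so that H can be indexed directly; names are again our own.

```python
import numpy as np

def ball_probability_hamming(trans, emit, init, label, obs, lam, H, r):
    """Pr[B(lambda, r)] under the generalized Hamming distance (a sketch).

    Forward algorithm on the product model whose states are pairs (j, d):
    HMM state j with accumulated label distance d <= r. H is the label
    distance matrix; lam is the reference labelling.
    """
    n, m = len(obs), len(init)
    F = np.zeros((m, r + 1))             # F[j, d]: forward probability of (j, d)
    for j in range(m):
        d = H[lam[0], label[j]]          # distance incurred at position 1
        if d <= r:
            F[j, d] = init[j] * emit[j, obs[0]]
    for i in range(1, n):
        newF = np.zeros((m, r + 1))
        for s in range(m):
            h = H[lam[i], label[s]]      # distance incurred by entering state s
            if h > r:
                continue
            # mass arriving in state s, broken down by accumulated distance;
            # transitions that would push the distance past r are dropped
            incoming = F.T @ trans[:, s]
            newF[s, h:] = incoming[: r + 1 - h] * emit[s, obs[i]]
        F = newF
    return F.sum()
```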
Hardness results
We might hope that we can compute the labelling for which B(λ, r) is of highest probability for a given r and one of our distance measures. Unfortunately, this is NP-hard, because the problem of computing the maximum probability labelling is also NP-hard [6], even for a fixed HMM with only two labels.
Optimizing the ball with radius 0, B(λ, 0), is exactly this problem, so optimizing balls for general radius r is clearly NP-hard for either the border shift or border shift sum distances. Similarly, it is NP-hard for the generalized Hamming distance, since if we use radius r = 0 and a label distance matrix H with value 1 for mismatched labels and 0 for matched labels, we are again maximizing the probability of a labelling. We conjecture that this problem stays NP-hard when restricted to the pure Hamming distance on state paths, rather than labels.
Furthermore, maximizing the probability of a ball of labellings under any natural distance remains NP-hard even if we restrict attention to a single footprint f. The original HMM used in the proof that maximizing the label probability is NP-hard has only two labels, so a footprint is determined by its first label and its length, and there are only 2n possible footprints for a sequence of length n in this HMM. As such, if there existed a polynomial-time algorithm that finds a maximum probability ball for a given footprint f, we could solve the most probable labelling problem by setting r = 0 and running that algorithm for each of the O(n) footprints.
Optimizing balls of states and paths: local search and special cases
Still, we can hope to find good balls of labellings using local search, or using global optimization in special cases. Here, we give a very efficient local search procedure to help optimize the probability of a ball of labellings under the border shift distance, and a global optimization procedure for computing the best centre for a ball of paths, not labellings, with the restriction that the centre path must divide the sequence into long intervals, each using a single state.
Local search heuristic
The local search heuristic starts from a candidate labelling λ. At each iteration, it computes the probability of every ball centered at a labelling λ' matching λ in all borders but one, which is shifted by one position in either direction. We move to the λ' of highest ball probability and iterate until no improvement can be obtained by shifting a single border in the active λ. We can directly compute the ball probability for each of the 2k neighbours of λ; this gives runtime O(k2rL2 + knL2) per iteration.
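The outer loop of the heuristic might look as follows; ball_prob is a placeholder for any ball probability routine (for example, the border shift sketch above, partially applied), and checks that shifted borders remain ordered and within the sequence are omitted for brevity.

```python
def local_search(borders, ball_prob):
    """Best-improvement hill climbing over border positions (a sketch)."""
    current = ball_prob(borders)
    while True:
        best_cand, best_p = None, current
        for i in range(len(borders)):
            for delta in (-1, 1):            # shift border i by one position
                cand = list(borders)
                cand[i] += delta
                p = ball_prob(cand)
                if p > best_p:
                    best_cand, best_p = cand, p
        if best_cand is None:
            return borders, current          # no single shift improves
        borders, current = best_cand, best_p
```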
We can improve this runtime by noticing that the active states at corresponding positions in λ and a neighbour λ' are identical for most sequence positions: if λ' consists of moving the border at position j in λ to position j + 1, this only changes the active states at positions j - r and j + r.
We can precompute the forward and backward probabilities of all states and positions for the ball of radius r around λ. For any HMM M, let ϕi(k) be the probability that M, started at the initial state x0, emits y1, ..., yi and that xi = k, and let βi(k) be the probability that M, started at state k, emits yi, ..., yn. The probability that M emits y is the dot product of ϕi and βi+1 for any i. We can compute the forward and backward vectors ϕi and βi for all positions i in time proportional to the runtime of the forward algorithm. In particular, if we have computed Pr[B(λ, r)], we can also compute all of the forward and backward vectors at all positions of the sequence in the same asymptotic runtime.
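A sketch of the precomputation follows. Note that it uses the more common convention β_i(k) = Pr[yi+1, ..., yn | xi = k], so that the dot product of ϕi and βi (rather than βi+1) gives the emission probability; the text's indexing differs by one.

```python
import numpy as np

def forward_backward(trans, emit, init, obs):
    """Forward vectors phi_i and backward vectors beta_i (a sketch).

    phi[i, s]  = Pr[y_1..y_i emitted and x_i = s]
    beta[i, s] = Pr[y_{i+1}..y_n emitted | x_i = s]
    so the dot product of phi[i] and beta[i] is Pr[y] for every i.
    """
    n, m = len(obs), len(init)
    phi, beta = np.zeros((n, m)), np.ones((n, m))
    phi[0] = init * emit[:, obs[0]]
    for i in range(1, n):
        phi[i] = (phi[i - 1] @ trans) * emit[:, obs[i]]
    for i in range(n - 2, -1, -1):
        beta[i] = trans @ (emit[:, obs[i + 1]] * beta[i + 1])
    return phi, beta
```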
Now, if we are considering moving the border at j to position j + 1, we can keep the value of ϕj-r-1, as the active states and probabilities are unchanged at the first j - r - 1 positions. From that point, we can use ϕj-r-1 and run the forward algorithm in the slightly altered HMM with the new border for the next 2r + 1 positions, until we compute the forward vector at position j + r; call it ϕ'j+r. In the remaining positions, the probabilities do not change, so Pr[B(λ', r)] is the dot product of ϕ'j+r and βj+r+1, and we can compute the ball probability for λ' by computing only O(r) columns in the forward algorithm. Each column might require as many as Θ(rL) active states in extreme cases, so the algorithm takes O(krL2) time to analyze each ball. (Note that kL is also an upper bound on the number of active states in any column, but this is likely a coarser upper bound, as k may typically grow with the sequence length, while r is a parameter of the optimization.) This method gives O(nL2 + k2rL2) runtime per iteration of the local search to compute the forward and backward vectors for λ and analyze all 2k neighbours λ'. In practice, this technique sped up our algorithms by a factor of ten for our experiments with transmembrane protein data shown later.
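In code, re-scoring one neighbour then amounts to splicing a short forward recomputation between the saved vectors. Here run_window is a hypothetical helper that runs the forward algorithm of the altered ball model over the affected columns; with the standard backward convention of the sketch above, the splice uses βj+r.

```python
def neighbour_ball_probability(phi, beta, j, r, run_window):
    """Probability of the ball for a neighbour moving border j to j + 1.

    phi, beta: precomputed forward/backward vectors for the current ball
    run_window(start, lo, hi): hypothetical helper running the restricted
    forward algorithm of the altered model from vector `start` over
    columns lo..hi, returning the forward vector at column hi
    """
    start = phi[j - r - 1]                  # prefix columns are unchanged
    mid = run_window(start, j - r, j + r)   # recompute 2r + 1 columns only
    return float(mid @ beta[j + r])         # suffix columns are unchanged
```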
Global optimization for paths with long intervals
While the general problem of finding the most probable ball of labellings is NP-hard, the problem can be solved in polynomial time for a special case. Here, we look for the most probable ball of state paths rather than labellings (which corresponds to each state having its own label). Furthermore, we are only interested in finding the most probable among those balls whose centre paths have each state lasting at least 2r + 1 positions. A path satisfying this requirement is called (2r + 1)-regular. We call a path π weakly (2r + 1)-regular if all its states except the final one last for at least 2r + 1 positions, while the final state lasts at least r + 1 positions. A weakly (2r + 1)-regular ball is a ball centered at such a path. Our dynamic programming algorithm maintains variables F[j, s] corresponding to the probability of the most probable weakly (2r + 1)-regular ball up to position j whose centre has state s at position j. The key idea behind the algorithm is that, by storing the probabilities of weakly (2r + 1)-regular balls, we can ensure that the last state of each path in the ball is the same. This allows us to devise a dynamic programming procedure similar to the Viterbi algorithm.
An optimal weakly (2r + 1)-regular ball up to position j can arise either by extending the final state of some centre path optimal to position j - 1, or by allowing a border of the centre to occur at position j - r, because the last border must occur at least r positions previously. The recursive formula for calculating the probability of the new optimal ball is given by

F[j, s] = max{ maxs'≠s F[j - 2r - 1, s'] · Q(s, s', j), F[j - 1, s] · Pr[xj = s | xj-1 = s] · Pr[yj | xj = s] },

where Q(s, s', j) is the probability that the model emits the sequence yj-2r-1 to yj in only the states s' and s, and with a forced transition between those states in the interval, conditioned on starting the interval in state s'; we write xi..j = s as a shorthand for xi, xi+1, ..., xj = s, ..., s.
In this formula, the first term computes the probability of the best weakly (2r + 1)-regular ball ending at s at position j where there is a border at position j - r, and the second computes the probability of extending the previously optimal path in the same state.
Once we have computed the values of F[j, s] up to j = n - r, we multiply each of these values by the product of transition and emission probabilities corresponding to extending the state s up to position n. This ensures that our centres are (2r + 1)-regular in the strong sense. Finally, we pick the largest probability obtained in this way and reconstruct the actual state path.
The runtime of the algorithm is determined by the time needed to calculate the Q(s, s', j) values. We can calculate these sums in amortized constant time by reusing the values obtained for the previous value of j. This yields a runtime of O(nm2), as for the Viterbi algorithm.
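A sketch of the whole dynamic program follows. It computes the Q values directly rather than in amortized constant time, so it runs in O(nm2r) instead of O(nm2), and boundary conditions are simplified; indexing is adjusted so that Q covers only the emissions after position j - 2r - 1, which avoids double counting with F.

```python
import numpy as np

def Q_direct(trans, emit, obs, sp, s, j, r):
    """Sum, over the 2r + 1 allowed border placements, of the probability
    of emitting y_{j-2r}..y_j in states sp then s, given x_{j-2r-1} = sp."""
    lo = j - 2 * r
    total = 0.0
    for border in range(lo, j + 1):      # first position spent in state s
        p, prev = 1.0, sp                # every ball path has x_{lo-1} = sp
        for t in range(lo, j + 1):
            state = sp if t < border else s
            p *= trans[prev, state] * emit[state, obs[t]]
            prev = state
        total += p
    return total

def best_regular_ball(trans, emit, init, obs, r):
    """Most probable (2r + 1)-regular ball of state paths (a sketch)."""
    n, m = len(obs), len(init)
    w = 2 * r + 1
    F = np.zeros((n, m))
    F[0] = init * emit[:, obs[0]]
    for j in range(1, n):
        for s in range(m):
            # second term: extend the final state of the centre by one
            best = F[j - 1, s] * trans[s, s] * emit[s, obs[j]]
            # first term: a border of the centre occurs at position j - r
            if j - w - 1 >= 0:
                for sp in range(m):
                    if sp != s and F[j - w - 1, sp] > 0.0:
                        cand = F[j - w - 1, sp] * Q_direct(trans, emit,
                                                           obs, sp, s, j, r)
                        best = max(best, cand)
            F[j, s] = best
    # force the final state through position n so the centre is
    # (2r + 1)-regular in the strong sense; traceback is omitted
    best_prob = 0.0
    for s in range(m):
        p = F[n - 1 - r, s]
        for t in range(n - r, n):
            p *= trans[s, s] * emit[s, obs[t]]
        best_prob = max(best_prob, p)
    return best_prob
```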