Expectation/Maximization
Expectation/Maximization (EM) is a general method for maximum likelihood estimation when there is hidden or missing data. The method is guaranteed to find a maximum, but it may only be a local maximum, as is shown here (along the lines of [1]). For a statistical model with parameters θ, observed quantities x, and hidden labels π, the EM goal is to maximize the log likelihood of the observed quantities with respect to θ: log P(x|θ) = log[Σ_{π}P(x,π|θ)]. At each iteration of the estimation process we would like the new log likelihood, log P(x|θ), to be greater than the old, log P(x|θ*). The difference in log likelihoods can be written such that one part is a relative entropy, the positivity of which makes the EM algorithm work:
log P(x|θ) − log P(x|θ*) = Q(θ|θ*) − Q(θ*|θ*) + D[P(π|x,θ*) || P(π|x,θ)],
where D[·||·] is the Kullback–Leibler divergence, or relative entropy, and Q(θ|θ*) = Σ_{π}P(π|x,θ*) log P(x,π|θ). Since the relative entropy term is non-negative, a greater log likelihood results simply by maximizing Q(θ|θ*) with respect to parameters θ. The EM iteration comprises two steps: (1) Expectation – calculate Q(θ|θ*), and (2) Maximization – maximize Q(θ|θ*) with respect to parameters θ.
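The two steps can be illustrated with a minimal sketch for a two-coin mixture model (the data, starting values, and coin labels below are hypothetical): the E-step computes the posterior over the hidden label π for each observation, and the M-step re-estimates the parameters from the posterior-weighted counts.

```python
# EM sketch for a two-coin mixture: each trial is (heads, tails) from one of
# two coins with unknown biases tA, tB; the coin identity is the hidden label.
def em_two_coins(flip_sets, n_iter=50, theta=(0.6, 0.5)):
    tA, tB = theta
    for _ in range(n_iter):
        # E-step: posterior responsibility P(pi = A | x, theta*) per trial
        SA_h = SA_t = SB_h = SB_t = 0.0
        for heads, tails in flip_sets:
            lA = tA ** heads * (1 - tA) ** tails
            lB = tB ** heads * (1 - tB) ** tails
            rA = lA / (lA + lB)
            SA_h += rA * heads; SA_t += rA * tails
            SB_h += (1 - rA) * heads; SB_t += (1 - rA) * tails
        # M-step: maximize Q(theta|theta*) -> MLEs from weighted counts
        tA = SA_h / (SA_h + SA_t)
        tB = SB_h / (SB_h + SB_t)
    return tA, tB
```

Each iteration can only increase the observed-data likelihood, but, as noted above, the fixed point reached may be a local maximum that depends on the starting values.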
For an HMM the hidden labels π correspond to a path of states. Along path π the emission and transition parameters are used to varying degrees. Denote the usage count of transition probability a_{kl} along path π by A_{kl}(π), and that of emission probability e_{kb} by E_{k}(b,π) (following [1] conventions); P(x,π|θ) can then be written:
P(x,π|θ) = ∏_{k = 1}∏_{b}[e_{kb}]^{E_{k}(b,π)} ∏_{k = 0}∏_{l = 1}[a_{kl}]^{A_{kl}(π)}.
Using the above form for P(x,π|θ), and writing A_{kl} for the expected value of A_{kl}(π) and E_{k}(b) for the expected value of E_{k}(b,π), where expectations are taken over paths π distributed as P(π|x,θ*), it is then possible to write Q(θ|θ*) as:
Q(θ|θ*) = Σ_{k = 1}Σ_{b} E_{k}(b) log[e_{kb}] + Σ_{k = 0}Σ_{l = 1} A_{kl} log[a_{kl}].
It then follows (relative entropy positivity argument again) that the maximum likelihood estimators (MLEs) for a_{kl} and e_{kb} are:
a_{kl} = A_{kl}/(Σ_{l'} A_{kl'}) and e_{kb} = E_{k}(b)/(Σ_{b'} E_{k}(b')).
These are the same estimators one would use with actual counts if the state sequence were known. For an HMM (with the Baum–Welch algorithm) they complete the Q-maximization step (M-step), which is obtained with the MLEs for a_{kl} and e_{kb}. The E-step requires that Q be calculated; for the HMM this requires that A_{kl} and E_{k}(b) be calculated. This calculation is done using the forward/backward formalism with rescaling in the next section.
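The M-step itself reduces to row-normalizing the expected count tables; a minimal sketch, where A and E are hypothetical nested dicts of expected transition and emission counts:

```python
# M-step sketch: normalize expected counts into MLE parameters,
# a_kl = A_kl / sum_l' A_kl' and e_kb = E_k(b) / sum_b' E_k(b').
def mstep(A, E):
    a = {k: {l: v / sum(row.values()) for l, v in row.items()}
         for k, row in A.items()}
    e = {k: {b: v / sum(row.values()) for b, v in row.items()}
         for k, row in E.items()}
    return a, e
```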
Emission and Transition Expectations with Rescaling
For an HMM, the probability that transition a_{kl} is used at position i in sequence Z is:
p(S_{i} = k, S_{(i+1)} = l | Z) = p(S_{i} = k, S_{(i+1)} = l, Z)/p(Z), where
p(S_{i} = k, S_{(i+1)} = l, Z) = p(z_{0},...,z_{i}, S_{i} = k) p(S_{(i+1)} = l | S_{i} = k) p(z_{i+1} | S_{(i+1)} = l) p(z_{i+2},...,z_{L−1} | S_{(i+1)} = l).
In terms of the previous notation with forward/backward variables:
p(S_{i} = k, S_{(i+1)} = l | Z) = f_{ki} a_{kl} e_{l(i+1)} b_{l(i+1)}/p(Z).
So the expected number of times a_{kl} is used, A_{kl}, simply sums over all positions i (excluding the last, given the indexing):
A_{kl} = Σ_{i} f_{ki} a_{kl} e_{l(i+1)} b_{l(i+1)}/p(Z).
Similarly, the probability that b is emitted by state k at position i in sequence Z is:
p(z_{i} = b, S_{i} = k | Z) = [ p(z_{0},...,z_{i}, S_{i} = k) p(z_{i+1},...,z_{L−1} | S_{i} = k)/p(Z) ] δ(z_{i}, b),
where a Kronecker delta function is used to enforce emission of b at position i. The expected number of times b is emitted by state k for sequence Z:
E_{k}(b) = Σ_{i} [f_{ki} b_{ki}/p(Z)] δ(z_{i}, b).
In practice, direct computation of the forward and backward variables can run into underflow errors. Rescaling the variables at each step controls this problem. One rescaling approach is to rescale the forward variables such that Σ_{k}F_{ki} = 1 at each position i, where F_{ki} is the rescaled forward variable and B_{ki} is the rescaled backward variable: F_{ki} = Σ_{β} a_{βk}e_{ki}F_{β(i−1)}/s_{i}, and B_{ki} = Σ_{β} a_{kβ}e_{β(i+1)} B_{β(i+1)}/t_{i+1}, where s_{i} and t_{i+1} are the rescaling constants. The expectations on counts for the various emissions and transitions then reduce to:
A_{kl} = Σ_{i} F_{ki} a_{kl} e_{l(i+1)} B_{l(i+1)}/[Σ_{k'}Σ_{l'} F_{k'i} a_{k'l'} e_{l'(i+1)}] and E_{k}(b) = Σ_{i} F_{ki} B_{ki} δ(z_{i}, b).
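A sketch of the rescaled forward/backward computation of the expected counts follows; uniform initial state probabilities are an assumption, `a` and `e` are the transition and emission matrices, and `z` is the observation sequence as a list of symbol indices.

```python
import numpy as np

# Rescaled forward/backward sketch: F and B are the rescaled variables,
# s holds the per-position scale factors, and the returned A and Eb are the
# expected transition counts A_kl and emission counts E_k(b).
def expected_counts(z, a, e):
    K, L = a.shape[0], len(z)
    F = np.zeros((K, L)); B = np.zeros((K, L)); s = np.zeros(L)
    # forward pass, rescaled so that sum_k F[k, i] = 1 at every position i
    F[:, 0] = e[:, z[0]] / K            # uniform start (an assumption)
    s[0] = F[:, 0].sum(); F[:, 0] /= s[0]
    for i in range(1, L):
        F[:, i] = e[:, z[i]] * (a.T @ F[:, i - 1])
        s[i] = F[:, i].sum(); F[:, i] /= s[i]
    # backward pass, reusing the forward scale factors
    B[:, L - 1] = 1.0
    for i in range(L - 2, -1, -1):
        B[:, i] = a @ (e[:, z[i + 1]] * B[:, i + 1]) / s[i + 1]
    A = np.zeros((K, K)); Eb = np.zeros(e.shape)
    for i in range(L - 1):
        # posterior transition probabilities at position i, normalized in place
        num = F[:, i, None] * a * (e[:, z[i + 1]] * B[:, i + 1])[None, :]
        A += num / num.sum()
    for i in range(L):
        Eb[:, z[i]] += F[:, i] * B[:, i]   # posterior state probabilities
    return A, Eb
```

With this scaling, F·B at each position is the posterior state distribution, so the expected counts sum to L (emissions) and L−1 (transitions).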
Gap Interpolating Markov Model (gIMM)
We construct a gIMM using mutual information (MI). Because coding DNA has three reading frames, the mutual information is frame dependent.
Find the position among the 24 prior positions that has the maximum mutual information with frame position 0. Suppose it is p1.
Frame positions: ...012012012012012012012012012...
Prior positions: ...987654321
Next find a position p = 1,...,24 with p ≠ p1 that has the maximum mutual information jointly with frame position 0 and the prior position p1. If it is p2, then MI(x_{0}, x_{p1}; x_{p2}) > MI(x_{0}, x_{p1}; x_{k}) for all k ≠ p1, p2.
Repeat the above procedure until we get an nth-order MI-linkage; this defines an interpolated Markov model that allows for gaps (gIMM). Likewise for frames 1 and 2, and for the reverse frames.
Most importantly, this can also be done for the regions upstream of coding regions, with the reference base just prior to the start codon, in searches for regulatory motifs.
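The greedy MI-linkage selection above can be sketched as follows; the function names are hypothetical, each training sequence is indexed by position, and `target` and `priors` give the frame-0 position and the candidate prior positions.

```python
from collections import Counter
from math import log2

# Plug-in estimate of the mutual information between paired observations
# (x, y), where y may be a tuple of already-chosen prior-position bases.
def mutual_info(pairs):
    n = len(pairs)
    cx = Counter(x for x, _ in pairs)
    cy = Counter(y for _, y in pairs)
    cxy = Counter(pairs)
    return sum(c / n * log2((c / n) / ((cx[x] / n) * (cy[y] / n)))
               for (x, y), c in cxy.items())

# Greedy MI-linkage: at each step add the prior position that, jointly with
# the positions already chosen, has maximum MI with the target position.
def gimm_links(seqs, target, priors, order):
    chosen = []
    for _ in range(order):
        best = max((p for p in priors if p not in chosen),
                   key=lambda p: mutual_info(
                       [(s[target], tuple(s[q] for q in chosen + [p]))
                        for s in seqs]))
        chosen.append(best)
    return chosen
```

The gap structure comes from the fact that the chosen positions need not be contiguous.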
Hash Interpolated MM (hIMM)
Given the current state and the emitted sequence x1,...,xL, compute:
P(xL | x1,...,xL−1) ≈ Count(x1,...,xL)/Count(x1,...,xL−1)
only if Count(x1,...,xL−1) ≥ 400, i.e., only if the parental sequence has sufficient statistical support.
Store P(xL | x1,...,xL−1) in the hash.
Maintain a separate hash for each of the following states: Junk, Intron, Exon0, Exon1, and Exon2.
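Building one such hash from counts, with the ≥ 400 parental-count threshold, might look like the following sketch (the training string and function name are hypothetical; one such table would be built per state):

```python
from collections import defaultdict

MIN_COUNT = 400  # parental-count threshold from the text

# Sketch: count all substrings up to max_len in the training sequence, then
# store P(x_L | x_1..x_{L-1}) = Count(x_1..x_L)/Count(x_1..x_{L-1}) only
# when the parental substring count meets the threshold.
def build_himm_hash(training_seq, max_len):
    counts = defaultdict(int)
    for L in range(1, max_len + 1):
        for i in range(len(training_seq) - L + 1):
            counts[training_seq[i:i + L]] += 1
    table = {}
    for s, c in counts.items():
        parent = s[:-1]
        if parent and counts[parent] >= MIN_COUNT:
            table[s] = c / counts[parent]
    return table
```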
Pseudo Code:
The emission probabilities for states such as (jj), (ee) and (ii) are computed using the hash IMM as:
Given sequence 'x1,...,xL':
If the sequence is defined in the hash:
Return the stored probability
Else:
Recurse on 'x1,...,xL−1'
Minimum string length allowed for:
1. Junk Region = 6
2. Intron Region = 6
3. Exon Region = 8
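The lookup-with-recursion pseudocode, together with the minimum string lengths, can be sketched as below. The recursion is interpreted here as shortening the context from the oldest base, so that the emitted base is retained (an interpretation, since the pseudocode's notation is ambiguous on which end is dropped), and the uniform fallback value is an assumption.

```python
# Minimum string lengths from the text: junk/intron 6, exon 8.
MIN_LEN = {"junk": 6, "intron": 6, "exon": 8}

# Hash-IMM lookup with backoff: if the full string is not in the state's
# hash, drop the oldest base and recurse until the minimum length is reached.
def himm_prob(table, state, seq, fallback=0.25):
    while len(seq) >= MIN_LEN[state]:
        if seq in table:
            return table[seq]
        seq = seq[1:]  # shorten the context, keeping the emitted base
    return fallback    # uniform base probability (an assumption)
```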
UL/LL Data Generation Method
We generated 75,000 random numbers that were Gaussian distributed with mean 48 pA and variance 2.78. We also generated 50,000 random numbers that were Gaussian distributed with mean 72 pA and variance 2.78.
Since the UL & LL state transitions occur as a Poisson process, the dwell times (duration times between state transitions) are exponentially distributed. Therefore, for data set 1, the mean interval between successive UL states is 40 ms and the mean interval between successive LL states is 20 ms.
Cumulative Distribution Function (CDF) of the exponential distribution:
F = 1 − exp(−t/a), where 'a' is the mean.
Hence t = −a·log_{e}(1 − F).
We generate uniformly distributed 'F' between [0, 1). We then compute 't' from the above equation.
We have 50,000 samples per second. Hence a dwell time of 't' ms corresponds to 50·t samples, rounded off to the nearest integer.
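The generation procedure can be sketched end to end; which level mean pairs with which dwell mean is an assumption here, as are the starting state and the function name.

```python
import math
import random

# UL/LL data-generation sketch: Gaussian level samples with exponentially
# distributed dwell times, at 50,000 samples per second.
def gen_blockade(n_samples, seed=0):
    rng = random.Random(seed)
    # (mean current in pA, mean dwell in ms); the pairing is an assumption
    levels = {"UL": (72.0, 20.0), "LL": (48.0, 40.0)}
    sigma = 2.78 ** 0.5          # standard deviation for variance 2.78
    out, state = [], "UL"        # starting state is an assumption
    while len(out) < n_samples:
        mean_pa, mean_ms = levels[state]
        # inverse-CDF sampling of the exponential dwell: t = -a*ln(1 - F)
        t_ms = -mean_ms * math.log(1.0 - rng.random())
        n = max(1, round(50 * t_ms))   # 50 samples per ms
        out.extend(rng.gauss(mean_pa, sigma) for _ in range(n))
        state = "LL" if state == "UL" else "UL"
    return out[:n_samples]
```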
HMM-with-Duration via Cumulant Transition Probabilities
The transition probabilities for (ee) – (ee); (jj) – (jj) and (ii) – (ii) type transitions are computed as:
Prob(j→j | j_length = L) = Prob(j_length ≥ L+1)/Prob(j_length ≥ L)
The transition probabilities for transitions such as (jj) – (je), (ee) – (ej), (ee) – (ei), (ii) – (ie) are computed as:
If the total number of (ej) transitions is 60 and the total number of (ei) transitions is 40, then:
Prob(e→i | e_length = L) = Prob(e_length = L)/Prob(e_length ≥ L) × 40/(40 + 60)
Prob(e→j | e_length = L) = Prob(e_length = L)/Prob(e_length ≥ L) × 60/(40 + 60)
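These duration-dependent transition probabilities can be computed from an empirical length distribution; in the sketch below the pmf dictionary is hypothetical, and the 60/40 branch counts follow the example above.

```python
# Duration-dependent transitions at current run length L, from a length pmf
# p_len (p_len[L] = Prob(e_length = L)) and observed branch counts.
def duration_transitions(p_len, L, n_ej=60, n_ei=40):
    tail_L = sum(p for length, p in p_len.items() if length >= L)
    tail_L1 = sum(p for length, p in p_len.items() if length >= L + 1)
    stay = tail_L1 / tail_L                # (ee)->(ee) self-transition
    leave = p_len.get(L, 0.0) / tail_L     # probability the exon ends here
    return {"ee": stay,
            "ej": leave * n_ej / (n_ej + n_ei),
            "ei": leave * n_ei / (n_ej + n_ei)}
```

Since tail(L) = Prob(e_length = L) + tail(L+1), the three probabilities sum to one at every L, as a transition distribution must.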
Pseudo Code:
Maintain separate counters for the junk, exon and intron regions.
The counters are updated as:
1. The exon counter is set to 2 for a (je) – (ee) transition
2. The exon counter gets incremented by 1 for every (ee) – (ee) transition
Prob(e_{length} ≥ L+1) is computed as:
We have Prob(e_{length} ≥ L+1) = 1  Σ_{i = 1..L} Prob(e_{length} = i)
Hence we generate a list such that for each index 'k > 0', the value 1  Σ_{i = 1..k} Prob(e_{length} = i) is stored
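Precomputing the tail-probability list might look like the following sketch, where `p` is a hypothetical pmf list with p[i] = Prob(e_length = i+1):

```python
# Tail list sketch: tails[k-1] stores 1 - sum_{i=1..k} Prob(e_length = i),
# i.e. Prob(e_length >= k+1), built in one cumulative pass.
def tail_list(p):
    tails, acc = [], 0.0
    for pk in p:
        acc += pk
        tails.append(1.0 - acc)
    return tails
```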
HMM-with-Duration Viterbi Implementation
The transition probabilities for (UL) – (UL); (LL) – (LL) are computed as:
Prob(ul→ul | ul_len = L) = Prob(ul_len ≥ L+1)/Prob(ul_len ≥ L)
The transition probabilities for transitions such as (UL) – (LL), (LL) – (UL) are computed as:
Prob(ul→ll | ul_len = L) = Prob(ul_len = L)/Prob(ul_len ≥ L)
Pseudo Code (see dynamic programming table in Figure 9):
Maintain separate counters for the UL and LL regions
The counters are updated as:
1. The UL counter is set to 1 for a (ll) – (ul) transition
2. The UL counter gets incremented by 1 for every (ul) – (ul) transition
Prob(ul_len ≥ L+1) is computed as:
We have Prob(ul_len ≥ L+1) = 1 − Σ_{i = 1}^{L} Prob(ul_len = i)
Hence we generate a list such that for each index 'k > 0', the value 1 − Σ_{i = 1}^{k} Prob(ul_len = i) is stored
Prob(ul_len = L) is computed as:
We generate a list such that for each index 'k', the value Prob(ul_len = k) is stored
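Putting the pieces together, a duration-explicit (segmental) Viterbi pass over the two-level signal can be sketched as follows. The geometric duration pmf, its mean values, and the uniform initial probabilities are assumptions for the sketch; the level means and variance follow the data-generation section.

```python
import math

# Segmental Viterbi sketch for the UL/LL signal: each DP cell V[t][s] is the
# best log-probability of a path whose last segment is state s ending at
# sample t; segments of the two states must alternate.
def hsmm_viterbi(signal, means={"UL": 72.0, "LL": 48.0}, var=2.78,
                 mean_dur={"UL": 2000, "LL": 1000}, max_dur=None):
    T, states = len(signal), list(means)
    max_dur = max_dur or T
    def log_emit(s, x):   # Gaussian emission log-probability
        return -0.5 * math.log(2 * math.pi * var) - (x - means[s]) ** 2 / (2 * var)
    def log_dur(s, d):    # geometric duration pmf with the given mean
        q = 1.0 - 1.0 / mean_dur[s]
        return (d - 1) * math.log(q) + math.log(1.0 - q)
    # prefix sums of emission log-probs per state, for O(1) segment scores
    pref = {s: [0.0] for s in states}
    for s in states:
        for x in signal:
            pref[s].append(pref[s][-1] + log_emit(s, x))
    NEG = float("-inf")
    V = [{s: NEG for s in states} for _ in range(T + 1)]
    back = [{s: 0 for s in states} for _ in range(T + 1)]
    for t in range(1, T + 1):
        for s in states:
            other = states[1 - states.index(s)]
            for d in range(1, min(max_dur, t) + 1):
                prev = V[t - d][other] if t - d > 0 else math.log(0.5)
                score = prev + log_dur(s, d) + pref[s][t] - pref[s][t - d]
                if score > V[t][s]:
                    V[t][s], back[t][s] = score, d
    # trace back the best segmentation as (state, length) pairs
    s = max(states, key=lambda st: V[T][st])
    t, segs = T, []
    while t > 0:
        d = back[t][s]
        segs.append((s, d))
        t -= d
        s = states[1 - states.index(s)]
    return segs[::-1]
```

In practice the tabulated empirical length distributions described above would replace the geometric stand-in in `log_dur`.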