Expectation/Maximization
Expectation/Maximization (EM) is a general method for maximum likelihood estimation when there is hidden or missing data. The method is guaranteed to converge to a maximum, but possibly only a local maximum, as is shown here (along the lines of [1]). For a statistical model with parameters θ, observed quantities x, and hidden labels π, the EM goal is to maximize the log likelihood of the observed quantities with respect to θ: log P(x|θ) = log[Σπ P(x,π|θ)]. At each iteration of the estimation process we would like the new log likelihood, log P(x|θ), to be greater than the old, log P(x|θ*). The difference in log likelihoods can be written such that one part is a relative entropy, whose positivity is what makes the EM algorithm work:
log P(x|θ) - log P(x|θ*) = Q(θ|θ*) - Q(θ*|θ*) + D[P(π|x,θ*)||P(π|x,θ)],
where D[...||...] is the Kullback-Leibler divergence, or relative entropy, and Q(θ|θ*) = Σπ P(π|x,θ*) log P(x,π|θ). Since the relative entropy is non-negative, a greater log likelihood results simply by maximizing Q(θ|θ*) with respect to the parameters θ. The EM iteration consists of two steps: (1) Expectation – calculate Q(θ|θ*), and (2) Maximization – maximize Q(θ|θ*) with respect to the parameters θ.
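The identity follows in a few lines. For any path π, P(x,π|θ) = P(π|x,θ) P(x|θ), so log P(x|θ) = log P(x,π|θ) - log P(π|x,θ). Taking the expectation of both sides with respect to P(π|x,θ*) (the left side does not depend on π) gives:
log P(x|θ) = Q(θ|θ*) - Σπ P(π|x,θ*) log P(π|x,θ).
Subtracting the same expression evaluated at θ = θ* yields:
log P(x|θ) - log P(x|θ*) = Q(θ|θ*) - Q(θ*|θ*) + Σπ P(π|x,θ*) log[P(π|x,θ*)/P(π|x,θ)],
and the last sum is precisely D[P(π|x,θ*)||P(π|x,θ)] ≥ 0.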
For an HMM the hidden labels π correspond to a path of states. Along path π the emission and transition parameters are used to varying degrees. Denoting the usage counts on transition probability akl along path π by Akl(π), and those on emission probability ekb by Ek(b,π) (following the conventions of [1]), P(x,π|θ) can then be written:
P(x,π|θ) = ∏k ≥ 1 ∏b [ekb]^Ek(b,π) ∏k ≥ 0 ∏l ≥ 1 [akl]^Akl(π),
where k = 0 denotes the non-emitting begin state.
Using the above form for P(x,π|θ), writing Akl for the expected value of Akl(π) over paths π (with respect to P(π|x,θ*)), and Ek(b) for the corresponding expected value of Ek(b,π), Q(θ|θ*) can be written as:
Q(θ|θ*) = Σk ≥ 1 Σb Ek(b) log[ekb] + Σk ≥ 0 Σl ≥ 1 Akl log[akl].
It then follows (relative entropy positivity argument again) that the maximum likelihood estimators (MLEs) for akl and ekb are:
akl = Akl/(Σl' Akl') and ekb = Ek(b)/(Σb' Ek(b')).
When the state sequence is known, these are the standard MLEs based on actual counts. For an HMM trained with the Baum-Welch algorithm, the same formulas, evaluated with the expected counts, complete the Q maximization step (M-step). The E-step requires that Q be calculated, which for the HMM means calculating Akl and Ek(b). This calculation is done using the forward/backward formalism with rescaling, described in the next section.
Emission and Transition Expectations with Rescaling
For an HMM, the probability that transition akl is used at position i in the observed sequence Z = z0,...,z(L-1) is:
p(Si = k, S(i+1) = l|Z) = p(Si = k, S(i+1) = l, Z)/p(Z), where
p(Si = k, S(i+1) = l, Z) = p(z0,...,zi, Si = k) p(S(i+1) = l|Si = k) p(z(i+1)|S(i+1) = l) p(z(i+2),...,z(L-1)|S(i+1) = l).
In terms of the previous notation with forward/backward variables:
p(Si = k, S(i+1) = l|Z) = fki akl el(i+1) bl(i+1)/p(Z),
where el(i+1) is shorthand for el(z(i+1)), the probability that state l emits the symbol observed at position i+1.
So the expected number of times akl is used, Akl, is obtained by summing over all positions i (excluding the last, since the transition leads out of position i):
Akl = Σi fki akl el(i+1) bl(i+1)/p(Z).
Similarly, the probability that b is emitted by state k at position i in sequence Z is:
p(zi = b, Si = k|Z) = [p(z0,...,zi, Si = k) p(z(i+1),...,z(L-1)|Si = k)/p(Z)] δ(zi-b),
where a Kronecker delta function is used to enforce emission of b at position i. The expected number of times b is emitted by state k for sequence Z:
Ek(b) = Σi fki bki δ(zi-b)/p(Z).
In practice, direct computation of the forward and backward variables can run into underflow errors. Rescaling the variables at each step controls this problem. One rescaling approach is to rescale the forward variables such that Σk Fki = 1 at each position i, where Fki is the rescaled forward variable and Bki is the rescaled backward variable: Fki = Σβ aβk eki Fβ(i-1)/si and Bki = Σβ akβ eβ(i+1) Bβ(i+1)/t(i+1), where si and t(i+1) are the rescaling constants. The expected counts for the various emissions and transitions then reduce to:
Akl = Σi Fki akl el(i+1) Bl(i+1)/[Σk'Σl' Fk'i ak'l' el'(i+1)] and Ek(b) = Σi Fki Bki δ(zi-b).
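As a concrete illustration, the following is a minimal Python/NumPy sketch of the rescaled forward/backward computation of Akl and Ek(b) for a single observation sequence; the function name, the integer coding of the symbols, and the explicit initial distribution 'start' are our assumptions, not part of the original formulation:

import numpy as np

def expected_counts(z, a, e, start):
    # Rescaled forward/backward for one integer-coded sequence z.
    # a[k,l]: transition probabilities; e[k,b]: emission probabilities;
    # start[k]: initial state distribution (an assumption; the text uses a
    # begin state 0 instead). Returns expected counts (Akl, Ek(b)).
    L, M = len(z), a.shape[0]
    F = np.zeros((L, M)); B = np.zeros((L, M)); s = np.zeros(L)
    # Forward pass, rescaled so that sum_k F[i,k] = 1 at every position i.
    F[0] = start * e[:, z[0]]
    s[0] = F[0].sum(); F[0] /= s[0]
    for i in range(1, L):
        F[i] = (F[i-1] @ a) * e[:, z[i]]
        s[i] = F[i].sum(); F[i] /= s[i]
    # Backward pass, rescaled with the same constants
    # (taking t(i+1) = s(i+1), a common choice).
    B[L-1] = 1.0
    for i in range(L-2, -1, -1):
        B[i] = a @ (e[:, z[i+1]] * B[i+1]) / s[i+1]
    # Akl = sum_i F_ki a_kl e_l(z_{i+1}) B_l(i+1) / s_{i+1}; with this
    # rescaling, s_{i+1} equals the denominator in the formula above.
    A = np.zeros_like(a)
    for i in range(L - 1):
        A += np.outer(F[i], e[:, z[i+1]] * B[i+1]) * a / s[i+1]
    # Ek(b) = sum over positions i with z_i = b of F_ki B_ki.
    E = np.zeros_like(e)
    for i in range(L):
        E[:, z[i]] += F[i] * B[i]
    return A, E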
Gap Interpolating Markov Model (gIMM)
We construct a gIMM using Mutual Information (MI). Because coding DNA is read in three frames, the mutual information is evaluated in a frame-dependent manner.
Find the position among the 24 prior positions that has the maximum mutual information with frame position 0; suppose it is p1:
Frame positions: ...012012012012012012012|012012...
Prior positions: ...------------------987654321
Next find the position p = 1...24, p ≠ p1, that has the maximum mutual information jointly with frame position 0 and the prior position p1. If it is p2, then MI(x0, xp1; xp2) > MI(x0, xp1; xp) for all other candidates p.
Repeat this procedure until an nth-order MI-linkage is obtained; this defines an interpolated Markov model that allows for gaps (gIMM). The same is done for frames 1 and 2, and for the reverse-strand frames.
Most importantly, this can also be done for the regions upstream of coding regions, with the reference base taken just prior to the start codon, in searches for regulatory motifs.
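A minimal Python sketch of the greedy MI-linkage selection follows, assuming the training windows (25 bases ending at an in-frame reference base) have already been extracted; the function names and the plug-in MI estimator are illustrative, not from the original:

import math
from collections import Counter

def mi(samples, sel, cand):
    # Plug-in mutual information between the joint symbol at the already
    # selected positions 'sel' and the symbol at candidate position 'cand'.
    n = len(samples)
    joint = Counter((tuple(s[i] for i in sel), s[cand]) for s in samples)
    left = Counter(tuple(s[i] for i in sel) for s in samples)
    right = Counter(s[cand] for s in samples)
    return sum((c / n) * math.log((c / n) / ((left[a] / n) * (right[b] / n)))
               for (a, b), c in joint.items())

def gimm_linkage(samples, order, window=25):
    # Greedy MI-linkage: index window-1 is frame position 0 (the reference
    # base); indices 0..window-2 are the 24 prior positions. At each step,
    # add the prior position with maximal MI against the positions chosen so far.
    selected = [window - 1]
    candidates = set(range(window - 1))
    for _ in range(order):
        best = max(candidates, key=lambda p: mi(samples, tuple(selected), p))
        selected.append(best)
        candidates.discard(best)
    return selected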
Hash Interpolated MM (hIMM)
Given the current state and the emitted sequence x1, ..., xL, compute:
P(xL|x1, ..., x(L-1)) ≈ Count(x1, ..., xL)/Count(x1, ..., x(L-1))
only if Count(x1, ..., x(L-1)) ≥ 400, i.e., only if the parental sequence occurs often enough for the estimate to be statistically significant.
Store P(xL|x1, ..., x(L-1)) in the hash.
Maintain a separate hash for each of the following states: Junk, Intron, Exon0, Exon1, and Exon2.
Pseudo Code:
The emission probabilities for states such as (jj), (ee), and (ii) are computed using the hash IMM as:
Given sequence x1, ..., xL:
If P(xL|x1, ..., x(L-1)) is defined in the hash,
return that probability;
else
recurse on the shorter string (back off to a shorter context).
Minimum string length allowed for:
1. Junk region = 6
2. Intron region = 6
3. Exon region = 8
UL/LL Data Generation Method
We generated 75,000 random numbers that were Gaussian distributed with mean 48 pA and variance 2.78, and 50,000 random numbers that were Gaussian distributed with mean 72 pA and variance 2.78.
Since the UL and LL state occurrences are Poisson distributed, the transition times (duration times between state transitions) are exponentially distributed. For data set 1, the mean interval between successive UL states is therefore 40 ms and the mean interval between successive LL states is 20 ms.
Cumulative Distribution Function (CDF) of the exponential distribution:
F = 1 - exp(-t/a), where 'a' is the mean.
Hence t = -a·loge(1 - F).
We generate uniformly distributed 'F' on [0, 1) and then compute 't' from the above equation.
We have 50,000 samples per second. Hence a dwell time of 't' ms corresponds to 50t samples, rounded to the nearest integer.
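A sketch of the generation procedure in Python/NumPy, reading the 40 ms and 20 ms figures as the mean dwell times of the UL and LL levels respectively (an interpretation), with the level means and variance from above:

import numpy as np

rng = np.random.default_rng(0)
SAMPLES_PER_MS = 50            # 50,000 samples per second

def dwell_samples(mean_ms, n):
    # Inverse-CDF sampling of exponential dwell times: t = -a*loge(1 - F),
    # F uniform on [0, 1); dwell times converted to integer sample counts.
    F = rng.uniform(0.0, 1.0, n)
    t_ms = -mean_ms * np.log(1.0 - F)
    return np.rint(SAMPLES_PER_MS * t_ms).astype(int)

def make_signal(n_events=1000):
    # Alternating UL/LL two-level signal: Gaussian samples about 72 pA (UL)
    # and 48 pA (LL) with variance 2.78; mean dwells 40 ms (UL), 20 ms (LL).
    sd = np.sqrt(2.78)
    chunks = []
    for n_ul, n_ll in zip(dwell_samples(40.0, n_events),
                          dwell_samples(20.0, n_events)):
        chunks.append(rng.normal(72.0, sd, n_ul))
        chunks.append(rng.normal(48.0, sd, n_ll))
    return np.concatenate(chunks)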
HMM-with-Duration via Cumulant Transition Probabilities
The transition probabilities for (ee) – (ee); (jj) – (jj) and (ii) – (ii) type transitions are computed as:
Prob(jj|jlength = L) = Prob(jlength ≥ L+1)/Prob(jlength ≥ L)
The transition probabilities for transitions such as (jj) – (je), (ee) – (ej), (ee) – (ei), (ii) – (ie) are computed as:
If the total number of (ej) transitions is 60 and the total number of (ei) transitions is 40, then:
Prob(ei|elength = L) = Prob(elength = L)/Prob(elength ≥ L) × 40/(40 + 60)
Prob(ej|elength = L) = Prob(elength = L)/Prob(elength ≥ L) × 60/(40 + 60)
Pseudo Code:
Maintain separate counters for the junk, exon and intron regions.
The counters are updated as:
1. The exon counter is set to 2 for a (je) – (ee) transition.
2. The exon counter gets incremented by 1 for every (ee) – (ee) transition.
Prob(elength ≥ L+1) is computed as:
We have Prob(elength ≥ L+1) = 1 - Σi = 1..L Prob(elength = i)
Hence we generate a list such that for each index 'k > 0', the value 1 - Σi = 1..k Prob(elength = i) is stored
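In Python/NumPy, these stored lists and the resulting duration-dependent transition probabilities might be set up as follows (a sketch; the function names and the 'branch' argument, an empirical branching fraction such as 40/(40+60), are ours):

import numpy as np

def length_tables(lengths, max_len):
    # Empirical length distribution for one region type, plus the stored list
    # above: pmf[k] = Prob(length = k) and surv[k] = 1 - sum_{i=1..k} pmf[i]
    #              = Prob(length >= k+1).
    pmf = np.bincount(lengths, minlength=max_len + 1) / len(lengths)
    surv = 1.0 - np.cumsum(pmf)
    return pmf, surv

def p_self(surv, L):
    # Prob(stay in region | current length L) = Prob(len >= L+1)/Prob(len >= L).
    return surv[L] / surv[L - 1]

def p_exit(pmf, surv, L, branch):
    # Prob(exit to a given successor | current length L), weighted by the
    # branching fraction, e.g. 40/(40+60) for (ei) versus (ej).
    return pmf[L] / surv[L - 1] * branch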
HMM-with-Duration Viterbi Implementation
The transition probabilities for (UL) – (UL); (LL) – (LL) are computed as:
Prob(ul|ul_len = L) = Prob(ul_len ≥ L+1)/Prob(ul_len ≥ L)
The transition probabilities for transitions such as (UL) – (LL), (LL) – (UL) are computed as:
Prob(ul-ll|ul_len = L) = Prob(ul_len = L)/Prob(ul_len ≥ L)
Pseudo Code (see dynamic programming table in Figure 9):
Maintain separate counters for the UL and LL regions
The counters are updated as:
1. The UL counter is set to 1 for a (ll) – (ul) transition.
2. The UL counter gets incremented by 1 for every (ul) – (ul) transition.
Prob(ul_len ≥ L+1) is computed as:
We have Prob(ul_len ≥ L+1) = 1 - Σi = 1..L Prob(ul_len = i).
Hence we generate a list such that for each index 'k > 0', the value 1 - Σi = 1..k Prob(ul_len = i) is stored.
Prob(ul_len = L) is computed as:
We generate a list such that for each index 'k', the value Prob(ul_len = k) is stored
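A compact Python sketch of how the duration-dependent Viterbi can be organized, folding the run-length counter into the dynamic-programming state (the run-length cap dmax, the Gaussian emission model with common variance, and all names are our simplifications; the actual table layout is shown in Figure 9):

import numpy as np

def duration_viterbi(z, means, sd, pmf, surv, dmax):
    # Viterbi over (state, run length) pairs for the two-level UL/LL model,
    # with the duration-dependent transition probabilities defined above.
    # z: current samples; means: level means, e.g. np.array([72.0, 48.0]);
    # pmf[s], surv[s]: per-state length tables; dmax: run-length cap.
    z = np.asarray(z, dtype=float)
    T, S = len(z), len(means)
    # Log-emissions; constant terms drop out since sd is common to both states.
    ll = -0.5 * ((z[:, None] - np.asarray(means)[None, :]) / sd) ** 2
    NEG = -np.inf
    V = np.full((T, S, dmax + 1), NEG)
    prev_d = np.ones((T, S), dtype=int)    # run length left behind on a switch
    V[0, :, 1] = ll[0]
    for i in range(1, T):
        for s in range(S):
            o = 1 - s
            # Stay: (s, d) -> (s, d+1) with prob surv[s][d]/surv[s][d-1].
            for d in range(1, dmax):
                if V[i-1, s, d] > NEG and surv[s][d] > 0:
                    V[i, s, d+1] = (V[i-1, s, d]
                                    + np.log(surv[s][d] / surv[s][d-1])
                                    + ll[i, s])
            # Switch: (o, d) -> (s, 1) with prob pmf[o][d]/surv[o][d-1].
            best, arg = NEG, 1
            for d in range(1, dmax + 1):
                if V[i-1, o, d] > NEG and pmf[o][d] > 0:
                    c = V[i-1, o, d] + np.log(pmf[o][d] / surv[o][d-1])
                    if c > best:
                        best, arg = c, d
            V[i, s, 1] = best + ll[i, s]
            prev_d[i, s] = arg
    # Traceback: a run shrinks deterministically; a switch uses prev_d.
    path = np.zeros(T, dtype=int)
    s, d = np.unravel_index(np.argmax(V[T-1]), (S, dmax + 1))
    for i in range(T - 1, 0, -1):
        path[i] = s
        if d > 1:
            d -= 1
        else:
            d = prev_d[i, s]
            s = 1 - s
    path[0] = s
    return path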