Volume 7 Supplement 2

## Third Annual MCBIOS Conference. Bioinformatics: A Calculated Discovery

# Hidden Markov Model Variants and their Application

- Stephen Winters-Hilt
^{1, 2}Email author

**7(Suppl 2)**:S14

https://doi.org/10.1186/1471-2105-7-S2-S14

© Winters-Hilt; licensee BioMed Central Ltd. 2006

**Published: **26 September 2006

## Abstract

Markov statistical methods may make it possible to develop an unsupervised learning process that can automatically identify genomic structure in prokaryotes in a comprehensive way. This approach is based on mutual information, probabilistic measures, hidden Markov models, and other purely statistical inputs. This approach also provides a uniquely common ground for comparative prokaryotic genomics. The approach is an on-going effort by its nature, as a multi-pass learning process, where each round is more informed than the last, and thereby allows a shift to the more powerful methods available for supervised learning at each iteration. It is envisaged that this "bootstrap" learning process will also be useful as a knowledge discovery tool. For such an *ab initio* prokaryotic gene-finder to work, however, it needs a mechanism to identify critical motif structure, such as those around the start of coding or start of transcription (and then, hopefully more).

For eukaryotes, even with better start-of-coding identification, parsing of eukaryotic coding regions by the HMM is still limited by the HMM's single gene assumption, as evidenced by the poor performance in alternatively spliced regions. To address these complications an approach is described to expand the states in a eukaryotic gene-predictor HMM, to operate with two layers of DNA parsing. This extension from the single layer gene prediction parse is indicated after preliminary analysis of the *C. elegans* alt-splice statistics. State profiles have made use of a novel hash-interpolating MM (hIMM) method. A new implementation for an HMM-with-Duration is also described, with far-reaching application to gene-structure identification and analysis of channel current blockade data.

## Background

### Motivations for seeking MM and HMM Variants

Part of the problem in developing a reliable gene predictor is having reliable, biologist-verified, training data. In its simplest form, the training data needed is the raw genomic DNA together with a minimal annotation that labels coding regions. For the prokaryotic gene prediction much of the problem with obtaining high-confidence training data can be circumvented by using a bootstrap gene-prediction approach. This is possible in prokaryotes because of their simpler and more compact genomic structure: simpler in that long ORFs (open reading frames) are usually long genes, and compact in that motif searches upstream usually range over hundreds of bases rather than thousands (as in human).

In work on prokaryotic gene prediction (focusing on *V. cholerae*), software (*smORF*) was developed for an augmented ORF (open reading frame) characterization. Using that software, with a simple start-of-coding heuristic, it was possible to establish very good gene prediction for ORFs of length greater than 500 nucleotides. The smORF gene identification was used in a bootstrap gene-annotation process (where no initial training data was provided). The strength of the gene identification was then improved by use of a software tool that performed gap-interpolating-Markov-modeling (gIMM). When applied to the identified coding regions (most of the >500 length ORFs), six gIMMs were used (one for each frame of the codons, with forward and backward read senses). If poorly gIMM-scoring coding regions were rejected, performance improved. Failure analysis clearly indicates that start-codon modeling is needed in order to strengthen predictions. One of the benefits of the gIMM software is its gap-interpolating generalization. This permits motifs to be identified, particularly those sharing the same approximate alignment with the start-of-coding region. Using the bootstrap-identified genes from the smORF-based gene-prediction (including errors) as a train set permitted an unsupervised search for upstream regulatory structure. The classic Shine-Dalgarno sequence (the ribosome binding site) was the strongest signal in the 30-base window upstream from the start codon.

In preliminary work, a Hidden Markov Model based gene predictor was trained and tested on the *C. elegans* genome. Splice signal motifs and intron/exon statistical profiles were extracted, from which a gene predictor was constructed. To boost detection of exons, and indirectly introns, EST information was included. As with the prokaryotic research, further work entails identification of transcription regulation fingerprints, such as the promoter motifs, to clarify starts on transcription, etc. Even with better start-of-coding identification, however, parsing of eukaryotic coding regions by the HMM is still limited by the HMM's single gene assumption, as evidenced by the typically poor gene-prediction performance in alternatively spliced regions. To address these complications an approach is described to expand the states in the gene-predictor HMM to operate with two layers of DNA parsing. This extension from the single layer gene prediction parse was indicated after preliminary analysis of the *C. elegans* alt-splice statistics. State profiles were implemented using a hash-interpolating MM (hIMM) in this process (see Methods).

### Markov Chains

Key property of a Markov chain:

P(x_{i} | x_{i-1}, ..., x_{1}) = P (x_{i} | x_{i-1}) =
,

P(x) = P(x_{L}, x_{L-1} ..., x_{1}) = P(x_{1}) ∏_{i = 2..L}

C_{y} is the count of events y, and C_{xy} is the count of simultaneous events x and y, T_{y} is the count of strings of length one, and T_{xy} is the count of strings of length two:

Since T_{xy}+1 = T_{y} → T_{xy} ≅ T_{y} (sequential data sample property if one long training block),
= C_{xy}/C_{y} = C_{xy}/Σ_{x}C_{xy}, so C_{xy} is complete info for determining transition probabilities.

### Viterbi Path

The recursive algorithm for the most likely state path given an observed sequence (the Viterbi algorithm) is expressed in terms of v_{ki}, the probability of the most probable path that ends with observation Z_{i} = z_{i}, and state S_{i} = k. The recursive relation is v_{ki} = max_{n}{e_{ki}a_{nk}v_{n(i-1)}}, where the max_{n}{...} operation returns the maximum value of the argument over different values of index n, and the boundary condition on the recursion is v_{k0} = e_{k0}p_{k}. The a_{kl} are the transition probabilities P(X_{i} = l|X_{i-1} = k) to go from state k to state l. The e_{kb} are the emission probabilities P(S_{i} = b|X_{i} = k) while in state k. The Viterbi path labelings are then recursively defined by p(S_{i}|S_{(i+1)} = n) = argmax_{k}{v_{ki}a_{kn}}, where the argmax_{n}{...} operation returns the index n with maximum value of the argument. The evaluation of sequence probability (and its Viterbi labeling) take the emission and transition probabilities as a given. Estimates on those emission and transition probabilities are usually obtained by the Expectation/Maximization (EM) algorithm (commonly referred to as the Baum-Welch algorithm in the context of HMMs [1]).

Further details on the information measures used, such as mutual information, and on Expectation/Maximization (EM), are placed in Appendix A.

### EVA Projection

The HMM method is based on a stationary set of emission and transition probabilities. Emission broadening, via amplification of the emission state variances, is a filtering heuristic that leads to level-projection that strongly preserves transition times between major levels (see [13] for further details). Results from the emission variance amplification (EVA) emission broadening method are described in [13] (with varying amounts of variance amplification). This approach does not require the user to define the number of levels (classes). This is a major advantage compared to existing tools that require the user to determine the levels (classes) and perform a state projection. This allows kinetic features to be extracted with a "simple" FSA (Finite State Automaton) that requires minimal tuning.

## Results

*Ab initio* prokaryotic gene-finding

*Ab initio*gene-finding begins with identification of codon structure on the basis of a simple, genome-wide, mutual information analysis. The idea is to examine the genome's two-base pairings, so first consider pairings where the bases are adjacent in the genome, i.e., a sample set consisting of all dinucleotides found upon moving a window of two-base width across the genome. The counts on those pairings can be used to obtain a mutual information between the first base and second base. The next iteration is to take two-base-pairings separated by one base (gap = 1), etc., with mutual information results as shown in Figure 4 (applied to

*V. Cholerae*). This overall method is generalized further to define a gap-interpolating Markov model (details in Methods).

*Ab initio*gene-finding can then identify the stop codons and, thus, ORFs (see Figure 5). A generalization to codon void regions, then, leads to recognition of different, overlapping, potential gene regions (with two orientations). A tool has been developed to identify the genomic ORF topology in this generalized way, named "

*smORF*" (see Figure 6 and Table 1 for further definitions and results). This genome-topology tool also clearly shows differences between bacteria (a possible "fingerprint" tool) (see Figure 6 and Figure 7).

Topology-Index Histogram results are shown for *Vibrio Cholerae*. The N = 400 set correspond to indexing with ORFs greater than 400 bases long. Similarly, the N = 200 set correspond to indexing with ORFs greater than 200 bases long.

N = 400 | N = 200 | ||
---|---|---|---|

Score | Count | Score | Count |

0 | 1653287 | 0 | 846889 |

1000 | 649241 | 1000 | 1017629 |

2000 | 400 | 2000 | 1554 |

10000 | 630232 | 10000 | 954394 |

11000 | 27986 | 11000 | 138817 |

20000 | 3 | 12000 | 161 |

20000 | 1700 | ||

21000 | 5 |

*smORF* offers information about open reading frames (ORFs), and tallies information about other such codon void regions (an ORF is a void in three codons: TAA, TAG, TGA). This allows for a more informed selection process when sampling from a genome, such that non-overlapping gene starts can be cleanly and unambiguously sampled. The goal is, initially, to identify key gene structures (e.g., stop codons, etc.) and use only the highest confidence examples to train profilers. Once this is done, Markov models (MMs) can be constructed on the suspected start/stop regions and coding/noncoding regions. The algorithm then iterates again, informed with the MM information, and partly relaxes the high fidelity sampling restrictions (essentially, the minimum allowed ORF length is made smaller). A crude gene-finder can be constructed on the high fidelity ORFs by use of a very simple heuristic: scan from the start of an ORF and stop at the first in-frame "atg". This analysis was applied to the *V. cholerae* genome (Chr. I). 1253 high fidelity ORFs were identified out of 2775 known genes. This first-"atg" heuristic provided a gene prediction accuracy of 1154/1253 (92.1% of predictions of gene regions were exactly correct). If small shifts are allowed in the predicted position of the start-codon relative to the first-"atg" (within 25 bases on either side), then prediction accuracy improves to 1250/1253 (99.8%). This actually elucidates a key piece of information needed to improve such a prokaryotic gene-finder. Basically, information is needed to help identify the correct start codon in a 50 base window. Such information exists in the form of DNA motifs corresponding to the binding footprint of regulatory biomolecules (that play a role in transcriptional or translational control).

For an *ab initio* gene-finder to work, it will need to have a mechanism to identify critical motif structure, such as those around the start of coding or start of transcription (and then, hopefully more). In essence, a Markov model is needed with greater "reach" – the gap-interpolating Markov model (gIMM) was developed for this purpose, and is described in the Methods. To set up an *ab initio* motif discovery windows around the (1253) purported start of genes were sampled. The windows ranged from the 40 bases preceding the start codon to the first 20 bases of coding (a 60 base window). Some of the windows represent noise, as the first pass of the bootstrap feature extraction has only 92.1% accuracy. Even so, the gIMM is able to clearly discern the Shine-Dalgarno consensus sequence. With the critical motifs already discerned, further iterations of the MM construction, possibly as an HMM now, will undoubtedly aid in improving performance.

### Alternate-Splice Labeling Scheme for Eukaryotic HMM

The labeling scheme assigns a label to each base in the sequence. Exon frame position 0 bases have label 0 if in the forward read or A if in the reverse. Likewise, exon frame position 1 bases have label 1 or B, and exon frame position 2 bases have label 2 or C. Introns, for purposes of the analysis here, are represented as 'i' or 'I' for intron on the forward or reverse strand (in the HMM implementation the intron states are actually split out in order to maintain proper frame on re-entry to coding regions via state transition restrictions). Junk DNA is labeled 'j'.

#### Track Label Information

*C. elegans*genes has alternate splicing. Similarly, Table 3 shows the transition counts between labels.

(a) shows the Track 1 Label Counts, and (b) shows the Track 2 Label Counts.

(a) | ||
---|---|---|

0 : 571,187 | A : 518,431 | I : 1,634,653 |

1 : 571,187 | B : 518,431 | i : 1,779,392 |

2 : 571,187 | C : 518,431 | j : 7,336,733 |

(b) | ||

0 : 21,599 | A : 64,475 | I : 325,471 |

1 : 21,599 | B : 64,471 | i : 81,289 |

2 : 21,599 | C : 64,467 | j : 13,354,661 |

(a) shows the Track 1 Transition Counts, and (b) shows the Track 2 Transition Counts.

(a) | ||
---|---|---|

01 : 569,483 | BA : 516,874 | II : 1,628,572 |

12 : 569,490 | CB : 516,868 | ii : 1,772,795 |

20 : 566,732 | AC : 514,309 | jj : 7,334,177 |

0i, i1 : 1,704 | IA, BI : 1,557 | j0, 2j : 1,257 |

1i, i2 : 1,696 | IB, CI : 1,563 | Aj, jC : 1,161 |

2i, i0 : 3,197 | IC, AI : 2,961 | |

(b) | ||

01 : 21,554 | BA : 64,296 | II : 324,751 |

12 : 21,548 | CB : 64,275 | ii : 81,073 |

20 : 21,441 | AC : 63,986 | jj : 13,354,350 |

0i, i1 : 45 | IA BI : 175 | j0, 2j : 38 |

1i, i2 : 51 | IB, CI : 192 | Aj, jC : 136 |

2i, i0 : 120 | IC, AI : 353 |

#### V-Labels and V-Transitions

*C. Briggsae*). Ignoring those V-transitions with negligible counts as not allowed transitions, as well as those implicitly describing no alternative splicing locally (an overlap with 'j' in either track), reduces to an active V-transition set consisting of 86 transitions between V-labels. This is a tractable number of states to manage in the HMM analysis, suggesting a simple and direct approach to alternative splice HMM analysis. The number of V-transitions, whether counting all 263, or the 86 'active' ones, is still much smaller than the 72*72 = 5,184 transitions that would have been surmised for track annotations that were entirely independent.

V-label Counts.

V00, V11, V22: 17,839 | VA0, VB2, VC1: 0 |
---|---|

V01, V12, V20: 3 | VA1, VB0, VC2: 0 |

V02, V10, V21: 58 | VA2, VB1, VC0: 829 |

V0A, V1C, V2B: 741 | VAA, VBB, VCC: 16,169 |

V0B, V1A, V2C: 957 | VAB, VBC, VCA: 0 |

V0C, V1B, V2A: 5164 | VAC, VBA, VCB: 54 |

### HMM-with-Duration with application to channel current signal analysis

#### Synthetic Data Generation

The HMM-with-Duration implementation (described in the Methods) has been tested in terms of its performance at parsing synthetic blockade signals with attributes and visual appearance almost indiscernible from the experimentally observed blockade data. The benefit of the synthetic data is that the mean lifetimes of the blockade levels can range over an exhaustive set of possibilities for thorough testing of the HMM-with-Duration. The synthetic data was designed to have two levels, with lifetime in each level determined by a governing distribution (Poisson and Gaussian distributions with a range of mean values were considered). In Figure 8, the data was generated with Poisson distributed lifetimes and parsed by an HMM's with and without duration, with length distribution generated from a Poisson (Figure 8a) or a Gaussian (Figure 8b) Distributions. The results clearly demonstrate the superior performance of the HMM-with-duration over its simpler, HMM without Duration, formulation. With use of the EVA-projection method described in the Background (and in [13]) this affords a robust means to obtain kinetic feature extraction. The emission broadening introduced in the HMM w/wo duration comparison in Figure 8a and 8b, is precisely what occurs with EVA-projection, with similar susceptibility to failure due to over-representation of brief blockade lifetimes in the HMM without Duration parse. Using HMM with duration this problem is resolved, brief toggles between states are now penalized according to the statistical rarity of the comparatively brief lifetime events (see Fig.'s 8a and 8b). Resolving this problem is critical for accurate kinetic feature extraction, and the results suggests that this problem can be elegantly solved with a pairing of the HMM-with-Duration stabilization with EVA-projection.

## Discussion

### aTOPO

As already mentioned, the gIMM tool identifies statistically anomalous motifs (usually restricted to some zone upstream). Another tool being developed, "aTOPO" (for motifs with Anomalous TOPOlogy), uses the information from the upstream zones swept with the gIMM tool to annotate upstream motifs, and offers further information on whether an anomalous motif is biologically significant by looking at is positional distribution and its correlations to other motifs. This work is still preliminary, so this will not be discussed further.

## Methods

### Expectation/Maximization

Expectation/Maximization, EM, is a general method to estimate the maximum likelihood when there is hidden or missing data. The method is guaranteed to find a maximum, but it may only be a local maximum, as is shown here (along the lines of [1]). For a statistical model with parameters θ, observed quantities x, and hidden labels π, the EM goal is to maximize the log likelihood of the observed quantities with respect to θ : log P(x|θ) = log[Σ_{π}P(x,π|θ)]. At each iteration of the estimation process we would like the new log likelihood, P(x|θ), to be greater than the old, P(x|θ*). The difference in log likelihoods can be written such that one part is a relative entropy, the positivity of which makes the EM algorithm work:

log P(x|θ) - log P(x|θ*) = Q(θ|θ*) - Q(θ*|θ*) + D[P(π|x,θ*)||P(π|x,θ)],

where D[...||...] is the Kullback-Leibler divergence, or relative entropy, and Q(θ|θ*) = Σ_{π}P(x,π|θ). Now a greater log likelihood results simply by maximizing Q(θ|θ*) with respect to parameters θ. The EM iteration is comprised of two steps: (1) Estimation – calculate Q(θ|θ*), and (2) Maximization – maximize Q(θ|θ*) with respect to parameters θ.

For an HMM the hidden labels π correspond to a path of states. Along path π the emission and transition parameters will be used to varying degrees. Along path π, denote usage counts on transition probability a_{kl} by A_{kl}(π) and those on emission probabilities e_{kb} by E_{k}(b,π) (following [1] conventions), P(x,π|θ) can then be written:

P(x,π|θ) = ∏_{k = 0}∏_{b}[e_{kb}]^E_{k}(b,π) ∏_{k = 0}∏_{l = 1}[a_{kl}]^A_{kl}(π).

Using the above form for P(x,π|θ), A_{kl} for the expected value of A_{kl}(π) on path π, and E_{k}(b) for the expected value of E_{k}(b,π) on path π, it is then possible to write Q(θ|θ*) as:

Q(θ|θ*) = Σ_{k = 1}Σ_{b} E_{k}(b) log[e_{kb}] + Σ_{k = 0}Σ_{l = 1} A_{kl} log[a_{kl}].

It then follows (relative entropy positivity argument again) that the maximum likelihood estimators (MLEs) for a_{kl} and e_{kb} are:

a_{kl} = A_{kl} /(Σ_{l} A_{kl}) and e_{kb} = E_{k}(b)/(Σ_{b} E_{k}(b)).

The latter estimation is for when the state sequence is known. For an HMM (with Baum-Welch algorithm) it completes the Q maximization step (M-step), which is obtained with the MLEs for a_{kl} and e_{kb}. The E-step requires that Q be calculated, for the HMM this requires that A_{kl} and E_{k}(b) be calculated. This calculation is done using the forward/backward formalism with rescaling in the next section.

### Emission and Transition Expectations with Rescaling

For an HMM, the probability that transition a_{kl} is used at position i in sequence Z is:

p(S_{i} = k, S_{(i+1)} = l|X) = p(S_{i} = k, S_{(i+1)} = l, Z)/p(Z), where

p(S_{i} = k,S_{(i+1)} = l,Z) = p(z_{0},...,z_{i},S_{i} = k)p(S_{(i+1)} = l|S_{i} = k)p(z_{i+1}|S_{(i+1)} = l)p(z_{i+2},...,z_{L-1}|S_{(i+1)} = l).

In terms of the previous notation with forward/backward variables:

p(S_{i} = k, S_{(i+1)} = l|X) = f_{ki} a_{kl} e_{l(i+1)} b_{l(i+1)} /p(Z),

So the expected number of times a_{kl} is used, A_{kl}, simply sums over all positions i (except last with indexing):

A_{kl} = Σ_{i} f_{ki} a_{kl} e_{l(i+1)}b_{l(i+1)}/p(Z),

Similarly, the probability that b is emitted by state k at position i in sequence Z:

p(z_{i} = b, S_{i} = k|X) = [ p(z_{0},...,z_{i},S_{i} = k) p(z_{i+1},...,z_{L-1}|S_{i} = k)/p(Z) ] δ(z_{i}-b),

where a Kronecker delta function is used to enforce emission of b at position i. The expected number of times b is emitted by state k for sequence Z:

E_{k}(b) = Σ_{i} f_{ki} b_{ki}/p(Z) δ(z_{i}-b),

In practice, direct computation of the forward and backward variables can run into underflow errors. Rescaling variables at each step can control this problem. One rescaling approach is to rescale the forward variables such that Σ_{i}F_{ki} = 1, where F_{ki} is the rescaled forward variable, and B_{ki} is the rescaled backward variable: F_{ki} = a_{βk}e_{ki}F_{β(i-1)}/s_{i}, and B_{ki} = a_{kβ}e_{β(i+1)} B_{β(i+1)}/t_{i+1}, where s_{i} and t_{i+1}, are the rescaling constants. The expectation on counts for the various emissions and transitions then reduce to:

A_{kl} = Σ_{i} F_{ki} a_{kl} e_{l(i+1)} B_{l(i+1)}/[Σ_{k}Σ_{l} F_{ki} a_{kl} e_{l(i+1)}] and E_{k}(b) = Σ_{i} F_{ki} B_{ki} δ(z_{i}-b).

### Gap Interpolating Markov Model (gIMM)

We construct a gIMM using Mutual Information (MI). Based on the three reading frames of the DNA, the Mutual Information is frame dependent.

Frame positions: ...012012012012012012012|012012...

Prior positions: ...------------------987654321

### Hash Interpolated MM (hIMM)

P(xL|x1,...., xL-1) ≈ Count(x1,...., xL)/Count(x1,...., xL-1)

Iff Count(x1,...., xL-1) ≥ 400 i.e. only if the parental sequence shows statistical significance

Pseudo Code:

Given sequence 'x1,...., xL'

If sequence is defined;

Return Probability

Else

Recurse on 'x1,...., xL-1'

### UL/LL Data Generation Method

We generated 75000 random numbers that were Gaussian distributed with mean 48 pA and variance 2.78. We also generated 50000 random numbers that were Gaussian distributed with mean 72 pA and variance 2.78.

Since the dwell times of the UL & LL are Poisson distributed, the transition times (duration times between state transitions) are assumed to be exponentially distributed. Therefore for data set 1, the mean interval between successive UL states is 40 ms and the mean interval between successive LL states is 20 ms.

Cumulant Distribution Function (CDF) of exponential distribution:

F = 1 - exp (-t/a) where 'a' is the mean

Hence t = a.log_{e}(1 - F)

We generate uniformly distributed 'F' between [0, 1). We then compute 't' from the above equation.

We have 50,000 samples per second. Hence a dwell time of 't' ms corresponds to 50.t samples rounded off to the nearest integer.

### HMM-with-Duration via Cumulant Transition Probabilities

Prob(jj|j_{length} = L) = Prob(j_{length} ≥ L+1)/Prob(j_{length} ≥ L)

If the total number of (ej) transitions is 60 and the total number of (ei) transitions is 40, then:

Prob(ei|e_{length} = L) = Prob(e_{length} = L)/Prob(e_{length} ≥ L) × 40/(40 + 60)

Prob(ej|e_{length} = L) = Prob(e_{length} = L)/Prob(e_{length} ≥ L) × 60/(40 + 60)

Pseudo Code:

- 1.
The exon counter is set to 2 for a (je) – (ee) transition

- 2.
The exon counter gets incremented by 1 for every (ee) – (ee) transition

We have Prob(e_{length} ≥ L+1) = 1 - Σ_{i = 1..L} Prob(e_{length} = i)

Hence we generate a list such that for each index 'k > 0', the value 1 - Σ_{i = 1..k} Prob(e_{length} = i) is stored

### HMM-with-Duration Viterbi Implementation

Prob(ul|ul_len = L) = Prob(ul_len ≥ L+1)/Prob(ul_len ≥ L)

Prob(ul-ll|ul_len = L) = Prob(ul_len = L)/Prob(ul_len ≥ L)

- 1.
The UL counter is set to 1 for a (ll) – (ul) transition

- 2.
The UL counter gets incremented by 1 for every (ul) – (ul) transition

We have Prob(ul_len ≥ L+1) = 1 - Σ_{i = 1} ^{L} Prob(ul_len = i)

Hence we generate a list such that for each index 'k > 0', the value 1 - Σ_{i = 1} ^{k}Prob(ul_len = i) is stored

We generate a list such that for each index 'k', the value Prob(ul_len = k) is stored

## Appendix A

### Information measures

The fundamental information measures are Shannon entropy, mutual information, and relative entropy (also known as the Kullback-Leibler divergence or distance). Shannon entropy: σ = -Σ_{x}p(x)log(p(x)), is a measure of the information in distribution p(x). Mutual Information: μ = Σ_{x}Σ_{y}p(xy)log(p(xy)/p(x)p(y)), is a measure of information one random variable has about another random variable. Relative Entropy (Kullback-Leibler distance): ρ = Σ_{x} p(x) log(p(x)/q(x)), is a measure of distance between two probability distributions. Mutual information is a special case of relative entropy between a joint probability (two-component in simplest form) and the product of component probabilities.

### Hidden Markov Models with possible Side Information

Hidden Markov Models [1] provide a statistical framework for sequences of observations obeying stationary Markov statistics. The "hidden" part of the HMM consists of the labelings, s_{i}, for each observation, z_{i}, where the index i labels the observation. The stationary statistics for a first order HMM are described in terms of emission probabilities, e_{ni} = p(Z_{j} = z_{i}|S_{j} = n), and transition probabilities, a_{nm} = p(S_{j} = m|S_{(j-1)} = n). (The indexing on j is left in for clarity on the transition probability definition; from stationarity the expressions are valid for any choice of j.) Given (i) sequence of observations z_{i}, (ii) hidden labels s_{i}, and (iii) stationary Markov statistics, one can calculate: (i) p(Z_{0...L-1}), or (ii) the most likely hidden labeling (path with largest contribution to p(Z_{0...L-1})), or (iii) the re-estimation of emission and transmission probabilities such that p(Z_{0...L-1}) is maximized (using Expectation/Maximization).

### Forward and Backward Variables

The forward/backward variables can be used to evaluate p(Z_{0...L-1}) by breaking the sequence probability p(Z_{0...L-1}) into two pieces via use of a single hidden variable treated as a Bayesian parameter: p(Z_{0...L-1}) = Σ_{k}p(Z_{0...i},s_{i} = k)p(Z_{i+1...L-1},s_{i} = k) = Σ_{k}f_{ki}b_{ki}, where f_{ki} = p(Z_{0...i},s_{i} = k) and b_{ki} = p(Z_{i+1...L-1},s_{i} = k). A proof for the two-piece split is obtained by directly expanding the sequence probability (via Chow expansion: P(xyz) = P(x)P(y|x)P(z|xy), and Markov conditional reduction: P(xyz) = P(x)P(y|x)P(z|y), with appropriate notational book-keeping). Given stationarity, the state transition probabilities and the state probabilities at the i^{th} observation satisfy the trivial relation p_{qi} = Σ_{k}a_{kq}p_{k(i-1)}, where p_{qi} = p(S_{i} = q), and p_{q0} = p(S = q), and the latter probabilities are the state priors. The trivial recursion relation that is implied can be thought of as an operator equation, with operation the product by a_{kq} followed by summation (contraction) on the k index. The operator equation can be rewritten using an implied summation convention on repeated Greek-font indices (Einstein summation convention): p_{q} = a_{βq}p_{β} . Transition-probabilities in a similar operator role, but now taking into consideration local sequence information via the emission probabilities, are found in recursively defined expressions for the forward variables, f_{ki} = a_{βk}e_{ki} f_{β(i-1)}, and backward variables, b_{ki} = a_{kβ}e_{β(i+1)} b_{β(i+1)}. The recursive definitions on forward and backward variables permit efficient computation of observed sequence probabilities using dynamic programming tables. It is at this critical juncture that side information must mesh well with the states (column components in the table), i.e., in a manner like the emission or transition probabilities. Length information, for example, can be incorporated via length-distribution-biased transition probabilities (as described in a new method in the Methods).

## Declarations

### Acknowledgements

The author would like to thank Anil Yelundur for helpful discussions. Funding was provided by grants from the National Institutes for Health, The National Science Foundation, The Louisiana Board of Regents, and NASA.

## Authors’ Affiliations

## References

- Durbin R:
*Biological sequence analysis : probabilistic models of proteins and nucleic acids*. Cambridge, UK & New York: Cambridge University Press; 1998.View ArticleGoogle Scholar - Bezrukov SM, Vodyanoy I, Parsegian VA:
**Counting polymers moving through a single ion channel.***Nature*1994,**370**(6457):279–281. 10.1038/370279a0View ArticlePubMedGoogle Scholar - Bak P, Tang C, Wiesenfeld K:
**Self-organized Criticality – An explanation of 1/f noise.***Phys Rev Lett*1987,**59:**381. 10.1103/PhysRevLett.59.381View ArticlePubMedGoogle Scholar - Sinai Y:
*Introduction to Ergodic Theory*. Princeton University Press; 1976.Google Scholar - Sklar L:
*Physics and Chance*. Cambridge Univerisity Press; 1993.View ArticleGoogle Scholar - Kittel C, Kroemer H:
*Thermal Physics*. W.H. Freeman and Company, New York; 1980.Google Scholar - Jaynes E:
**Paradoxes of Probability Theory.***Internet accessible book preprint*1997. [http://omega.albany.edu:8008/JaynesBook.html]Google Scholar - Jumarie G:
*Relative Information : theories and applications*.*Volume 47*. Springer-Verlag. (Springer series in synergetics); 1990.Google Scholar - Katchalsky A, Curran PF:
*Nonequilibrium Thermodynamics in Biophysics*. Harvard University Press, Cambridge, MA; 1965.View ArticleGoogle Scholar - Prigogine I:
**Dynamical Foundations of Thermodynamics and Statistical Mechanics.**In*A Critical Review of Thermodynamics*. Edited by: Stuart E, GalOr B, Brainard A. Mono Books, Baltimore; 1970.Google Scholar - Prigogine I:
*A United Foundation of Dynamics and Thermodynamics. Chemica Scripta*. 1973,**4:**5–32.Google Scholar - Progigine I:
*Order out of Chaos*. Bantam Books, New York; 1984.Google Scholar - Winters-Hilt S, Landry M, Akeson M, Tanase M, Amin I, Coombs A, Morales E, Millet J, Baribault C, Sendamangalam S:
**Cheminformatics Methods for Novel Nanopore analysis of HIV DNA termini.***BMC Bioinformatics*2006,**7**(Suppl 2):S22.PubMed CentralView ArticlePubMedGoogle Scholar

## Copyright

This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.