Hidden Markov Model Variants and their Application

Winters-Hilt, Stephen

doi:10.1186/1471-2105-7-S2-S14

Volume 7 Supplement 2

Third Annual MCBIOS Conference. Bioinformatics: A Calculated Discovery

Proceedings
Open access
Published: 26 September 2006

Hidden Markov Model Variants and their Application

Stephen Winters-Hilt^1,2

BMC Bioinformatics volume 7, Article number: S14 (2006) Cite this article

5465 Accesses
21 Citations
Metrics details

Abstract

Markov statistical methods may make it possible to develop an unsupervised learning process that can automatically identify genomic structure in prokaryotes in a comprehensive way. This approach is based on mutual information, probabilistic measures, hidden Markov models, and other purely statistical inputs. This approach also provides a uniquely common ground for comparative prokaryotic genomics. The approach is an on-going effort by its nature, as a multi-pass learning process, where each round is more informed than the last, and thereby allows a shift to the more powerful methods available for supervised learning at each iteration. It is envisaged that this "bootstrap" learning process will also be useful as a knowledge discovery tool. For such an ab initio prokaryotic gene-finder to work, however, it needs a mechanism to identify critical motif structure, such as those around the start of coding or start of transcription (and then, hopefully more).

For eukaryotes, even with better start-of-coding identification, parsing of eukaryotic coding regions by the HMM is still limited by the HMM's single gene assumption, as evidenced by the poor performance in alternatively spliced regions. To address these complications an approach is described to expand the states in a eukaryotic gene-predictor HMM, to operate with two layers of DNA parsing. This extension from the single layer gene prediction parse is indicated after preliminary analysis of the C. elegans alt-splice statistics. State profiles have made use of a novel hash-interpolating MM (hIMM) method. A new implementation for an HMM-with-Duration is also described, with far-reaching application to gene-structure identification and analysis of channel current blockade data.

Background

Motivations for seeking MM and HMM Variants

Part of the problem in developing a reliable gene predictor is having reliable, biologist-verified, training data. In its simplest form, the training data needed is the raw genomic DNA together with a minimal annotation that labels coding regions. For the prokaryotic gene prediction much of the problem with obtaining high-confidence training data can be circumvented by using a bootstrap gene-prediction approach. This is possible in prokaryotes because of their simpler and more compact genomic structure: simpler in that long ORFs (open reading frames) are usually long genes, and compact in that motif searches upstream usually range over hundreds of bases rather than thousands (as in human).

In work on prokaryotic gene prediction (focusing on V. cholerae), software (smORF) was developed for an augmented ORF (open reading frame) characterization. Using that software, with a simple start-of-coding heuristic, it was possible to establish very good gene prediction for ORFs of length greater than 500 nucleotides. The smORF gene identification was used in a bootstrap gene-annotation process (where no initial training data was provided). The strength of the gene identification was then improved by use of a software tool that performed gap-interpolating-Markov-modeling (gIMM). When applied to the identified coding regions (most of the >500 length ORFs), six gIMMs were used (one for each frame of the codons, with forward and backward read senses). If poorly gIMM-scoring coding regions were rejected, performance improved. Failure analysis clearly indicates that start-codon modeling is needed in order to strengthen predictions. One of the benefits of the gIMM software is its gap-interpolating generalization. This permits motifs to be identified, particularly those sharing the same approximate alignment with the start-of-coding region. Using the bootstrap-identified genes from the smORF-based gene-prediction (including errors) as a train set permitted an unsupervised search for upstream regulatory structure. The classic Shine-Dalgarno sequence (the ribosome binding site) was the strongest signal in the 30-base window upstream from the start codon.

In preliminary work, a Hidden Markov Model based gene predictor was trained and tested on the C. elegans genome. Splice signal motifs and intron/exon statistical profiles were extracted, from which a gene predictor was constructed. To boost detection of exons, and indirectly introns, EST information was included. As with the prokaryotic research, further work entails identification of transcription regulation fingerprints, such as the promoter motifs, to clarify starts on transcription, etc. Even with better start-of-coding identification, however, parsing of eukaryotic coding regions by the HMM is still limited by the HMM's single gene assumption, as evidenced by the typically poor gene-prediction performance in alternatively spliced regions. To address these complications an approach is described to expand the states in the gene-predictor HMM to operate with two layers of DNA parsing. This extension from the single layer gene prediction parse was indicated after preliminary analysis of the C. elegans alt-splice statistics. State profiles were implemented using a hash-interpolating MM (hIMM) in this process (see Methods).

Another development described here is the incorporation of length distribution information, on exons and introns for example (see Figure 1, figure 2 and figure 3), into the HMM optimization via a novel, algorithmically convenient, implementation of HMM-with-Duration (an HMM that includes true length distribution information). One important application of such an HMM-with-duration includes feature extraction from channel current data. In this setting, information about the mean dwell times for the upper level and lower level is used as a first step in discriminating the class type. HMM with duration incorporates the length distribution information (dwell times distribution) on upper level and lower level states into the HMM optimization.

Markov Chains

Key property of a Markov chain:

P(x_i | x_i-1, ..., x₁) = P (x_i | x_i-1) = ,

where

are sometimes referred to as "transition probabilities" due to their sequential product contribution to sequence probability:

P(x) = P(x_L, x_L-1 ..., x₁) = P(x₁) ∏_{i = 2..L}

C_y is the count of events y, and C_xy is the count of simultaneous events x and y, T_y is the count of strings of length one, and T_xy is the count of strings of length two:

= P(x | y) = P(x,y)/P(y) = [C_xy/T_xy]/[C_y/T_y]

Since T_xy+1 = T_y → T_xy ≅ T_y (sequential data sample property if one long training block), = C_xy/C_y = C_xy/Σ_xC_xy, so C_xy is complete info for determining transition probabilities.

Viterbi Path

The recursive algorithm for the most likely state path given an observed sequence (the Viterbi algorithm) is expressed in terms of v_ki, the probability of the most probable path that ends with observation Z_i = z_i, and state S_i = k. The recursive relation is v_ki = max_n{e_kia_nkv_n(i-1)}, where the max_n{...} operation returns the maximum value of the argument over different values of index n, and the boundary condition on the recursion is v_k0 = e_k0p_k. The a_kl are the transition probabilities P(X_i = l|X_i-1 = k) to go from state k to state l. The e_kb are the emission probabilities P(S_i = b|X_i = k) while in state k. The Viterbi path labelings are then recursively defined by p(S_i|S_(i+1) = n) = argmax_k{v_kia_kn}, where the argmax_n{...} operation returns the index n with maximum value of the argument. The evaluation of sequence probability (and its Viterbi labeling) take the emission and transition probabilities as a given. Estimates on those emission and transition probabilities are usually obtained by the Expectation/Maximization (EM) algorithm (commonly referred to as the Baum-Welch algorithm in the context of HMMs [1]).

Further details on the information measures used, such as mutual information, and on Expectation/Maximization (EM), are placed in Appendix A.

EVA Projection

The HMM method is based on a stationary set of emission and transition probabilities. Emission broadening, via amplification of the emission state variances, is a filtering heuristic that leads to level-projection that strongly preserves transition times between major levels (see [13] for further details). Results from the emission variance amplification (EVA) emission broadening method are described in [13] (with varying amounts of variance amplification). This approach does not require the user to define the number of levels (classes). This is a major advantage compared to existing tools that require the user to determine the levels (classes) and perform a state projection. This allows kinetic features to be extracted with a "simple" FSA (Finite State Automaton) that requires minimal tuning.

Results

Ab initio prokaryotic gene-finding

Ab initio gene-finding begins with identification of codon structure on the basis of a simple, genome-wide, mutual information analysis. The idea is to examine the genome's two-base pairings, so first consider pairings where the bases are adjacent in the genome, i.e., a sample set consisting of all dinucleotides found upon moving a window of two-base width across the genome. The counts on those pairings can be used to obtain a mutual information between the first base and second base. The next iteration is to take two-base-pairings separated by one base (gap = 1), etc., with mutual information results as shown in Figure 4 (applied to V. Cholerae). This overall method is generalized further to define a gap-interpolating Markov model (details in Methods).

Ab initio gene-finding can then identify the stop codons and, thus, ORFs (see Figure 5). A generalization to codon void regions, then, leads to recognition of different, overlapping, potential gene regions (with two orientations). A tool has been developed to identify the genomic ORF topology in this generalized way, named "smORF" (see Figure 6 and Table 1 for further definitions and results). This genome-topology tool also clearly shows differences between bacteria (a possible "fingerprint" tool) (see Figure 6 and Figure 7).

Table 1 Topology-Index Histogram results are shown for Vibrio Cholerae. The N = 400 set correspond to indexing with ORFs greater than 400 bases long. Similarly, the N = 200 set correspond to indexing with ORFs greater than 200 bases long.

Full size table

smORF offers information about open reading frames (ORFs), and tallies information about other such codon void regions (an ORF is a void in three codons: TAA, TAG, TGA). This allows for a more informed selection process when sampling from a genome, such that non-overlapping gene starts can be cleanly and unambiguously sampled. The goal is, initially, to identify key gene structures (e.g., stop codons, etc.) and use only the highest confidence examples to train profilers. Once this is done, Markov models (MMs) can be constructed on the suspected start/stop regions and coding/noncoding regions. The algorithm then iterates again, informed with the MM information, and partly relaxes the high fidelity sampling restrictions (essentially, the minimum allowed ORF length is made smaller). A crude gene-finder can be constructed on the high fidelity ORFs by use of a very simple heuristic: scan from the start of an ORF and stop at the first in-frame "atg". This analysis was applied to the V. cholerae genome (Chr. I). 1253 high fidelity ORFs were identified out of 2775 known genes. This first-"atg" heuristic provided a gene prediction accuracy of 1154/1253 (92.1% of predictions of gene regions were exactly correct). If small shifts are allowed in the predicted position of the start-codon relative to the first-"atg" (within 25 bases on either side), then prediction accuracy improves to 1250/1253 (99.8%). This actually elucidates a key piece of information needed to improve such a prokaryotic gene-finder. Basically, information is needed to help identify the correct start codon in a 50 base window. Such information exists in the form of DNA motifs corresponding to the binding footprint of regulatory biomolecules (that play a role in transcriptional or translational control).

For an ab initio gene-finder to work, it will need to have a mechanism to identify critical motif structure, such as those around the start of coding or start of transcription (and then, hopefully more). In essence, a Markov model is needed with greater "reach" – the gap-interpolating Markov model (gIMM) was developed for this purpose, and is described in the Methods. To set up an ab initio motif discovery windows around the (1253) purported start of genes were sampled. The windows ranged from the 40 bases preceding the start codon to the first 20 bases of coding (a 60 base window). Some of the windows represent noise, as the first pass of the bootstrap feature extraction has only 92.1% accuracy. Even so, the gIMM is able to clearly discern the Shine-Dalgarno consensus sequence. With the critical motifs already discerned, further iterations of the MM construction, possibly as an HMM now, will undoubtedly aid in improving performance.

Alternate-Splice Labeling Scheme for Eukaryotic HMM

The labeling scheme assigns a label to each base in the sequence. Exon frame position 0 bases have label 0 if in the forward read or A if in the reverse. Likewise, exon frame position 1 bases have label 1 or B, and exon frame position 2 bases have label 2 or C. Introns, for purposes of the analysis here, are represented as 'i' or 'I' for intron on the forward or reverse strand (in the HMM implementation the intron states are actually split out in order to maintain proper frame on re-entry to coding regions via state transition restrictions). Junk DNA is labeled 'j'.

Track Label Information

Suppose there were multiple annotations regarding the labeling of a base (i.e., alternative splicing). As the genome is traversed in the forward direction, gene annotations that aren't in conflict with annotations already seen are used to determine labels on label-track-one. If a gene annotation is in conflict (an alternative splicing) then its label information is recorded on a second, adjacent, label track. Table 2 shows the label counts on track one and on track two (where the default base label is taken to be 'j'). From Table 2 it can be seen that about 8% of the first chromosome of C. elegans genes has alternate splicing. Similarly, Table 3 shows the transition counts between labels.

Table 2 (a) shows the Track 1 Label Counts, and (b) shows the Track 2 Label Counts.

Full size table

Table 3 (a) shows the Track 1 Transition Counts, and (b) shows the Track 2 Transition Counts.

Full size table

V-Labels and V-Transitions

Counts on coding-overlap V-label are shown in Table 4. Notice how the V-labels tend NOT to favor overlapping that is a simple frame-shift in a given read direction (i.e., the V01 count is very low compared to V00, etc.). There are 263 transitions on V-labels with non-zero counts. Many of the V-transitions have very low counts and can either be ignored in the initial model, or can have their stat's boosted by bringing information from related genomes (C. Briggsae). Ignoring those V-transitions with negligible counts as not allowed transitions, as well as those implicitly describing no alternative splicing locally (an overlap with 'j' in either track), reduces to an active V-transition set consisting of 86 transitions between V-labels. This is a tractable number of states to manage in the HMM analysis, suggesting a simple and direct approach to alternative splice HMM analysis. The number of V-transitions, whether counting all 263, or the 86 'active' ones, is still much smaller than the 72*72 = 5,184 transitions that would have been surmised for track annotations that were entirely independent.

Table 4 V-label Counts.

Full size table

HMM-with-Duration with application to channel current signal analysis

Synthetic Data Generation

We have generated two sets of synthetic data with states upper level (UL) and lower level (LL) (see Figure 8). The life times of the UL & LL are drawn from a Poisson distribution. The means of UL & LL dwell times for data set 1 are 20 ms & 40 ms respectively. The means of UL & LL dwell times for data set 2 are 40 ms each. A mean value of 48 pA is used for LL and a mean value of 72 pA is used for UL. We have assumed that the noise at UL & LL vary at around +/- 5pA around their respective means (Gaussian distribution). Hence we have assumed a variance of 2.78. See Figure 8.

The HMM-with-Duration implementation (described in the Methods) has been tested in terms of its performance at parsing synthetic blockade signals with attributes and visual appearance almost indiscernible from the experimentally observed blockade data. The benefit of the synthetic data is that the mean lifetimes of the blockade levels can range over an exhaustive set of possibilities for thorough testing of the HMM-with-Duration. The synthetic data was designed to have two levels, with lifetime in each level determined by a governing distribution (Poisson and Gaussian distributions with a range of mean values were considered). In Figure 8, the data was generated with Poisson distributed lifetimes and parsed by an HMM's with and without duration, with length distribution generated from a Poisson (Figure 8a) or a Gaussian (Figure 8b) Distributions. The results clearly demonstrate the superior performance of the HMM-with-duration over its simpler, HMM without Duration, formulation. With use of the EVA-projection method described in the Background (and in [13]) this affords a robust means to obtain kinetic feature extraction. The emission broadening introduced in the HMM w/wo duration comparison in Figure 8a and 8b, is precisely what occurs with EVA-projection, with similar susceptibility to failure due to over-representation of brief blockade lifetimes in the HMM without Duration parse. Using HMM with duration this problem is resolved, brief toggles between states are now penalized according to the statistical rarity of the comparatively brief lifetime events (see Fig.'s 8a and 8b). Resolving this problem is critical for accurate kinetic feature extraction, and the results suggests that this problem can be elegantly solved with a pairing of the HMM-with-Duration stabilization with EVA-projection.

Discussion

aTOPO

As already mentioned, the gIMM tool identifies statistically anomalous motifs (usually restricted to some zone upstream). Another tool being developed, "aTOPO" (for motifs with Anomalous TOPOlogy), uses the information from the upstream zones swept with the gIMM tool to annotate upstream motifs, and offers further information on whether an anomalous motif is biologically significant by looking at is positional distribution and its correlations to other motifs. This work is still preliminary, so this will not be discussed further.

Methods

Expectation/Maximization

Expectation/Maximization, EM, is a general method to estimate the maximum likelihood when there is hidden or missing data. The method is guaranteed to find a maximum, but it may only be a local maximum, as is shown here (along the lines of [1]). For a statistical model with parameters θ, observed quantities x, and hidden labels π, the EM goal is to maximize the log likelihood of the observed quantities with respect to θ : log P(x|θ) = log[Σ_πP(x,π|θ)]. At each iteration of the estimation process we would like the new log likelihood, P(x|θ), to be greater than the old, P(x|θ*). The difference in log likelihoods can be written such that one part is a relative entropy, the positivity of which makes the EM algorithm work:

log P(x|θ) - log P(x|θ*) = Q(θ|θ*) - Q(θ*|θ*) + D[P(π|x,θ*)||P(π|x,θ)],

where D[...||...] is the Kullback-Leibler divergence, or relative entropy, and Q(θ|θ*) = Σ_πP(x,π|θ). Now a greater log likelihood results simply by maximizing Q(θ|θ*) with respect to parameters θ. The EM iteration is comprised of two steps: (1) Estimation – calculate Q(θ|θ*), and (2) Maximization – maximize Q(θ|θ*) with respect to parameters θ.

For an HMM the hidden labels π correspond to a path of states. Along path π the emission and transition parameters will be used to varying degrees. Along path π, denote usage counts on transition probability a_kl by A_kl(π) and those on emission probabilities e_kb by E_k(b,π) (following [1] conventions), P(x,π|θ) can then be written:

P(x,π|θ) = ∏_{k = 0}∏_b[e_kb]^E_k(b,π) ∏_{k = 0}∏_{l = 1}[a_kl]^A_kl(π).

Using the above form for P(x,π|θ), A_kl for the expected value of A_kl(π) on path π, and E_k(b) for the expected value of E_k(b,π) on path π, it is then possible to write Q(θ|θ*) as:

Q(θ|θ*) = Σ_{k = 1}Σ_b E_k(b) log[e_kb] + Σ_{k = 0}Σ_{l = 1} A_kl log[a_kl].

It then follows (relative entropy positivity argument again) that the maximum likelihood estimators (MLEs) for a_kl and e_kb are:

a_kl = A_kl /(Σ_l A_kl) and e_kb = E_k(b)/(Σ_b E_k(b)).

The latter estimation is for when the state sequence is known. For an HMM (with Baum-Welch algorithm) it completes the Q maximization step (M-step), which is obtained with the MLEs for a_kl and e_kb. The E-step requires that Q be calculated, for the HMM this requires that A_kl and E_k(b) be calculated. This calculation is done using the forward/backward formalism with rescaling in the next section.

Emission and Transition Expectations with Rescaling

For an HMM, the probability that transition a_kl is used at position i in sequence Z is:

p(S_i = k, S_(i+1) = l|X) = p(S_i = k, S_(i+1) = l, Z)/p(Z), where

p(S_i = k,S_(i+1) = l,Z) = p(z₀,...,z_i,S_i = k)p(S_(i+1) = l|S_i = k)p(z_i+1|S_(i+1) = l)p(z_i+2,...,z_L-1|S_(i+1) = l).

In terms of the previous notation with forward/backward variables:

p(S_i = k, S_(i+1) = l|X) = f_ki a_kl e_l(i+1) b_l(i+1) /p(Z),

So the expected number of times a_kl is used, A_kl, simply sums over all positions i (except last with indexing):

A_kl = Σ_i f_ki a_kl e_l(i+1)b_l(i+1)/p(Z),

Similarly, the probability that b is emitted by state k at position i in sequence Z:

p(z_i = b, S_i = k|X) = [ p(z₀,...,z_i,S_i = k) p(z_i+1,...,z_L-1|S_i = k)/p(Z) ] δ(z_i-b),

where a Kronecker delta function is used to enforce emission of b at position i. The expected number of times b is emitted by state k for sequence Z:

E_k(b) = Σ_i f_ki b_ki/p(Z) δ(z_i-b),

In practice, direct computation of the forward and backward variables can run into underflow errors. Rescaling variables at each step can control this problem. One rescaling approach is to rescale the forward variables such that Σ_iF_ki = 1, where F_ki is the rescaled forward variable, and B_ki is the rescaled backward variable: F_ki = a_βke_kiF_β(i-1)/s_i, and B_ki = a_kβe_β(i+1) B_β(i+1)/t_i+1, where s_i and t_i+1, are the rescaling constants. The expectation on counts for the various emissions and transitions then reduce to:

A_kl = Σ_i F_ki a_kl e_l(i+1) B_l(i+1)/[Σ_kΣ_l F_ki a_kl e_l(i+1)] and E_k(b) = Σ_i F_ki B_ki δ(z_i-b).

Gap Interpolating Markov Model (gIMM)

We construct a gIMM using Mutual Information (MI). Based on the three reading frames of the DNA, the Mutual Information is frame dependent.

Find out which position among the 24 prior positions has the maximum mutual information with frame position 0. Suppose it is p1

Frame positions: ...012012012012012012012|012012...

Prior positions: ...------------------987654321

Find out a position from p = 1...24 and p ≠ p1 such that it has the maximum mutual information with frame position 0 and the prior position p1. If it is p2: MI(x0,x1;x2)>MI(x0,x1;xk).

Repeat the above procedure until we get an nth-order MI-linkage, this defines an interpolated Markov model that allows for gaps (gIMM). Likewise, for the frame 1 and 2, and reverse frames.

Most importantly, this can also be done for the regions upstream of coding regions, with reference base just prior to the start codon, in searches for regulatory motifs.

Hash Interpolated MM (hIMM)

Given the current state and the emitted sequence as x1,...., xL; compute:

P(xL|x1,...., xL-1) ≈ Count(x1,...., xL)/Count(x1,...., xL-1)

Iff Count(x1,...., xL-1) ≥ 400 i.e. only if the parental sequence shows statistical significance

Store P(xL|x1,...., xL-1) in the hash

Maintain a separate hash for each of the following states – Junk, Intron and Exon0, Exon1 & Exon2

Pseudo Code:

The emission probabilities for states such as (jj), (ee) and (ii) are computed using hash IMM as:

Given sequence 'x1,...., xL'

If sequence is defined;

Return Probability

Else

Recurse on 'x1,...., xL-1'

Minimum string length allowed for:

1.
Junk Region = 6
2.
Intron Region = 6
3.
Exon Region = 8

UL/LL Data Generation Method

We generated 75000 random numbers that were Gaussian distributed with mean 48 pA and variance 2.78. We also generated 50000 random numbers that were Gaussian distributed with mean 72 pA and variance 2.78.

Since the dwell times of the UL & LL are Poisson distributed, the transition times (duration times between state transitions) are assumed to be exponentially distributed. Therefore for data set 1, the mean interval between successive UL states is 40 ms and the mean interval between successive LL states is 20 ms.

Cumulant Distribution Function (CDF) of exponential distribution:

F = 1 - exp (-t/a) where 'a' is the mean

Hence t = a.log_e(1 - F)

We generate uniformly distributed 'F' between [0, 1). We then compute 't' from the above equation.

We have 50,000 samples per second. Hence a dwell time of 't' ms corresponds to 50.t samples rounded off to the nearest integer.

HMM-with-Duration via Cumulant Transition Probabilities

The transition probabilities for (ee) – (ee); (jj) – (jj) and (ii) – (ii) type transitions are computed as:

Prob(jj|j_length = L) = Prob(j_length ≥ L+1)/Prob(j_length ≥ L)

The transition probabilities for transitions such as (jj) – (je), (ee) – (ej), (ee) – (ei), (ii) – (ie) are computed as:

If the total number of (ej) transitions is 60 and the total number of (ei) transitions is 40, then:

Prob(ei|e_length = L) = Prob(e_length = L)/Prob(e_length ≥ L) × 40/(40 + 60)

Prob(ej|e_length = L) = Prob(e_length = L)/Prob(e_length ≥ L) × 60/(40 + 60)

Pseudo Code:

Maintain separate counters for the junk, exon and intron regions.

The counters are updated as:

1.
The exon counter is set to 2 for a (je) – (ee) transition
2.
The exon counter gets incremented by 1 for every (ee) – (ee) transition

Prob(e_length ≥ L+1) is computed as:

We have Prob(e_length ≥ L+1) = 1 - Σ_{i = 1..L} Prob(e_length = i)

Hence we generate a list such that for each index 'k > 0', the value 1 - Σ_{i = 1..k} Prob(e_length = i) is stored

HMM-with-Duration Viterbi Implementation

The transition probabilities for (UL) – (UL); (LL) – (LL) are computed as:

Prob(ul|ul_len = L) = Prob(ul_len ≥ L+1)/Prob(ul_len ≥ L)

The transition probabilities for transitions such as (UL) – (LL), (LL) – (UL) are computed as:

Prob(ul-ll|ul_len = L) = Prob(ul_len = L)/Prob(ul_len ≥ L)

Pseudo Code (see dynamic programming table in Figure 9):

Maintain separate counters for the UL and LL regions

The counters are updated as:

1.
The UL counter is set to 1 for a (ll) – (ul) transition
2.
The UL counter gets incremented by 1 for every (ul) – (ul) transition

Prob(ul_len ≥ L+1) is computed as:

We have Prob(ul_len ≥ L+1) = 1 - Σ_{i = 1} ^L Prob(ul_len = i)

Hence we generate a list such that for each index 'k > 0', the value 1 - Σ_{i = 1} ^kProb(ul_len = i) is stored

Prob(ul_len = L) is computed as:

We generate a list such that for each index 'k', the value Prob(ul_len = k) is stored

Appendix A

Information measures

The fundamental information measures are Shannon entropy, mutual information, and relative entropy (also known as the Kullback-Leibler divergence or distance). Shannon entropy: σ = -Σ_xp(x)log(p(x)), is a measure of the information in distribution p(x). Mutual Information: μ = Σ_xΣ_yp(xy)log(p(xy)/p(x)p(y)), is a measure of information one random variable has about another random variable. Relative Entropy (Kullback-Leibler distance): ρ = Σ_x p(x) log(p(x)/q(x)), is a measure of distance between two probability distributions. Mutual information is a special case of relative entropy between a joint probability (two-component in simplest form) and the product of component probabilities.

Hidden Markov Models with possible Side Information

Hidden Markov Models [1] provide a statistical framework for sequences of observations obeying stationary Markov statistics. The "hidden" part of the HMM consists of the labelings, s_i, for each observation, z_i, where the index i labels the observation. The stationary statistics for a first order HMM are described in terms of emission probabilities, e_ni = p(Z_j = z_i|S_j = n), and transition probabilities, a_nm = p(S_j = m|S_(j-1) = n). (The indexing on j is left in for clarity on the transition probability definition; from stationarity the expressions are valid for any choice of j.) Given (i) sequence of observations z_i, (ii) hidden labels s_i, and (iii) stationary Markov statistics, one can calculate: (i) p(Z_0...L-1), or (ii) the most likely hidden labeling (path with largest contribution to p(Z_0...L-1)), or (iii) the re-estimation of emission and transmission probabilities such that p(Z_0...L-1) is maximized (using Expectation/Maximization).

Forward and Backward Variables

The forward/backward variables can be used to evaluate p(Z_0...L-1) by breaking the sequence probability p(Z_0...L-1) into two pieces via use of a single hidden variable treated as a Bayesian parameter: p(Z_0...L-1) = Σ_kp(Z_0...i,s_i = k)p(Z_i+1...L-1,s_i = k) = Σ_kf_kib_ki, where f_ki = p(Z_0...i,s_i = k) and b_ki = p(Z_i+1...L-1,s_i = k). A proof for the two-piece split is obtained by directly expanding the sequence probability (via Chow expansion: P(xyz) = P(x)P(y|x)P(z|xy), and Markov conditional reduction: P(xyz) = P(x)P(y|x)P(z|y), with appropriate notational book-keeping). Given stationarity, the state transition probabilities and the state probabilities at the i^th observation satisfy the trivial relation p_qi = Σ_ka_kqp_k(i-1), where p_qi = p(S_i = q), and p_q0 = p(S = q), and the latter probabilities are the state priors. The trivial recursion relation that is implied can be thought of as an operator equation, with operation the product by a_kq followed by summation (contraction) on the k index. The operator equation can be rewritten using an implied summation convention on repeated Greek-font indices (Einstein summation convention): p_q = a_βqp_β . Transition-probabilities in a similar operator role, but now taking into consideration local sequence information via the emission probabilities, are found in recursively defined expressions for the forward variables, f_ki = a_βke_ki f_β(i-1), and backward variables, b_ki = a_kβe_β(i+1) b_β(i+1). The recursive definitions on forward and backward variables permit efficient computation of observed sequence probabilities using dynamic programming tables. It is at this critical juncture that side information must mesh well with the states (column components in the table), i.e., in a manner like the emission or transition probabilities. Length information, for example, can be incorporated via length-distribution-biased transition probabilities (as described in a new method in the Methods).

References

Durbin R: Biological sequence analysis : probabilistic models of proteins and nucleic acids. Cambridge, UK & New York: Cambridge University Press; 1998.
Book Google Scholar
Bezrukov SM, Vodyanoy I, Parsegian VA: Counting polymers moving through a single ion channel. Nature 1994, 370(6457):279–281. 10.1038/370279a0
Article CAS PubMed Google Scholar
Bak P, Tang C, Wiesenfeld K: Self-organized Criticality – An explanation of 1/f noise. Phys Rev Lett 1987, 59: 381. 10.1103/PhysRevLett.59.381
Article PubMed Google Scholar
Sinai Y: Introduction to Ergodic Theory. Princeton University Press; 1976.
Google Scholar
Sklar L: Physics and Chance. Cambridge Univerisity Press; 1993.
Book Google Scholar
Kittel C, Kroemer H: Thermal Physics. W.H. Freeman and Company, New York; 1980.
Google Scholar
Jaynes E: Paradoxes of Probability Theory. Internet accessible book preprint 1997. [http://omega.albany.edu:8008/JaynesBook.html]
Google Scholar
Jumarie G: Relative Information : theories and applications. Volume 47. Springer-Verlag. (Springer series in synergetics); 1990.
Google Scholar
Katchalsky A, Curran PF: Nonequilibrium Thermodynamics in Biophysics. Harvard University Press, Cambridge, MA; 1965.
Book Google Scholar
Prigogine I: Dynamical Foundations of Thermodynamics and Statistical Mechanics. In A Critical Review of Thermodynamics. Edited by: Stuart E, GalOr B, Brainard A. Mono Books, Baltimore; 1970.
Google Scholar
Prigogine I: A United Foundation of Dynamics and Thermodynamics. Chemica Scripta. 1973, 4: 5–32.
CAS Google Scholar
Progigine I: Order out of Chaos. Bantam Books, New York; 1984.
Google Scholar
Winters-Hilt S, Landry M, Akeson M, Tanase M, Amin I, Coombs A, Morales E, Millet J, Baribault C, Sendamangalam S: Cheminformatics Methods for Novel Nanopore analysis of HIV DNA termini. BMC Bioinformatics 2006, 7(Suppl 2):S22.
Article PubMed Central PubMed Google Scholar

Download references

Acknowledgements

The author would like to thank Anil Yelundur for helpful discussions. Funding was provided by grants from the National Institutes for Health, The National Science Foundation, The Louisiana Board of Regents, and NASA.

Author information

Authors and Affiliations

Department of Computer Science, University of New Orleans, New Orleans, LA, 70148, USA
Stephen Winters-Hilt
The Research Institute for Children, 200 Henry Clay Ave., New Orleans, LA, 70118, USA
Stephen Winters-Hilt

Authors

Stephen Winters-Hilt
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Stephen Winters-Hilt.

Rights and permissions

This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Reprints and permissions

About this article

Cite this article

Winters-Hilt, S. Hidden Markov Model Variants and their Application. BMC Bioinformatics 7 (Suppl 2), S14 (2006). https://doi.org/10.1186/1471-2105-7-S2-S14

Download citation

Published: 26 September 2006
DOI: https://doi.org/10.1186/1471-2105-7-S2-S14

Third Annual MCBIOS Conference. Bioinformatics: A Calculated Discovery

Hidden Markov Model Variants and their Application

Abstract

Background