Hidden Markov Model Variants and their Application

Markov statistical methods may make it possible to develop an unsupervised learning process that can automatically identify genomic structure in prokaryotes in a comprehensive way. This approach is based on mutual information, probabilistic measures, hidden Markov models, and other purely statistical inputs. This approach also provides a uniquely common ground for comparative prokaryotic genomics. The approach is an on-going effort by its nature, as a multi-pass learning process, where each round is more informed than the last, and thereby allows a shift to the more powerful methods available for supervised learning at each iteration. It is envisaged that this "bootstrap" learning process will also be useful as a knowledge discovery tool. For such an ab initio prokaryotic gene-finder to work, however, it needs a mechanism to identify critical motif structure, such as those around the start of coding or start of transcription (and then, hopefully more). For eukaryotes, even with better start-of-coding identification, parsing of eukaryotic coding regions by the HMM is still limited by the HMM's single gene assumption, as evidenced by the poor performance in alternatively spliced regions. To address these complications an approach is described to expand the states in a eukaryotic gene-predictor HMM, to operate with two layers of DNA parsing. This extension from the single layer gene prediction parse is indicated after preliminary analysis of the C. elegans alt-splice statistics. State profiles have made use of a novel hash-interpolating MM (hIMM) method. A new implementation for an HMM-with-Duration is also described, with far-reaching application to gene-structure identification and analysis of channel current blockade data.

For eukaryotes, even with better start-of-coding identification, parsing of eukaryotic coding regions by the HMM is still limited by the HMM's single gene assumption, as evidenced by the poor performance in alternatively spliced regions. To address these complications an approach is described to expand the states in a eukaryotic gene-predictor HMM, to operate with two layers of DNA parsing. This extension from the single layer gene prediction parse is indicated after preliminary analysis of the C. elegans alt-splice statistics. State profiles have made use of a novel hash-interpolating MM (hIMM) method. A new implementation for an HMM-with-Duration is also described, with far-reaching application to gene-structure identification and analysis of channel current blockade data.

Motivations for seeking MM and HMM Variants
Part of the problem in developing a reliable gene predictor is having reliable, biologist-verified, training data. In its simplest form, the training data needed is the raw genomic DNA together with a minimal annotation that labels coding regions. For the prokaryotic gene prediction much of the problem with obtaining high-confidence training data can be circumvented by using a bootstrap gene-prediction approach. This is possible in prokaryotes because of their simpler and more compact genomic structure: simpler in that long ORFs (open reading In preliminary work, a Hidden Markov Model based gene predictor was trained and tested on the C. elegans genome. Splice signal motifs and intron/exon statistical profiles were extracted, from which a gene predictor was constructed. To boost detection of exons, and indirectly introns, EST information was included. As with the prokaryotic research, further work entails identification of transcription regulation fingerprints, such as the promoter motifs, to clarify starts on transcription, etc. Even with better start-of-coding identification, however, parsing of eukaryotic coding regions by the HMM is still limited by the HMM's single gene assumption, as evidenced by the typically poor gene-prediction performance in alternatively spliced regions. To address these complications an approach is described to expand the states in the gene-predictor HMM to operate with two layers of DNA parsing. This extension from the single layer gene prediction parse was indicated after preliminary analysis of the C. elegans alt-splice statistics. State profiles were implemented using a hash-interpolating MM (hIMM) in this process (see Methods).
Another development described here is the incorporation of length distribution information, on exons and introns for example (see Figure 1, figure 2 and figure 3), into the HMM optimization via a novel, algorithmically convenient, implementation of HMM-with-Duration (an HMM that includes true length distribution information). One important application of such an HMM-with-duration includes feature extraction from channel current data. In this setting, information about the mean dwell times for the upper level and lower level is used as a first step in discriminating the class type. HMM with duration incorporates the length distribution information (dwell times distribution) on upper level and lower level states into the HMM optimization.

Markov Chains
Key property of a Markov chain: where are sometimes referred to as "transition probabilities" due to their sequential product contribution to sequence probability: .L C y is the count of events y, and C xy is the count of simultaneous events x and y, T y is the count of strings of length one, and T xy is the count of strings of length two: Frequencies of different length junk regions in a test subset of C. elegans Since T xy +1 = T y → T xy ≅ T y (sequential data sample property if one long training block), = C xy /C y = C xy / Σ x C xy , so C xy is complete info for determining transition probabilities.

Viterbi Path
The recursive algorithm for the most likely state path given an observed sequence (the Viterbi algorithm) is expressed in terms of v ki , the probability of the most probable path that ends with observation Z i = z i , and state S i = k. The recursive relation is v ki = max n {e ki a nk v n(i-1) }, where the max n {...} operation returns the maximum value of the argument over different values of index n, and the boundary condition on the recursion is v k0 = e k0 p k . The a kl are the transition probabilities P(X i = l|X i-1 = k) to go from state k to state l. The e kb are the emission probabilities P(S i = b|X i = k) while in state k. The Viterbi path labelings are then recursively defined by p(S i |S (i+1) = n) = argmax k {v ki a kn }, where the argmax n {...} operation returns the index n with maximum value of the argument. The evaluation of sequence probability (and its Viterbi labeling) take the emission and transition probabilities as a given. Estimates on those emission and transition probabilities are usually obtained by the Expectation/Maximization (EM) algorithm (commonly referred to as the Baum-Welch algorithm in the context of HMMs [1]).
Further details on the information measures used, such as mutual information, and on Expectation/Maximization (EM), are placed in Appendix A.

EVA Projection
The HMM method is based on a stationary set of emission and transition probabilities. Emission broadening, via amplification of the emission state variances, is a filtering heuristic that leads to level-projection that strongly preserves transition times between major levels (see [13] for further details). Results from the emission variance amplification (EVA) emission broadening method are described in [13] (with varying amounts of variance amplification). This approach does not require the user to define the number of levels (classes). This is a major advantage compared to existing tools that require the user to determine the levels (classes) and perform a state projection. This allows kinetic features to be extracted with a "simple" FSA (Finite State Automaton) that requires minimal tuning.

Ab initio prokaryotic gene-finding
Ab initio gene-finding begins with identification of codon structure on the basis of a simple, genome-wide, mutual information analysis. The idea is to examine the genome's two-base pairings, so first consider pairings where the bases are adjacent in the genome, i.e., a sample set consisting of all dinucleotides found upon moving a window of two-base width across the genome. The counts on those pairings can be used to obtain a mutual information between the first base and second base. The next iteration is to take two-base-pairings separated by one base (gap = a x x Frequencies of different length exon regions in a test subset of C. elegans Ab initio gene-finding can then identify the stop codons and, thus, ORFs (see Figure 5). A generalization to codon void regions, then, leads to recognition of different, overlapping, potential gene regions (with two orientations). A tool has been developed to identify the genomic ORF topology in this generalized way, named "smORF" (see Figure 6 and Table 1 for further definitions and results). This genome-topology tool also clearly shows differences between bacteria (a possible "fingerprint" tool) (see . This allows for a more informed selection process when sampling from a genome, such that non-overlapping gene starts can be cleanly and unambiguously sampled. The goal is, initially, to identify key gene structures (e.g., stop codons, etc.) and use only the highest confidence examples to train profilers. Once this is done, Markov models (MMs) can be constructed on the suspected start/stop regions and coding/noncoding regions. The algorithm then iterates again, informed with the MM information, and partly relaxes the high fidelity sampling restrictions (essentially, the minimum allowed ORF length is made smaller). A crude gene-finder can be constructed on the high fidelity ORFs by use of a very simple heuristic: scan from the start of an ORF and stop at the first in-frame "atg". This analysis was applied to the V. cholerae genome (Chr. I). 1253 high fidelity ORFs were identified out of 2775 known genes. This first-"atg" heuristic provided a gene prediction accuracy of 1154/1253 (92.1% of predictions of gene regions were exactly correct). If small shifts are allowed in the predicted position of the start-codon relative to the first-"atg" (within 25 bases on either side), then prediction accuracy improves to 1250/1253 (99.8%). This actually elucidates a key piece of information needed to improve such a prokaryotic gene-finder. Basically, information is needed to help identify the correct start codon in a 50 base window. Such information exists in the form of DNA motifs corresponding to the binding footprint of regulatory biomolecules (that play a role in transcriptional or translational control).
An examination of he void sizes encountered for various codons (smORFs), or groupings of codons, reveals the stop codons very clearly as codons with anomalously lengthy stop-codon void regions  The y-axis shows the mutual information on base-pairs in he V.cholerae gemone, the x-axis is the gap size between bases used to construct the base pairs Figure 4 The y-axis shows the mutual information on base-pairs in he V. cholerae gemone, the x-axis is the gap size between bases used to construct the base pairs. Mutual Information, MI(X;Y) is: μ = Σ x Σ y p(xy)log(p(xy)/p(x)p(y)). If X and Y are independent r.v's then MI = 0. If we have a DNA sequence x 1 .... x i x i +1 x i+2 ........x n (where x k = {a,c,g, or t}) then we can get counts on pairs x i x i+1 for i = 1..n, and assuming stationarity on the data, and large enough n, we can speak of the joint probability p(X,Y). Calculation of MI(X,Y) then gives an indication of the linkage between base probabilities in dinucleotide probabilities. This can be extended to linkages when the two bases aren't sequential (have a base gap between them greater than zero), such as pairs based on xixi+2 (gap = 1), etc. This type of statistical framework can then be iterated to higher order MI calculations in a variety of ways to explore a number of statistical linkages and build towards a motif identifier based on such linkages (gIMM). Such an analysis on the V. Cholerae Chr. I genome, above, clearly indicates a three-component encoding of data, i.e., a 3-element codon structure is revealed. Furthermore, judging from the strong linkages for gaps 1 through 5, it is also clear that hexamer Markov model statistics will be strong in many regions of the genome. For an ab initio gene-finder to work, it will need to have a mechanism to identify critical motif structure, such as those around the start of coding or start of transcription (and then, hopefully more). In essence, a Markov model is needed with greater "reach" -the gap-interpolating Markov model (gIMM) was developed for this purpose, and is described in the Methods. To set up an ab initio motif discovery windows around the (1253) purported start of genes were sampled. The windows ranged from the 40 bases preceding the start codon to the first 20 bases of coding (a 60 base window). Some of the windows represent noise, as the first pass of the bootstrap feature extraction has only 92.1% accuracy. Even so, the gIMM is able to clearly discern the Shine-Dalgarno consensus sequence. With the critical motifs already discerned, further iterations of the MM construction, possibly as an HMM now, will undoubtedly aid in improving performance.

Alternate-Splice Labeling Scheme for Eukaryotic HMM
The labeling scheme assigns a label to each base in the sequence. Exon frame position 0 bases have label 0 if in the forward read or A if in the reverse. Likewise, exon frame position 1 bases have label 1 or B, and exon frame position 2 bases have label 2 or C. Introns, for purposes of the analysis here, are represented as 'i' or 'I' for intron on the forward or reverse strand (in the HMM implementation the intron states are actually split out in order to maintain proper frame on re-entry to coding regions via state transition restrictions). Junk DNA is labeled 'j'.

Track Label Information
Suppose there were multiple annotations regarding the labeling of a base (i.e., alternative splicing). As the genome is traversed in the forward direction, gene annotations that aren't in conflict with annotations already seen are used to determine labels on label-track-one. If a gene annotation is in conflict (an alternative splicing) then its label information is recorded on a second, adjacent, label track. Table 2 shows the label counts on track one and on track two (where the default base label is taken to be 'j'). From Table 2 it can be seen that about 8% of the first chromosome of C. elegans genes has alternate splicing. Similarly, Table 3 shows the transition counts between labels.

V-Labels and V-Transitions
Counts on coding-overlap V-label are shown in Table 4. Notice how the V-labels tend NOT to favor overlapping that is a simple frame-shift in a given read direction (i.e., the V01 count is very low compared to V00, etc.). There are 263 transitions on V-labels with non-zero counts. Many of the V-transitions have very low counts and can either be ignored in the initial model, or can have their stat's boosted by bringing information from related genomes (C. Briggsae). Ignoring those V-transitions with negligible counts as not allowed transitions, as well as those implicitly describing no alternative splicing locally (an overlap with 'j' in either track), reduces to an active Vtransition set consisting of 86 transitions between Vlabels. This is a tractable number of states to manage in the HMM analysis, suggesting a simple and direct approach to alternative splice HMM analysis. The number of V-transitions, whether counting all 263, or the 86 'active' ones, is still much smaller than the 72*72 = 5,184 a. Topology Index Histograms shown for the V. cholerae gemone, the x-axis is the gap size between bases used to construct the base pairs transitions that would have been surmised for track annotations that were entirely independent.

HMM-with-Duration with application to channel current signal analysis Synthetic Data Generation
We have generated two sets of synthetic data with states upper level (UL) and lower level (LL) (see Figure 8).  Figure 8.
The HMM-with-Duration implementation (described in the Methods) has been tested in terms of its performance at parsing synthetic blockade signals with attributes and visual appearance almost indiscernible from the experimentally observed blockade data. The benefit of the synthetic data is that the mean lifetimes of the blockade levels can range over an exhaustive set of possibilities for thorough testing of the HMM-with-Duration. The synthetic data was designed to have two levels, with lifetime in each level determined by a governing distribution (Poisson and Gaussian distributions with a range of mean values were considered). In Figure 8, the data was generated with Poisson distributed lifetimes and parsed by an HMM's with and without duration, with length distribution generated from a Poisson (Figure 8a) or a Gaussian ( Figure  8b) Distributions. The results clearly demonstrate the superior performance of the HMM-with-duration over its simpler, HMM without Duration, formulation. With use of the EVA-projection method described in the Background (and in [13]) this affords a robust means to obtain kinetic feature extraction. The emission broadening intro-duced in the HMM w/wo duration comparison in Figure  8a and 8b, is precisely what occurs with EVA-projection, with similar susceptibility to failure due to over-representation of brief blockade lifetimes in the HMM without Duration parse. Using HMM with duration this problem is resolved, brief toggles between states are now penalized according to the statistical rarity of the comparatively brief lifetime events (see Fig.'s 8a and 8b). Resolving this problem is critical for accurate kinetic feature extraction, and the results suggests that this problem can be elegantly solved with a pairing of the HMM-with-Duration stabilization with EVA-projection.

Discussion aTOPO
As already mentioned, the gIMM tool identifies statistically anomalous motifs (usually restricted to some zone upstream). Another tool being developed, "aTOPO" (for motifs with Anomalous TOPOlogy), uses the information from the upstream zones swept with the gIMM tool to annotate upstream motifs, and offers further information on whether an anomalous motif is biologically significant by looking at is positional distribution and its correlations to other motifs. This work is still preliminary, so this will not be discussed further.

Expectation/Maximization
Expectation/Maximization, EM, is a general method to estimate the maximum likelihood when there is hidden or missing data. The method is guaranteed to find a maximum, but it may only be a local maximum, as is shown here (along the lines of [1]). For a statistical model with parameters θ, observed quantities x, and hidden labels π, the EM goal is to maximize the log likelihood of the observed quantities with respect to θ : log P(x|θ) = log[Σ π P(x,π|θ)]. At each iteration of the estimation process we would like the new log likelihood, P(x|θ), to be greater than the old, P(x|θ*). The difference in log likeli-  For an HMM the hidden labels π correspond to a path of states. Along path π the emission and transition parameters will be used to varying degrees. Along path π, denote usage counts on transition probability a kl by A kl (π) and those on emission probabilities e kb by E k (b,π) (following [1] conventions), P(x,π|θ) can then be written: Using the above form for P(x,π|θ), A kl for the expected value of A kl (π) on path π, and E k (b) for the expected value of E k (b,π) on path π, it is then possible to write Q(θ|θ*) as: It then follows (relative entropy positivity argument again) that the maximum likelihood estimators (MLEs) for a kl and e kb are: The latter estimation is for when the state sequence is known. For an HMM (with Baum-Welch algorithm) it completes the Q maximization step (M-step), which is obtained with the MLEs for a kl and e kb . The E-step requires that Q be calculated, for the HMM this requires that A kl and E k (b) be calculated. This calculation is done using the forward/backward formalism with rescaling in the next section.
In terms of the previous notation with forward/backward variables: So the expected number of times a kl is used, A kl , simply sums over all positions i (except last with indexing): Similarly, the probability that b is emitted by state k at position i in sequence Z: where a Kronecker delta function is used to enforce emission of b at position i. The expected number of times b is emitted by state k for sequence Z:   a. Synthetic data with Poisson distributed length statistics is shown in the upper trace Figure 8 a. Synthetic data with Poisson distributed length statistics is shown in the upper trace. Emission broadening is introduced with an emission variance amplification factor of 20. This effectively broadens the noise band (thickness) seen in the upper trace by a factor of 20, which leads to a blurring between the upper and lower levels of blockade since the noise bands now overlap (i.e., a toggling cross-over instability is introduced to challenge the projection method). The middle trace shows the clean, highly accurate Viterbi parsing into the appropriate levels that is still obtained with the HMM-with-Duration implementation. The lower trace shows the Viterbi parse with a simple HMM, that is uninformed about the underlying length distributions, thus giving rise to a Viterbi traceback parse that fails to penalize unlikely, very short duration, blockade events (seen as the unstable, rapid level-projection toggles). b. Synthetic data with Gaussian distributed length statistics is shown in the upper trace, with the suceessful HMM-with-Duration parsing shown in the middle trace. Emission broadening is introduced with an emission variance amplification factor of 20 as in Fig. 8a, with a similar failure in the HMM-without-Duration's ability to parse critical kinetic feature information.
In practice, direct computation of the forward and backward variables can run into underflow errors. Rescaling variables at each step can control this problem. One rescaling approach is to rescale the forward variables such that Σ i F ki = 1, where F ki is the rescaled forward variable, and B ki is the rescaled backward variable: F ki = a βk e ki F β(i-1) / s i , and B ki = a kβ e β(i+1) B β(i+1) /t i+1 , where s i and t i+1 , are the rescaling constants. The expectation on counts for the various emissions and transitions then reduce to:

Gap Interpolating Markov Model (gIMM)
We construct a gIMM using Mutual Information (MI). Based on the three reading frames of the DNA, the Mutual Information is frame dependent. Find out a position from p = 1...24 and p ≠ p1 such that it has the maximum mutual information with frame position 0 and the prior position p1. If it is p2: MI(x0,x1;x2)>MI(x0,x1;xk).
Repeat the above procedure until we get an nth-order MI-linkage, this defines an interpolated Markov model that allows for gaps (gIMM). Likewise, for the frame 1 and 2, and reverse frames.
Most importantly, this can also be done for the regions upstream of coding regions, with reference base just prior to the start codon, in searches for regulatory motifs.

UL/LL Data Generation Method
We generated 75000 random numbers that were Gaussian distributed with mean 48 pA and variance 2.78. We also generated 50000 random numbers that were Gaussian distributed with mean 72 pA and variance 2.78.
Since the dwell times of the UL & LL are Poisson distributed, the transition times (duration times between state transitions) are assumed to be exponentially distributed. Therefore for data set 1, the mean interval between successive UL states is 40 ms and the mean interval between successive LL states is 20 ms.
Cumulant Distribution Function (CDF) of exponential distribution: F = 1 -exp (-t/a) where 'a' is the mean Hence t = a.log e (1 -F) We generate uniformly distributed 'F' between [0, 1). We then compute 't' from the above equation.
We have 50,000 samples per second. Hence a dwell time of 't' ms corresponds to 50.t samples rounded off to the nearest integer.
(page number not for citation purposes)

Information measures
The fundamental information measures are Shannon entropy, mutual information, and relative entropy (also known as the Kullback-Leibler divergence or distance). Shannon entropy: σ = -Σ x p(x)log(p(x)), is a measure of the information in distribution p(x). Mutual Information: μ = Σ x Σ y p(xy)log(p(xy)/p(x)p(y)), is a measure of information one random variable has about another random variable. Relative Entropy (Kullback-Leibler distance): ρ = Σ x p(x) log(p(x)/q(x)), is a measure of distance between two probability distributions. Mutual information is a The HMM dynamic programming table is modified to track new count index information at each cell to enable the HMM-with-Duration implementation Figure 9 The HMM dynamic programming table is modified to track new count index information at each cell to enable the HMM-with-Duration implementation. The figure shows a view of Viterbi modifications in the dynamic programming table -the transition probabilities are now modified to be a ratio of transition probability cumulants. The transition probability cumulant is calculated based on a predetermined length distribution profile and uses information on the length that is also a (new) cell-level parameter. There are two new cell-level parameters, one each for tracking length on UL and LL states. (For gene-finding, the information to track at the cell-level would be the length of the purported exon, intron, or junk, region.)