# Error statistics of hidden Markov model and hidden Boltzmann model results

- Lee A Newberg
^{1, 2}Email author

**10**:212

**DOI: **10.1186/1471-2105-10-212

© Newberg; licensee BioMed Central Ltd. 2009

**Received: **09 February 2009

**Accepted: **09 July 2009

**Published: **09 July 2009

## Abstract

### Background

Hidden Markov models and hidden Boltzmann models are employed in computational biology and a variety of other scientific fields for a variety of analyses of sequential data. Whether the associated algorithms are used to compute an actual probability or, more generally, an odds ratio or some other score, a frequent requirement is that the error statistics of a given score be known. What is the chance that random data would achieve that score or better? What is the chance that a real signal would achieve a given score threshold?

### Results

Here we present a novel general approach to estimating these false positive and true positive rates that is significantly more efficient than are existing general approaches. We validate the technique via an implementation within the HMMER 3.0 package, which scans DNA or protein sequence databases for patterns of interest, using a profile-HMM.

### Conclusion

The new approach is faster than general naïve sampling approaches, and more general than other current approaches. It provides an efficient mechanism by which to estimate error statistics for hidden Markov model and hidden Boltzmann model results.

## Background

Hidden Markov models are employed in a wide variety of fields, including speech recognition, econometrics, computer vision, signal processing, cryptanalysis, and computational biology. In speech recognition, hidden Markov models can be used to distinguish one word from another based upon the time series of certain qualities of a sound [1]. In finance, the models can be used to simulate the unknown transitions between low, medium, and high debt default regimes in time [2]. In computer vision they can be used to decode American Sign Language (ASL) [3]. Hidden Markov models are used in computational biology to find similarity between sequences of nucleotides (DNA or RNA) or polypeptides (proteins) [4, 5] and to predict protein structure [6].

Hidden Markov models permit the facile description and implementation of powerful statistical models and algorithms that are used for calculation of the *probability* of sequential data. Furthermore, the algorithms used to manipulate hidden Markov models are easily applied more generally. Frequently these dynamic programming algorithms are instead employed in the calculation of an odds ratio, which is the the ratio of the probability of sequential data under a foreground model (signal), divided by the probability of the sequential data under a background model (noise). In other applications, the algorithms are used to compute other scores, frequently employed as proxies for logarithmic probabilities or logarithmic odds ratios, even though the scores are not directly derived from known foreground and background statistical models. Below, we will precisely define a hidden Boltzmann model as a hidden Markov model generalization that admits these odds ratio and other score calculations.

- 1.
Is the score strong enough to indicate a signal, or is it reasonably probable that noise will yield a score this strong?

- 2.
Is the score weak enough to indicate noise, or is it reasonably probable that a signal will yield a score this weak?

The *false positive rate* (closely related to the *type I error* or *p-value*) for a score threshold is the probability that noise data will yield a score at least as strong as the threshold. The *true positive rate* for a score threshold is the probability that signal data will yield a score at least as strong as the threshold.

Within computational biology, error statistics are used primarily in the subfield of sequence alignment, where specialized approaches exist for computing them. (See the next section.) We are hopeful that the availability of the general approach we describe here will enable the productive use of error statistics in other subfields of computational biology, and in other scientific fields where error statistic estimation has been difficult.

### Prior work

Methods for estimating the false positive rate exist in some settings. For instance, we can consider the Smith-Waterman pairwise local alignment algorithm [7], which can be interpreted as a maximum path score calculation via a hidden Boltzmann model. This well-established algorithm scores the extent to which two sequences have similar subsequences; recent techniques permit efficient estimation of false positive rates for this algorithm to levels as low as 10^{-4000}[8]. Efficient estimation is also available for the more general local profile-HMM sequence alignments [9].

Furthermore, in the special case where a hidden Boltzmann model computes a logarithmic odds ratio and where the score threshold is not too extreme, there is a generally applicable technique [10]. In this prior work, each probability parameter of a hidden Markov model is modified to be a weighted arithmetic average of applicable background and foreground probabilities, (1 - *α*)*p*_{
B
}+ *αp*_{
F
}for *α* ∈ [0, 1], where *p*_{
B
}is the applicable probability under the background model, and *p*_{
F
}is the applicable probability under the foreground model. When the score function for a hidden Boltzmann model happens to be a logarithmic odds ratio, the technique we present here can be described similarly. However, under such a circumstance, our modified hidden Boltzmann model has an "unnormalized" probability that is a weighted geometric average of the background and foreground probabilities, *p*_{
B
}^{1-α}*p*_{
F
}^{
α
}for *α* ≥ 0. (Note that even in this limited context of logarithmic odds ratios, we are able to estimate error statistics for higher score thresholds than are achievable in the prior work because, by permitting any *α* ≥ 0, we allow an extrapolation beyond the *p*_{
F
}value.)

Here we expand and extend the previous false-positive-rate result for pairwise sequence alignments [8] to the class of hidden Boltzmann models, which includes the class of hidden Markov models. In particular, we extend the result to biologically relevant hidden Markov models of all sorts, not just profile-HMMs. We demonstrate the new technique in the Method sections, and we show that the approach is applicable to true positive rates as well. We make use of a novel importance sampling distribution and provide a novel approach to computing its normalization.

In the Results section, we discuss our application of the technique within the HMMER 3.0 package, which permits scanning nucleotide and polypeptide sequence databases for patterns of interest. In particular, we show that it works well with HMMER global alignments.

## Methods

We describe first our models and then the algorithms we use to manipulate them.

### Models

For our building blocks we assume that we are given: a hidden Boltzmann model that computes scores of interest for sequences of interest, a simple *background model* (also termed *null model* or *random model*) that describes *noise* sequences, and a computable *foreground model* (also termed *alternative model* or *hypothesis model*) that describes *signal* sequences. Each of these will be described more thoroughly in the following.

#### Hidden Boltzmann models: states, transitions, and emissions

With little or no modification, many algorithms applicable to hidden Markov models are useful more generally. These algorithms, including the present work, function not only with the strict probabilities of a hidden Markov model, but also with odds ratios, with logarithms of probabilities or logarithms of odds ratios, and with scores used as proxies for such logarithms. Nonetheless, the term "hidden Markov model" is more restrictive and does not admit these generalizations. To accommodate these generalizations, we coin the term "hidden Boltzmann model," so named because of earlier work with Boltzmann chains and Boltzmann machines [11, 12]. Much as is done with Boltzmann chains and Boltzmann machines, we describe hidden Boltzmann models in terms of scores rather than probabilities. Although these scores are often scaled logarithmic probabilities or scaled logarithmic odds ratios, in general they need not be. A hidden Boltzmann model consists of a set of *states* and a set of directed *transitions* between states. Any state or any transition can be designated as an *emitter*. Each emitter includes a specification of the set of *emissions* that it can produce; these emissions are from an *alphabet*, the set of all possible emissions. Furthermore, each state, each transition, and each emission of each emitter has a real-valued *score* (also termed *energy*) associated with it.

An *emission path* through a hidden Boltzmann model starts at a special *start* state, ends at a special *terminal* state, proceeds from state to state via transitions, and includes a choice of emission for each encounter with each emitter. The *sequence* associated with an emission path is the ordered set of emissions.

The *score* of an emission path is the sum of the encountered transition, state, and emission scores; each score is included in the sum each time that it is encountered along the emission path. Note that when each of the scores is the scaled logarithm of a probability, the summing of scores along an emission path gives the scaled logarithm of the joint probability of events modeled as statistically independent.

*a*+ (

*h*+

*t*)

*b*+

*hc*+

*td*, where

*a*,

*b*,

*c*, and

*d*are the real-valued scores associated with the transitions and emissions, where

*h*is the number of "H" characters emitted and

*t*is the number of "T" characters emitted, and where state and transition scores not indicated in the figure are assumed to be zero. Note that hidden Boltzmann models are not restricted to emitting from discrete alphabets, such as the present {H, T}; a hidden Boltzmann model can emit arbitrary real number values, for example. As in the discrete alphabet case, each possible emission for each emitter has an associated score. In this continuous case, such a score is equal to, or is a proxy for, the logarithm of a probability density or the logarithm of the ratio of a foreground probability density to a background probability density.

#### Multiple emission paths for an emitted sequence

*hidden*because, in most cases, an underlying emission path cannot be uniquely determined from its sequence of emissions. In other words, a given sequence can typically be emitted by any of several emission paths through a hidden Boltzmann model, although that is not the case for the simple model of Figure 1. In this more general case (see Figure 2, a HMMER Plan7 profile-HMM [13]), the score associated with an emission sequence is usually determined in either one of two ways,

*maximum score*(also termed

*Viterbi score*) or

*forward score*[1, 4].

*D*, the maximum score

*s*

_{max}(

*D*) is the largest score achievable by an emission path that emits the sequence

*D*:

where *π* ∈ *π*_{
D
}indicates that any emission path *π* that emits *D* should be considered, and where *s*(*π*) is the sum of all state, transition, and emission scores encountered on the emission path *p*. Despite the usually combinatorially large number of emission paths *π* ∈ *π*_{
D
}, the value *s*_{max}(*D*) is efficiently computable by the standard Viterbi dynamic programming algorithm.

*s*

_{fw}(

*D*) for a sequence

*D*reflects a hidden Markov model interpretation of the hidden Boltzmann model. For any transition, state, or emission score

*s*from the hidden Boltzmann model, the value exp(

*s*) is treated as if it were a corresponding hidden Markov model probability, even though generally it is not actually a probability. (For instance, these values do not behave as probabilities in that, for any given state of a hidden Boltzmann model, the outgoing transition exp(

*s*) values need not sum to one.) Because an emission path's score

*s*(

*π*) is computed as the sum of the scores encountered as it is traversed, exp(

*s*(

*π*)) is interpreted as the product of the encountered probabilities. Furthermore, exp(

*s*

_{fw}(

*D*)) is computed as if it were the overall probability of an emitted sequence

*D*, where distinct emission paths through the model are assumed to be statistically disjoint events:

The name *forward* comes from the algorithm used to calculate this sum. The algorithm has run-time and memory efficiency comparable to those for the corresponding *s*_{max}(*D*) algorithm [1, 4].

*partition function Z*(

*D, T*) and corresponding

*free score s*

_{free}(

*D, T*) for any

*temperature T*∈ (0, +∞) are

Note that exp(*s*_{free}(*D, T*)/*T*) can be computed via a minor modification to the forward algorithm that computes exp(*s*_{fw}(*D*)); it is the values of exp(*s/T*), exp(*s*(*π*)/*T*), and exp(*s*_{free}(*D*)/*T*) that are treated as if they were hidden Markov model probabilities in the forward algorithm. The run-time and memory efficiency for the *s*_{free}(*D*, *T*) computation are essentially the same as those for the *s*_{fw}(*D*) or *s*_{max}(*D*) computation. We will make use of this partition function in the following.

#### The background model

*L*. Specifically, we assume that under a background model

*B*, the

*L*sequence positions are statistically independent and identically distributed according to some shared probability distribution Pr(

*d|B*), where

*d*indicates a possible emission:

where *d*_{
i
}is the *i* th emission of the sequence *D*. This assumption might be relaxed; see *Complex background models* in the Discussion section.

#### Mathematical problem statement

*D*of length

*L*is compared to other sequences of the same length. We write

where the false positive rate fpr(*s*_{0}) that we wish to estimate is the probability-weighted fraction of background model sequences of length *L* that achieve a score of at least *s*_{0}, where *D* ∈ *D*_{
L
}indicates that any sequence *D* of length *L* should be considered, where Pr(*D|B*) is the probability of a sequence *D* under the background model, where *s*(*D*) is the score assigned to the sequence *D* by the hidden Boltzmann model, and where Θ is a function that has value one if its argument is true or value zero if its argument is false. We write *s*(*D*) to indicate that this definition applies to *s*(*D*) = *s*_{max}(*D*) and to *s*(*D*) = *s*_{fw}(*D*).

### Algorithm

#### Importance sampling

*s*

_{0}) can be estimated via naïve sampling. That is, sequences are sampled/generated according to the background model

*B*, and fpr(

*s*

_{0}) is estimated by the fraction of the sampled sequences with a score of at least

*s*

_{0}. We note that if Pr(

*D|T*) is the probability of a sequence

*D*under some other model for sequences of length

*L*that is parameterized by a value

*T*, then we can write

*s*

_{0}) by sampling sequences according to this alternate model, and then averaging the corresponding

*f*(

*D, s*

_{0}) values. This approach is called

*importance sampling*[14]. Importance sampling is useful because estimation via Equation 6 can be substantially more efficient than estimation via Equation 5. That is, in terms of the variances of the the estimators, often it is possible to find an importance sampling model for which

#### Choice of importance sampling distribution

*T*as

*Z*(

*T*) is the normalization of the Pr(

*D|T*) probability distribution and is defined as

The value that we will choose for the parameter *T* ∈ (0, +∞) has yet to be specified.

#### Importance samples

Ultimately we wish to draw sample sequences according to the distribution Pr(*D|T*), compute *f* (*D*, *s*_{0}) for each sample, and use the average of these values as our estimate for the false positive rate. Here we describe the sampling of sequences.

*Z*(

*T*) via a novel modification to the forward algorithm that computes

*Z*(

*D*,

*T*). In the forward calculation of

*Z*(

*D*,

*T*), the emission of a value

*d*from an emitter

*E*is incorporated via a factor exp(

*s*

_{ E }(

*d*)/

*T*), where

*s*

_{ E }(

*d*) is the score associated with the emission of

*d*from the emitter

*E*. In the forward calculation for

*Z*(

*T*), instead of such a factor we use the average factor for the emitter ⟨exp(

*s*

_{ E }/

*T*)⟩

_{ B },

regardless of the value of *d*. Because the needed pre-computation and caching of these average factors are typically significantly faster than is the forward score computation, the run time for the *Z*(*T*) calculation is essentially the same as that for the *Z*(*D*, *T*) or *s*(*D*) computations.

*L*via stochastic backtrace of the

*Z*(

*T*) forward computation. Specifically, we sample the states and transitions of an emission path

*π*from the

*Z*(

*T*) computation via standard hidden Markov model techniques for stochastic backtrace [1, 4, 15]. In addition, we sample emissions for the emission path, where the probability that a value

*d'*is sampled for an encounter with an emitter

*E*is

*p*(

*i.e*., its states, transitions, and emissions) from the probability distribution

*D*. Because the sequence

*D*could have arisen from any emission path

*π*that emits it, the probability that we will sample

*D*by this approach is

which is the promised importance sampling distribution of Equation 9.

#### Estimation of the false positive rate

*s*

_{0}, for either maximum scores or forward scores depending upon the application. For each of

*N*sampled sequences {

*D*

_{ i }:

*i*= 1 ...

*N*}, we compute

*s*(

*D*

_{ i }) and

*Z*(

*D*

_{ i },

*T*). An estimate for fpr(

*s*

_{0}) is then

*s*

_{0}) with

In our implementation for HMMER 3.0 (see the Results section) we have found
to work well. The choice for the best estimator usually depends upon the efficiency of the estimators, which can be estimated from the *N* importance sampled sequences.

#### Estimation of the true positive rate

*s*

_{0}) is the probability-weighted fraction of foreground model sequences of length

*L*that achieve a score of at least

*s*

_{0}, where Pr(

*D|F*) is the probability of a sequence

*D*of length

*L*under the foreground model

*F*, and where fnr(

*s*

_{0}) is the false negative rate. The importance sampling estimate derives from the relationship

#### Special case for the true positive rate

*H*and the background model

*B*, and that the foreground model

*F*is the restriction of the model

*H*to sequences of a length

*L*:

*D*(

*π*) is the sequence emitted by the emission path

*π*. Use of Equation 24 in Equation 3 and in Equation 10 gives

We can define the estimators
and
in a manner analogous to the definitions of the estimators
and
in Equations 18 and 19. Note that when we are estimating the true positive rate for a *forward* score threshold, we already have *Z*(*D*_{
i
}, 1) in hand; it is equal to exp(*s*_{fw}(*D*_{
i
})) and *s*_{fw}(*D*_{
i
}) is needed for the computation of Θ(*s*(*D*_{
i
}) ≥ *s*_{0}).

#### Choice of temperature

To complete the the Algorithm section, we need an approach to select a temperature *T* that will be efficient for a given score threshold *s*_{0}. Because the relationship between temperature and score threshold is not straightforward, we recommend the building of a calibration curve to relate temperature to maximum score, and the building of a second calibration curve to relate temperature to forward score. Furthermore, for both maximum scores and forward scores, we have empirically observed lower variances for error statistic estimation when the fraction of sampled sequences exceeding the given score threshold is 20–60%; thus, we recommend aiming for a value for the importance sampling temperature parameter that achieves this statistic.

Conceptually, the approach to building a calibration curve is straightforward. For each of several temperatures, compute *Z*(*T*) and perform several stochastic backtraces to sample several sequences from the Pr(*D|T*) distribution. For each sampled sequence *D*, compute its score *s*(*D*). Plot the resulting temperature-score pairs {(*T*_{
i
}, *s*_{
i
})} as points in the *x*-*y* plane. Via some reasonable *ad hoc* procedure, use such a plot to choose a temperature for each score threshold of interest. Once a temperature is selected, draw and process *N* importance samples to compute the error statistics, as previously described.

## Results

Using the alpha-release source code for the HMMER 3.0 package [9], we randomly generated a length *M* = 100, Plan7 profile-HMM, and we estimated its error statistics for local-alignment scans of polypeptide sequences of length *L* = 200. Our calibration curves had 50 temperatures, and required 100 calculations of *s*_{max}(*D*) for the maximum score threshold calibration curve and 100 calculations of *s*_{fw}(*D*) for the forward score threshold calibration curve. We used the appropriate calibration curve to choose a temperature for each score threshold that we subsequently considered. For each of 1000 maximum score thresholds and for each of 1000 forward score thresholds, we estimated the false positive and true positive rates.

For each forward score threshold, we used *N* = 100 calculations of *s*_{fw}(*D*), 100 calculations of *Z* (*D*, *T*), and 1 calculation of *Z*(*T*) to estimate the false positive rate, and an additional 1 calculation of *Z*(1) to estimate the true positive rate. Similarly, for each maximum score threshold, we used 100 calculations of *s*_{max}(*D*), 100 calculations of *Z*(*D*, *T*), and 1 calculation of *Z*(*T*) to estimate the false positive rate, and an additional 100 calculations of *s*_{fw}(*D*) and 1 calculation of *Z* (1) to estimate the true positive rate. Because all other parts of the error statistic calculations require comparatively little run time, the calculation of both error statistics for a specified forward or maximum score threshold required 202–302 times the run time of a typical *s*(*D*) calculation. The error statistics calculation for a score threshold is 4.2–6.3 seconds on our platform. Note, however, that considerable savings in run time could have been achieved through the selective re-use of samples from one score threshold for another; see *Re-use of simulations* in the Discussion section.

^{-20}would require a run-time longer than the present age of the universe. Special purpose approaches, such as that for profile-HMM local sequence alignments, are typically faster than the importance sampling approach. Computed false positive rates and true positive rates, as a function of score threshold, are plotted in Figure 3. See the Discussion section.

Previously, we applied the algorithm to real DNA sequences; we employed the approach to analyze Smith-Waterman pairwise local alignments of intergenic regions in five Drosophila species, and easily estimated false positive rates as low as 10^{-400}[16].

## Discussion

We have provided a technique for the error statistic estimation of hidden Boltzmann model results. For all but the lowest hidden Boltzmann model scores, the presented technique is significantly more efficient than naïve simulation. We have demonstrated the effectiveness of the technique in the HMMER 3.0 package for scanning sequence databases.

### Review of results

#### Applicability to hidden Markov models

The approach for the general class of hidden Boltzmann models is easily specialized to hidden Markov models. The natural logarithm of any transition or emission probability *p* of a hidden Markov model is used in lieu of the corresponding score *s* in a hidden Boltzmann model. In particular, in the above formulae any occurrence of exp(*s*) should be replaced with *p*, and any occurrence of exp(*s/T*) should be replaced with *p*^{1/T}.

#### Linearity of error statistics as a function of score threshold

The lowermost two curves plotted in each panel of Figure 3, for maximum score and forward score *false positive rates*, are relatively straight for false positive rates above 10^{-100}. This behavior is consistent with theory and observations that these curves represent Gumbel distributions [9, 17]. Also as expected, the curves bend downward as the scores become more extreme. This is an indication that the Gumbel distribution result, which applies asymptotically as sequence lengths increase without bound, breaks down for extreme scores. This break down has been observed before [8, 18] and is expected [19]: in short, whether for maximum score or forward score, a highest achievable score exists among sequences of a fixed length *L*, and the curves will go to a false positive rate of zero as that score is approached.

The two uppermost curves in each panel of Figure 3, for maximum score and forward score *true positive rates*, are linear for scores under 100, and bend downward for more extreme scores. We are unaware as to whether this low-score linear behavior has been observed or predicted previously. Unlike the false positive rate curves, these curves experience a "phase transition," near a score of 125; the slope of the curves changes and then enters another linear regime. The cause of this phase transition merits further exploration.

#### Non-extreme error statistics

In our experience, importance sampling is more efficient than naïve sampling for false positive rates under 10^{-6}. For higher values, especially above 10^{-3}, the relationship of Equation 8 breaks down, and naïve sampling is often more efficient. Furthermore, the scores that yield these false positive rates also demark the transition in relative efficiency for true positive rate estimation.

### Future directions

#### Real problem instances

A significant shortcoming of the present work is our insufficient testing on real problem instances. Except for the special case of Smith-Waterman local DNA alignments [16], this hidden Boltzmann model technique has not been tested on real data. In future work we anticipate demonstrating the effectiveness of the present work on scans of actual protein and nucleotide databases, using accepted hidden Boltzmann models that are designed to identify common evolutionary history and/or common functionality. Such testing has been important in prior work [9, 20].

#### Scaling to different problem instances

In past work, for Smith-Waterman local sequence alignments, we have noted that the logarithmic false positive rate curves, such as those depicted in Figure 3, are remarkably conserved in shape [8]. That is, we have observed that an affine transformation of a logarithmic false positive rate curve for sequences of some length *L*_{1} is a remarkably good approximation of the corresponding curve for sequences of some other length *L*_{2}. Furthermore, the needed affine transformation is easy to calculate without simulations; it is the unique transformation that takes both the (minimum score, maximum logarithmic false positive rate) point and the (maximum score, minimum logarithmic false positive rate) point for sequences of length *L*_{1} to the corresponding points for sequences of length *L*_{2}.

The extent to which conservation of shape applies to the general class of hidden Boltzmann model error statistic curves is a topic that merits further consideration.

#### Re-use of simulations

When the score thresholds in a set of interest are not too different from one another, a single temperature and a single set of *N* sampled sequences can be used to calculate the error statistics for the entire set of score thresholds. The calculations of *s*(*D*), *Z*(*D*, *T*) and *Z*(*D*, 1) are the most time-intensive part of the error statistics calculations, but they need be performed only once for each sampled sequence. Therefore, the error statistics for a set of nearby score thresholds can be estimated almost as quickly as they can be estimated for a single threshold. In particular, if error statistics are needed for a large number of score thresholds, it will be productive to cache all computed *s*(*D*), *Z*(*D*, *T*) and *Z*(*D*, 1) values at each employed temperature, for possible use with subsequent score thresholds. However, because the efficiency of the error statistic estimators depends significantly upon the choice of temperature, use of samples from a given temperature *T* should be avoided unless 20–60% of the samples for that temperature satisfy *s*(*D*) ≥ *s*_{0}. While a run time equal to a few hundred score calculations is much shorter than is achievable by previous techniques, it is still undesirably slow for many applications, including HMMER 3.0. Importance sample caching and error statistic curve scaling will help to bring down the overall run time required for multiple error statistic estimations.

#### Other scoring functions

We expect that the techniques presented here will successfully carry over to free score, average score, and other definitions of score.

#### Complex background models

A modification of Equation 23, which is for estimating true positive rates under an arbitrary *foreground* model, might yield efficient estimation of false positive rates under a complex *background* model.

*B*that does satisfy Equation 4 then

(together with Equations 6, 3, and 10) prescribes an importance sampling approach for computing false positive rates under the complex background model. Under what circumstances this approach will be efficient is an open question.

#### Stochastic context-free grammars

The present technique can be applied to the Inside/Outside algorithms that manipulate stochastic context-free grammars [21]; much as we have described here, use of a *p*^{1/T}value in lieu of each probability *p* in a grammar gives an unnormalized probability distribution that can be used for importance sampling. We conjecture that the resulting importance sampling distribution will lead to significantly more efficient estimation than naïve sampling. In computational biology, stochastic context-free grammars are used with RNA secondary structure [4, 22], though we have not seen statistical significance estimation in this context.

## Conclusion

We have demonstrated a technique for error statistic estimation for hidden Boltzmann models and shown how it is applied to hidden Markov models. The approach is faster than naïve sampling approaches and is more general than other current approaches.

## Declarations

### Acknowledgements

We thank Sean R. Eddy for access to his HMMER 3.0 source code, for valuable feedback on this manuscript, and for permission to use the above Figure 2, from the HMMER User's Guide [13]. We thank the anonymous reviewers for their many constructive criticisms. We thank the Computational Molecular Biology and Statistics Core Facility at the Wadsworth Center for the computing resources to make these calculations. This research was supported by Health Research, Inc. grant 11-6592-14 to LAN.

## Authors’ Affiliations

## References

- Rabiner LR, Juang BH: An introduction to hidden Markov models.
*IEEE ASSP Mag*1986, 3: 4–16. 10.1109/MASSP.1986.1165342View ArticleGoogle Scholar - Banachewicz K, Lucas A, Vaart A: Modelling portfolio defaults using hidden Markov models with covariates.
*Econometrics J*2008, 11: 155–171. 10.1111/j.1368-423X.2008.00232.xView ArticleGoogle Scholar - Vogler C, Metaxas D: Adapting hidden Markov models for ASL recognition by using three-dimensional computer vision methods.
*IEEE International Conference On Systems, Man, and Cybernetics, Computational Cybernetics And Simulation*1997, 1: 156–161. 10.1109/ICSMC.1997.625741Google Scholar - Durbin R, Eddy S, Krogh A, Mitchison GJ:
*Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids*. Cambridge, United Kingdom: Cambridge University Press; 1998.View ArticleGoogle Scholar - Mitrophanov AY, Borodovsky M: Statistical significance in biological sequence analysis.
*Brief Bioinform*2006, 7: 2–24.View ArticlePubMedGoogle Scholar - Bystroff C, Shao Y: Fully automated ab initio protein structure prediction using I-SITES, HMMSTR and ROSETTA.
*Bioinformatics*2002, 18(Suppl 1):S54-S61.View ArticlePubMedGoogle Scholar - Smith TF, Waterman MS: Identification of common molecular subsequences.
*J Mol Biol*1981, 147: 195–197.View ArticlePubMedGoogle Scholar - Newberg LA: Significance of gapped sequence alignments.
*J Comput Biol*2008, 15(9):1187–1194.PubMed CentralView ArticlePubMedGoogle Scholar - Eddy SR: A probabilistic model of local sequence alignment that simplifies statistical significance estimation.
*PLoS Comput Biol*2008, 4(5):e1000069.PubMed CentralView ArticlePubMedGoogle Scholar - Barash Y, Elidan G, Kaplan T, Friedman N: CIS: Compound importance sampling method for protein-DNA binding site p-value estimation.
*Bioinformatics*2005, 21(5):596–600.View ArticlePubMedGoogle Scholar - Saul LK, Jordan MI: Boltzmann chains and hidden Markov models. In
*Proceedings of the 1994 Conference on Advances in Neural Information Processing Systems 7*. Edited by: Tesauro G, Touretzky DS, Leen TK. Cambridge, MA: MIT Press; 1995:435–442.Google Scholar - MacKay DJC: Equivalence of linear Boltzmann chains and hidden Markov models.
*Neural Computation*1996, 8: 178–181. 10.1162/neco.1996.8.1.178View ArticleGoogle Scholar - Eddy SR:
*HMMER User's Guide: Biological sequence analysis using profile hidden Markov models*. 2.3.2, Howard Hughes Medical Institute and Dept. of Genetics Washington University School of Medicine, Saint Louis, MO; 2003.Google Scholar - Hammersley JM, Handscomb DC:
*Monte Carlo Methods*. New York: Wiley; 1964.View ArticleGoogle Scholar - Newberg LA: Memory-efficient dynamic programming backtrace and pairwise local sequence alignment.
*Bioinformatics*2008, 24(16):1772–1778.PubMed CentralView ArticlePubMedGoogle Scholar - Newberg LA, Lawrence CE: Exact calculation of distributions on integers, with application to sequence alignment.
*J Comput Biol*2009, 16: 1–18.PubMed CentralView ArticlePubMedGoogle Scholar - Karlin S, Altschul SF: Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes.
*Proc Natl Acad Sci USA*1990, 87: 2264–2268.PubMed CentralView ArticlePubMedGoogle Scholar - Wolfsheimer S, Burghardt B, Hartmann AK: Local sequence alignments statistics: deviations from Gumbel statistics in the rare-event tail.
*Algorithms Mol Biol*2007., 2: article 9 article 9Google Scholar - Altschul SF, Gish W: Local alignment statistics.
*Methods Enzymol*1996, 266: 460–480.View ArticlePubMedGoogle Scholar - Karplus K, Karchin R, Shackelford G, Hughey R: Calibrating E-values for hidden Markov models using reverse-sequence null models.
*Bioinformatics*2005, 21(22):4107–4115.View ArticlePubMedGoogle Scholar - Lari K, Young SJ: The estimation of stochastic context-free grammars using the Inside-Outside algorithm.
*Computer Speech and Language*1990, 4: 35–56. 10.1016/0885-2308(90)90022-XView ArticleGoogle Scholar - Ding Y, Lawrence CE: A Bayesian statistical algorithm for RNA secondary structure prediction.
*Comput Chem*1999, 23(3–4):387–400.View ArticlePubMedGoogle Scholar

## Copyright

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.