The number of generalized hidden Markov model (GHMM) gene finders reported in the literature has increased fairly dramatically of late [1–8], and the community is now contemplating various ways to extend this attractive framework in order to incorporate homology information, with a handful of such systems having already been built (e.g., [9–12]). GHMMs offer a number of clear advantages which would seem to explain this growth in popularity. Chief among these is the fact that the GHMM framework, being (in theory) purely probabilistic, allows for principled approaches to constructing, utilizing, and extending models for accurate prediction of gene structures.

While the decoding problem for GHMM gene finders is arguably well understood, being a relatively straightforward extension of the same problem for traditional HMMs and amenable to a Viterbi-like solution (albeit a more complex one), methods for optimally training a GHMM gene finder have received scant attention in the gene-finding literature to date. What information is available (e.g., [2, 4]) seems to indicate that the common practice is to optimize the submodels of the GHMM independently, without regard for the optimality of the composite model.

The training of HMMs and GHMMs has traditionally been carried out using some form of *maximum likelihood estimation* (MLE). Baum-Welch training [13], which is an instance of the well-known *expectation maximization* (EM) procedure, is itself a form of MLE [14]. In the case of GHMM gene finders, one typically applies some form of MLE to each of the submodels (states) in the GHMM so as to render training features of each type (e.g., exon, intron, donor site) maximally likely under the induced (sub)model; i.e., maximizing:

for state *q* and for *S*
_{
i
}a feature of length *d*
_{
i
}from the state-*q*-specific training set *T*. The submodels are then merged into a composite model (i.e., the full GHMM) by observing transition probabilities between features in the training data corresponding to each of the GHMM states.

For example, an exon state in a GHMM can be trained by collecting *n*-gram statistics (i.e., counts of *n*-letter substrings) from known exon sequences and normalizing these into transition probabilities for an (*n*-1)^{th}-order Markov chain [15]. Similarly, intron, intergenic, and untranslated region (UTR) states can be modeled by collecting appropriate statistics from corresponding sample features and using these to train individual content-scoring models, such as Markov chains, neural networks, decision trees, etc. Signal sensors for donor and acceptor splice sites and start and stop codons can be trained by aligning known signals of the appropriate type and counting nucleotide frequencies at each position within a fixed window around the signal; converting these counts to relative frequencies produces probability estimates for use in a weight matrix or similar type of model. Transition and duration probabilities can likewise be estimated by observing appropriate frequencies in training data. All of these estimation activities can be performed independently, resulting in a GHMM consisting of distinct subsets of maximum likelihood parameters.

Such an approach does not, however, attend to the global optimality of the GHMM as a whole. Ideally, one would like to maximize the expected accuracy of the gene finder on unseen data. A reasonable approximation to this ideal would be to maximize the average probability of the gene parses in the training set:

where the collection of model parameters making up the GHMM is denoted θ and the elements (*S*, φ) of the training set *T* comprise pairs of sequences *S* and their known parses φ. This argmax gives us the parameterization
under which the full gene *parses* (rather than the *sequences*) in the training set will be maximally likely (on average). Decomposing each parse φ into a series of (*q*
_{
i
}, *d*
_{
i
}) pairs, for state *q*
_{
i
}and state duration (i.e., feature length) *d*
_{
i
}, we get:

where *P*
_{
e
}, *P*
_{
t
}, and *P*
_{
d
}represent the emission, transition, and duration probabilities of the GHMM, respectively. Whereas the common MLE training procedure for GHMMs as described above optimizes the individual terms in the numerator of Equation 3 independently, the argmax above calls instead for these terms to be jointly tuned so as to optimize the entire ratio in parentheses. Intuitively, one can think of this alternate formulation as attempting to account for the process in the Viterbi algorithm (during later decoding) whereby the individual submodels "compete" for nucleotides (in the sense that each nucleotide can be emitted by only one submodel in any given parse, and the Viterbi algorithm chooses the final, predicted parse based on the values of the model parameters). Our hope is that by addressing the issue of submodel competition explicitly during parameter estimation, we will thereby empower the gene finder to better discriminate at a global sequence level between the features modeled by individual submodels in the GHMM, thereby producing more accurate gene predictions.

A similar optimization problem occurs in the field of speech recognition, in which systems of interacting acoustic models and language models are employed to optimally parse an audio stream into a series of discrete words. Interestingly, the trend in that field, starting with Bahl *et al*. in 1986 [16], has increasingly been away from the sole use of MLE and toward an alternative approach very similar to that prescribed by Equation 2 known as *global discriminative training* [17–19] or *conditional maximum likelihood* [20]. The problem also appears in a slightly different form in the related field of statistical natural language parsing, in which it has been suggested that global methods for optimizing competing stochastic grammar models may improve the accuracy of systems at the level of whole-sentence parses [21]. *Maximum discrimination HMMs* have already been applied successfully to problems in the realm of biological sequence analysis [22], though their use in gene finding has apparently not yet seen widespread adoption. To our knowledge, the only gene finder reported to use discriminative training is HMMgene [23], a gene finder based on a non-generalized HMM.

In light of these considerations, it is worth contemplating the possible gains in gene finder accuracy that might be obtained through the use of some form of discriminative training applied to a GHMM – that is, training aimed more directly at optimizing the ability of the gene finder to discriminate between exons and non-exons, thereby improving the expected accuracy of the gene finder's predictions. Anecdotal evidence already suggests that investigation of such methods may indeed be fruitful, as the process of manual tuning of GHMM parameters (i.e., "tweaking") after MLE training is commonly acknowledged by those with experience training GHMM-based gene finders (including our own systems). The practice of performing such tuning on the training set, especially when done iteratively, can be viewed as a manual form of gradient ascent optimization using the percentages of correctly predicted nucleotides, exons, and whole genes as surrogates for the Σ_{(S,φ)∈T} P(φ|S,θ) term in Equation 2.

We therefore decided to investigate the use of a simple form of global discriminative training for gene-finding. We did this by building a rudimentary gradient ascent optimizer and applying it to a subset of the model parameters for our GHMM-based gene finder, TigrScan, as described in the Methods.