HMM-ModE – Improved classification using profile hidden Markov models by optimising the discrimination threshold and modifying emission probabilities with negative training sequences

Background Profile Hidden Markov Models (HMM) are statistical representations of protein families derived from patterns of sequence conservation in multiple alignments and have been used in identifying remote homologues with considerable success. These conservation patterns arise from fold specific signals, shared across multiple families, and function specific signals unique to the families. The availability of sequences pre-classified according to their function permits the use of negative training sequences to improve the specificity of the HMM, both by optimizing the threshold cutoff and by modifying emission probabilities to minimize the influence of fold-specific signals. A protocol to generate family specific HMMs is described that first constructs a profile HMM from an alignment of the family's sequences and then uses this model to identify sequences belonging to other classes that score above the default threshold (false positives). Ten-fold cross validation is used to optimise the discrimination threshold score for the model. The advent of fast multiple alignment methods enables the use of the profile alignments to align the true and false positive sequences, and the resulting alignments are used to modify the emission probabilities in the original model. Results The protocol, called HMM-ModE, was validated on a set of sequences belonging to six sub-families of the AGC family of kinases. These sequences have an average sequence similarity of 63% among the group though each sub-group has a different substrate specificity. The optimisation of discrimination threshold, by using negative sequences scored against the model improves specificity in test cases from an average of 21% to 98%. Further discrimination by the HMM after modifying model probabilities using negative training sequences is provided in a few cases, the average specificity rising to 99%. Similar improvements were obtained with a sample of G-Protein coupled receptors sub-classified with respect to their substrate specificity, though the average sequence identity across the sub-families is just 20.6%. The protocol is applied in a high-throughput classification exercise on protein kinases. Conclusion The protocol has the potential to maximise the contributions of discriminating residues to classify proteins based on their molecular function, using pre-classified positive and negative sequence training data. The high specificity of the method, and increasing availability of pre-classified sequence data holds the potential for its application in sequence annotation.


Background
Protein homology is used as the basis for studying its phylogeny and predicting its function. A preliminary step in annotation of protein function from its sequence, is to compare it against a database of functionally annotated sequences and infer function based on similar conservation patterns to known homologues. As databases of sequences with known functions are large, fast heuristic methods based on extending local alignments such as BLAST [1] and FASTA [2] are commonly employed for this task.
Improved sensitivity in detecting homologues is provided by profile-sequence comparison methods such as PSI-BLAST [3] -which uses position specific scoring matrices, and HMMER [4] which uses a profile Hidden Markov Model (HMM). A profile is developed from a multiple alignment and contains more information on the sequence family than a single sequence, providing a base for detecting homologs with discontinuous conservation patterns, and remote homologues.
Patterns of sequence conservation can arise from both phylogenetic and functional relationships between proteins [5]. Proteins perform a wide variety of functions, but share a comparatively small number of folds. The TIMbarrel fold, as an extreme example, includes oxidoreductases, lyases, hydrolases and isomerases, which are examples of divergent evolution of function within the fold [6]. These proteins, while within each class contain functionspecific signals, share fold-specific signals across the functional groups. The development of profile-profile based methods, (e.g. HHSEARCH [7], COACH [8]) maximises the contribution of common signals between profiles, providing even greater sensitivity in detecting remote homologs, and have proven useful for fold classification. Profile HMM databases are now commonly used to assign a protein to a structural class: the Superfamily database [9] which maps profiles to SCOP [10] structures, and the Pfam database [11] -which is a database of protein families largely based on domains.
An important goal of sequence annotation is the ability to assign molecular function to a protein sequence. Phylogenomic inference attempts to annotate protein function in the context of its entire family, and though has improved accuracy and specificity, its universal applicability is hampered by the fact that it is a labor-intensive manual process that requires significant effort from dedicated scientists [12]. Sjolander and co-workers have used "subfamily HMMs", built from a multiple alignment of the protein family decomposed into functionally distinct subfamilies, in classifying sequences with a very low error rate [13].
As sequences are increasingly being classified on the basis of their common function -e.g the Gene Ontology project [14] (see ref [12] for more examples), function specific profiles are important goals in the ability to annotate sequences. HMMs built from a functionally classified sub-family often pick up sequences belonging to other sub-families because of fold signals common to the family. Pre-classified data however, provides for the use of both positive and negative training sequences. Negative training sequences have been used before, both to modify emission probabilities [15] and transition probabilities [16]. Both methods employ the Viterbi algorithm to align negative training sequences to the model, and change its probabilities during the training stage. Hannehalli and Russell [17] have used positional entropy to assess the discerning value of an amino acid position in a multiple alignment, sub-classify sequences, and score sequences against HMMs to remove the influence of non-discriminating residues. Kernel based methods, notably the Support Vector Machine (SVM) have been applied to classify sequences both at the fold level [18] and at the sub-family level [19].
A multiple alignment of representative protein kinase sequences divided into sub-families is illustrative of the problem faced in using the HMM of a sub-family for classification ( Figure 1). Easily apparent are the large number of columns which are conserved across all sub-families, representative of the fold signals. Amino acids selectively conserved in one sub-family are responsible for its specific function, and this information is used to discriminate sequences from the other sub-families. G-protein coupled receptors (GPCR) have also been classified hierarchically [20] and have been used to test the application of kernel based support vector machines as classifiers. This dataset provides another test for discriminating methods -they share a common fold but with limited sequence similarity across the family. As the HMM built from a family of sequences contains both common fold and function specific signals, the availability of a negative sequence data set allows the use of methods that optimise the discrimination threshold to separate sequences based on their function. Further separation, if necessary, maybe provided by modifying model parameters to minimize the influence of fold-specific signals and/or maximise the influence of specificity determining residues. These methods could be applied to classify proteins on the basis of their function in spite of their sharing a common fold.
We describe the use of cross-validation [21] to optimise the threshold to improve specificity for a particular subfamily profile HMM. From different measures of estimating classification accuracy, we choose the mode of the Matthews Correlation Coefficient (MCC) [22] distribution as the optimal threshold (referred henceforth in the A multiple alignment showing the common fold specific signals, along with the group specific sub-family function specific signals Figure 1 A multiple alignment showing the common fold specific signals, along with the group specific sub-family function specific signals. Alscript [36] figure showing a portion of the alignment of representatives of six protein kinase families discussed in the text. The alignment is coloured based on residue conservation: Red and pink -identical and conserved across all families -correspond to fold signals, and blue and green -identical and conserved within a family. Positions predicted to confer specificity for the family [35] are highlighted in yellow. Deleted regions are indicated by dashes (---). Numbers below the alignment correspond to the PDB structure 2f7z.  GRK_3005003   2f7z_E   PKA_9844082   PKA_2144416   PKA_7512016   PKA_157220   PKA_85121   PKA_6679467   PKA_3123587   PKA_4322296   PKA_14758310   PKA_200367   50  60  70  80  90  100 110 120  110  120  130  140  150  160  170  180  190  200  210  220 text as HMM-t). In addition, improved specificity is imparted by using negative training sequences to modify model parameters -the emission and transition probabilities -to make the model more specific (referred henceforth as HMM-ModE). The method is applied in a highthroughput classification exercise to finely classify a sample derived from an earlier fold level data mining of protein kinases [23]. The sub-family profile HMM with default threshold will be referred to as HMM-d to differentiate from the profile HMM used with the optimal discriminating threshold, HMM-t.

Improved specificity is obtained by optimising the discrimination threshold
The HMM built from positive training sequences contains sub-family specific signals in addition to the common signals that arise from the family fold. The log-odds score, a measure of how much more probable the sequence is to be emitted by the given HMM than by the random null model, is routinely used in sequence profile comparison.
where P(X 1 ,...X n | M) is the probability of the sequence X 1 ...X n being emitted by the model M, and P(X 1 ,...X n | Null) is the probability of the sequence being emitted by the null model.
All the sequences which obtain a positive score are considered to belong to family for which the model is built. The significance of this score, in HMMER, is calculated as an "E-value", assuming an extreme-value distribution whose parameters are either calculated during calibration, or from a conserved upper bound [24]. The use of the Evalue is empirical, as the exact nature of the distribution of scores from global alignments is still unknown [24], though it has been shown that HMM score distributions are not an extreme-value distribution [25]. The Pfam database uses curated thresholds as an additional aid to the Evalue: a "trusted" cutoff (TC1) -which is the lowest score of a true positive in the full alignment, a "noise" cut-off (NC1) which is the highest score for a sequence not included in the dataset, and a "gathering" threshold (GA1), which is the threshold that is actually set to collect the sequences in the Pfam Full alignment where TC1>GA1>NC1 [11]. These criteria cannot be uniformly applied to pre-classified positive and negative sequence data, as there maybe negative sequences with higher scores than positive sequences.
For a given threshold score, a sequence from the positive set will be classified as a true positive(TP) or false negative(FN), and one from the negative set as a true nega-tive(TN) or false positive(FP). Using these terms, sensitivity (TP/(TP+FN)) and specificity (TP/(TP+FP)) maybe used to measure the performance of a classifier. Receiver-Operator Characteristic curves (1-specificity v/s sensitivity) [22] show that the discriminating potential of the default HMM profile is inherently high and that the poor specificity of the HMM, generated from positive training sequences, results from the default threshold based on null probabilities (Figure 2). We use the Mathews correlation coefficient to indicate the optimal threshold. Normally used methods for testing the efficacy of discrimination in machine learning methods include cross-validation, where the sample is split into training and test data, and bootstrapping, where the data is randomly sampled multiple times [21]. N-fold cross-validation or jackknifing ranges from "sample-splitting"where the sample is split equally into a test and training set, to "leave-one-out" -where the method is tested iteratively through the sample set, using as the test set a single sequence, and the remainder of the sample as a training set. In order to allow the method to be used in highthroughput analysis, we use 10-fold cross-validation, which lies between the accuracy of bootstrapping, and the speed of sample-splitting. We use the mode of the average MCC distribution as the optimal discrimination threshold. (Figure 3A).

Further improvement in discrimination is provided by using negative sequences to train the HMM
Increased discrimination is provided by modifying emission and transition probabilities in the model by incorporating probabilities derived from negative training sequences directly into the model. Earlier work that used HMM with discriminative training modified model emission probabilities by iteratively aligning negative sequences to the model [15]. This method uses the capabilities of the HMM to both generate the multiple alignment and train the model with positive and negative sequences, a feature which is not available with HMMER, which uses a null model containing position independent probabilities derived from background frequencies of the amino acids. Moreover, multiple alignments generated from HMMs are not as accurate as methods employing scoring matrices -the profiles from Pfam are often handedited, and our use of hmmalign [24] to align negative training data does not generate alignments of the quality as specialized multiple alignment programs working from sequences (data not shown). The advent of fast and accurate multiple alignment programs such as MUSCLE [26] permits the generation of the model parameters by using profile-profile alignments of the positive and negative samples. Analysis of these profiles allows the easy identification of alignment positions capable of increased discrimination, and the modification of model parameters to implement them. Discriminating alignment positions can S P X X M P X X Null n n = log ( ... | ) ( ... | ) 1 1 be identified using relative entropy(RE i ) between the probability distributions of the positive (p) and negative (q) sets for a position i [27,28].
where p i,x and q i,x are the probabilities of the amino acid x at a position i in the positive and negative sets respectively.
In order to use a model independent method of selecting discriminating alignment positions, Z-scores based on the distribution of cumulative relative entropies (CRE i ) for the alignment may be used.
where μ and σ are the mean and standard deviation of the CRE distribution.
The log-odds score is then given by High Z-scores (Z > 3) are associated with specificity determining positions [17], but although this method may work well on classifying sequences at the sub-family level that have been previously classified to the family level, it is insufficient to accurately mine a large database. The use of Z-scores to select variables (alignment positions) involved in maximal discrimination loses information that is shared between the positive and negative sequences, increasing the likelihood of an unrelated sequence that may contain the reduced pattern by chance. This is particularly likely in cases where specificity is conferred by only a few residues, or even a single position [29]. We propose a mixed score -that would discriminate the sequence belonging to the subfamily against a sequence containing the pattern by chance by incorporating the fold components of the profile, and against sequences from other subfamilies, by incorporating information related to the specificity determining residues identified using relative entropy.
where Px Null is the null probability of amino acid x. Other terms are as defined earlier.
This effectively calls for a position dependent null model, that incorporates information from the negative training sequences. Though the Viterbi algorithm uses a log score in aligning a sequence to a profile to prevent underflow errors, this score is calculated from the model emission probabilities. In order to preserve the plan7 architecture used in HMMER, we use a heuristic method that modifies the model emission probabilities to implement this mixed model score.
The mixed model above still does not capture all information available from the false positive sequences. Consider the case where there is an conserved insert in the negative sequences that is absent in the positive sequences. As there is no equivalent emission probabilities in the positive profile (the matching columns in the positive profile HMM would be delete states), this information is lost. A trivial implementation of the log-odds score with known positive and negative sequences maybe made by scoring the sequence against the profiles generated from both positive sequences and negative sequences, and subtracting the negative profile score from the score of the sequence against the positive profile. A B C This log-difference-of-odds-scores (henceforth referred to as HMM-Sub) would provide the maximum discrimination between the positive and negative datasets, but has some caveats discussed below. The comparative impact of these methods is shown in Figure 3 for a randomly selected dataset. It is apparent that the MCC distribution is successively broader and has a higher maximum with each method, corresponding to increased discrimination between the positive and negative datasets.

Validation
Emission probabilities were modified as described in methods. Existing methods for modifying transition probabilities [16] from negative training data sampling [30] to improve the efficiency of HMMER were used without change. As negative training data is significantly larger in size than positive training data, the speed of implementation of HMM-ModE is improved by only selecting false positives from the negative training data, thus limiting its size to those sequences that significantly influence discrimination. The use of profile-profile alignments also permits easy calculation of resultant models, as the match states of columns of the positive and negative training data are aligned.
Protein Kinases provide the kind of challenge sub-family classification demands. Protein Kinases were first classified by Hanks [31] into distinct families that share basic structural and functional properties based on similarity in catalytic domain amino acid sequence, and more recently have been classified into 12 fold groups based on structural fold similarity. Each of the fold groups is further classified into families which can be distinguished from one another by representative HMMs [23]. Within the family, finer functional classification is often not possible due to the large proportion of shared fold signals. One such instance is the AGC family of Serine/Threonine Protein Kinases (Figure 4). The AGC family contains Protein Kinases such as cAMP-dependent Protein Kinase (PKA), Protein Kinase C, Protein Kinases related to PKA and PKC (RAC), G protein-coupled receptor kinase (GRK), ribosomal S6 PK, and the PVPK1 Protein Kinase homologs in plants [31]. The proteins all share a two-lobed structure and high level of sequence similarity, yet have different substrate specificity [32]. The results of the application of the above methods on this dataset is reported in Table 1. GRK has an insert relative to the other sequences, which is sufficient for HMMER with a suitable cut-off to improve its specificity. In all the other sub-families, the distribution of scores for positive and negative sequences overlap allowing an assessment of the discrimination capabilities of the methods described in this paper. In general, there is an increase in specificity using the HMM-ModE protocol, albeit with a reduction in sensitivity from the hmmer score with a default threshold. In the case of S6PK, the reduced sensitivity is due to three sequences in the dataset being shorter than the rest. As the log-odds-score increases with sequence length, this is an expected development, and must be used as a caveat for the general application of the method.
G-Protein coupled receptors, which play a key role in cellsignaling network that control an array of physiological processes [33] have also been classified into sub-families on the basis of their substrate specificity [20]. These proteins are characterised by the conservation of seven transmembrane regions, the selection criteria being hydrophobic residues. Sequences from one sub-family often have higher sequence similarity with members of other families than within the sub-family. This dataset has also been the focus of the application of the SVM as a discriminator [19], and hence is interesting as it provides a comparison to the methods detailed above. Karchin et al have compared the relative performances of SVM, BLAST and HMMs for the classification of GPCR sub-families that bind to a specific ligand, defined by them as "level-2" sub-families ( Figure 5 and 6). They calculate coverage (which is the percentage of True Positives selected before the first False Positive error) and the errors per sequence at the Minimal Error Point (MEP) as the parameters for evaluating the different methods, each of which could work best at different score thresholds. The former is indicative of the sensitivity of a discriminating method whereas the latter, since it is a total of both the False Positive as well as False Negative errors, indicates both sensitivity and specificity. These statistics are calculated by sweeping a threshold over the E-values combined from all the sub-families. The coverage values reported for SVM, BLAST and HMM are 65%, 13.3% and 5% respectively. Our HMM-d has a coverage of 13% which is comparable to the values reported for BLAST and HMM. The coverage of HMM-ModE (27%) is better than that of HMM-d. On the other hand, the 18% error at the MEP obtained for HMM-ModE is comparable to the 13.7% reported for SVM but lower than those reported for BLAST and HMM (25.5% and 30% respectively) or 21% obtained for HMM-d. However, we note that the average coverage and errors at MEP are calculated after combining and normalising results of different sub-families using the E-values. The results for HMM-ModE should be better than those observed using these parameters as sequence classification is based on the threshold score and not the E-value. A better comparison to the SVM results above, which also uses a discriminant score and not a significance value, would be to average sub-family values of these parameters. HMM-ModE then returns values of 96% coverage, and 6% error rate at MEP ( Table 2).
The relatively poor results for Octopamine is due to the fact that there are only twelve sequences in the dataset, and they have higher sequence similarity with the serotonin sub-class than with each other ( Table 2). Since our choice of threshold is optimised for specificity, there is a sharp fall in sensitivity. The HMM-ModE profile provides an improvement in the coverage values for 11 of the subfamilies. Only in one case (OlfactoryII family6) is the coverage for HMM-ModE worse than HMM-d.

Classification of kinases
To test the above protocol in a high-throughput annotation case study, we applied the method to classify protein kinase sequences at a functional level. Protein Kinase sequences have been classified by Cheek et al [23] into fold groups on the basis of structural similarity and further into families of homologous sequences. Each family is made up of sub-families denoted by EC numbers. We constructed function-specific sub-family profiles using sequences from the ENZYME [34] database as a training set. The protein S/T -Y/atypical kinase/lipid kinase/ATPgrasp fold group contains enzymatic functions belonging to 36 different EC numbers, of which 19 EC numbers have 3 or more sequences available (Figure 7). We could populate the training set by mining databases for annotated sequences that fitted the description of the class, but used only specified sequences in this validation. Two generic activities, namely "protein kinase" and "protein tyrosine kinase" (EC numbers 2.7.1.37 and 2.7.1.112 respectively) were not included in the analysis.
The application of the method in high-throughput analysis is instructive. A careful perusal of the sequences classified showed that very few sequences with annotations outside of the sub-family were scored using HMM-ModE and HMM-t, commensurate with their expectation of high sensitivity and specificity. Rhodopsin and Beta-adrenergic receptor kinases are sub-families belonging to the G-protein-coupled receptor kinase 1 family which bind different substrates. The HMM-d profile for the Rhodopsin kinase sub-family (EC number 2.7.1.125) selected 18 sequences annotated as beta-adrenergic receptors from a database of 56,144 protein kinase sequences previously classified by Cheek et al [23].  The relatively high specificity of the HMM-ModE profiles and HMM-t provides a greater confidence with which to annotate unknown, hypothetical, putative or unnamed sequences. Table 3 shows the number of such sequences which have been annotated by our protocol.
HMM-sub provides inconsistent results when used directly on a database of generic sequences. For protein families where the division of proteins into functional sub-types can be accomplished by phylogeny, this method would work well, as the specificity determining columns would then contain mutually exclusive amino acids in the different sub-families, and maximum discrimination would be provided by the application of this method. However at the level of classification we target the application of these methods, proteins usually have multiple features, not necessarily dependent on its molecular function that co-evolve. Examples include variations in sub-cellular location -membrane-bound or cytosolic, differing affinities for more than one substrate, or the interaction with other proteins that differ across paralogs. In addition, by effectively removing all features shared with the family, the method has a high chance of picking up false positives which may contain features unrelated to the molecular function.
Protein families whose members have convergently evolved is a case which will result in the failure of the methods described here. Though the catalytic activity of these proteins is the same, and the amino acids that confer this specificity would be similarly conserved in space, their arrangement in the sequence would be dependent on the scaffold of the protein fold. Although this case is not present in any of the case studies presented in this paper, we guard against this eventuality by first doing a phylogenetic clustering of the sequences from the subfamily. In the case of multiple folds, separate fold-based profiles maybe used. The clustering is also helpful in the case of very large training datasets, to allow sampling representative of the complete dataset to generate the multiple alignment.

Conclusion
We have implemented a protocol to classify protein sequences based on profile HMMs. This protocol maximises the discrimination of the sequence belonging to the subfamily against a sequence containing the pattern by chance by incorporating the fold components of the profile, and against sequences from other subfamilies, by incorporating information related to the specificity determining residues identified using relative entropy. Although essentially implementing ideas suggested by Mamitsuka [15], Hannehalli and Russell [17], Wistrand and Sonnhammer [16], and Brown et al [13], this protocol is faster in training, as only negative sequences that are The values in parentheses indicate the standard deviation for all 10 samples in a 10-fold cross validation. HMM-d -HMM profile used with default threshold HMM-t -HMM profile used with optimised threshold HMM-ModE -profile with modified emission probabilities HMM-Sub -Log-difference-of-odds-score method selected by the sub-family HMM as false positives, are used in modifying model parameters, and optimising the discrimination threshold. The use of HMMER for searching and scoring sequence databases remains unchanged except for our use of recent modifications to the model that aid improved discrimination. The hmmsearch Evalue is no longer applicable as some of the model probabilities are modified to reflect information from negative training sequences, though the null probability used by the program to calculate the E-value still remains that of the original model as it is common across all states in the model.
The availability of accurately pre-classified protein sequences is an important starting point for classification based on function. Datasets classified on the basis of folds, such as the kinase set used in this work, additionally provide an opportunity for finer classification based on more specific function. Increased use of methods such as described in this manuscript, with a high prediction accuracy, will provide confidence in functional annotation of protein sequences which are generated from highthroughput genome sequence projects, a large proportion of which are of not experimentally characterised.
Given the above results, we recommend that if the coverage is 1 (i.e the highest false positive score is less than the lowest true positive score) the threshold as specified in Pfam may be used. If the coverage is less, then the threshold calculated using 10-fold cross-validation as described in this paper, is the optimal discriminating threshold for the given dataset, and can be used in place of the "gathering" threshold for sub-family classification. Further discrimination is possible if there are sufficient false positive sequences to build a profile.

Datasets used in the study
The sequences belonging to the 6 sub-families of the AGC protein kinase family, each with different ligand specificities, were kindly provided by the authors of [35]. These sequences included 66 sequences of cyclic nucleotide regulated Protein Kinases (PKA), 135 sequences of Diacylglycerol-activated/phospholipid-dependent protein kinase C (PKC), 23 sequences of RAC/Akt protein kinases, related to PKA and PKC, (RAC), 58 sequences of G protein-coupled receptor kinases (GRK), 40 sequences of protein kinases that phosphorylate the ribosomal protein S6 family (S6PK) and 48 sequences of the flowering plant protein kinase homolog family (PVPK1) (Figure 4).
Protein sequences belonging to level-2 sub-families of the Class A ( Figure 5) and Class C GPCR ( Figure 6) proteins were kindly provided by the authors of [19]. The sequences are a part of two mutually exclusive sets called set0 and set1 which are used in a two-fold cross validation An outline of some Level 1 and Level 2 subfamilies of the GPCR Class C proteins Figure 6 An outline of some Level 1 and Level 2 subfamilies of the GPCR Class C proteins. The level-2 sub-families used in this study are marked in bold.

Level 1 Subfamilies Level 2 Subfamilies
Class C

and the average percentage of errors per sequence at the MEP of HMM-d and HMM-ModE for classification of Level-2 sub-families of Class A and Class C GPCR proteins
exercise. These sets were manually inspected and were tivity, specificity and MCC (Matthews Correlation Coefficient) [22] distributions for 10 sets of each sample were obtained. An average MCC distribution for the 10 sets for a particular sample was plotted as a function of the scores. The optimal discrimination threshold was identified as the mid point corresponding to the mode valueof the average MCC distribution. The sequences in the each test sample for each sub-family were scored with HMMER and HMM-ModE profiles generated from the entire training sample using the optimal threshold score. The complete protocol is automated using PERL scripts along with the datasets are available from [37].

Modifying the Emission probabilities of the HMMs (HMM-ModE)
The false positive (FP) set of sequences were aligned using MUSCLE. Profile alignment of the subfamily profile and the FP profile was performed using the -profile option of MUSCLE. HMMs for the aligned sub-family and FP alignments were built as described above. For nullifying the non-discriminating fold-specific emission probabilities the match emission probabilities of the true positive profile were modified with the emission probabilities of the FP profile using the equations described in the HMMER user's guide [24]. Specifically, the score in the HMM is modified using the condition where Intscale = 1000 Probabilities can be derived from the scores using The modification steps and the generation of the new HMM in the HMMER format was accomplished using PERL scripts.

Authors' contributions
PKS and DKD wrote programs for the modification of emission probabilities, SN worked on the cross-validation and on modularising the final code, AML guided the implementation of the work. All the authors jointly wrote the manuscript.