Predicting conserved protein motifs with Sub-HMMs
© Horan et al. 2010
Received: 17 December 2009
Accepted: 26 April 2010
Published: 26 April 2010
Skip to main content
© Horan et al. 2010
Received: 17 December 2009
Accepted: 26 April 2010
Published: 26 April 2010
Profile HMMs (hidden Markov models) provide effective methods for modeling the conserved regions of protein families. A limitation of the resulting domain models is the difficulty to pinpoint their much shorter functional sub-features, such as catalytically relevant sequence motifs in enzymes or ligand binding signatures of receptor proteins.
To identify these conserved motifs efficiently, we propose a method for extracting the most information-rich regions in protein families from their profile HMMs. The method was used here to predict a comprehensive set of sub-HMMs from the Pfam domain database. Cross-validations with the PROSITE and CSA databases confirmed the efficiency of the method in predicting most of the known functionally relevant motifs and residues. At the same time, 46,768 novel conserved regions could be predicted. The data set also allowed us to link at least 461 Pfam domains of known and unknown function by their common sub-HMMs. Finally, the sub-HMM method showed very promising results as an alternative search method for identifying proteins that share only short sequence similarities.
Sub-HMMs extend the application spectrum of profile HMMs to motif discovery. Their most interesting utility is the identification of the functionally relevant residues in proteins of known and unknown function. Additionally, sub-HMMs can be used for highly localized sequence similarity searches that focus on shorter conserved features rather than entire domains or global similarities. The motif data generated by this study is a valuable knowledge resource for characterizing protein functions in the future.
The identification of functionally relevant features in protein sequences is an important task for gaining insight into their molecular and biological activities. Commonly used feature classifications systems focus on protein regions of different lengths ranging from single residues in active site representations and relatively short sequence motifs to much longer protein domains. The identification of these functional modules is often of immediate importance for guiding molecular and evolutionary studies of genes and genomes, such as experimental or computational discoveries of drug targets, catalytic residues and ligand binding sites [1–6]. Due to the greater evolutionary constraints, the functionally important regions in proteins tend to be more conserved among related sequences than their less relevant regions. As a result of this basic similarity-function principle, one can predict the functional features in proteins relatively reliably by identifying their conserved regions [7, 8]. The same information is often useful to predict differences of the catalytic and substrate specificities within subgroups of protein families by identifying their specificity determining residues [9, 10].
Profile hidden Markov models (profile HMMs) provide the basis of very efficient approaches for modeling longer conserved regions in protein families, which are referred to as protein domains [11–14]. These domain models usually co-align well with longer functional and structural units of proteins, such as protein folds [15, 16]. The genome regions coding for protein domains, rather than entire genes, are often considered the functional base units of protein evolution. Because domain models are relatively complex by covering longer conserved sequence areas, the identification of essential sub-features within protein domains can greatly facilitate their functional characterization. Well known examples are the conserved protein motifs from the PROSITE database [17, 18]. These much shorter patterns frequently map to residues within protein domains that are directly involved in the core functions of many proteins, such as the coordination of the catalytic centers of enzymes. The most specific and functionally insightful information about known or predicted active sites is provided by protein structure-based resources, such as the Catalytic Site Atlas (CSA), CASTp, ActSitePred, ConSurf and PDBSite [7, 19–23]. The utility spectrum of these structure-based resources is typically restricted to proteins that share sequence similarity with proteins of known 3D structure. This requirement of structure information makes these methods less suited for functional site predictions of many membrane proteins or other difficult to crystallize protein classes. Thus, it is important to develop additional tools that can be used for the prediction of functionally relevant features of all protein classes. Conservation analyses are widely used alternatives for this purpose [8, 24–26]. Typically, these methods aim to identify conserved residues in multiple sequence alignments of related proteins. Based on the above principle, these conserved sites tend to be functionally more important than more variable ones. More recently developed approaches incorporate additional information with conservation data, such as secondary structure predictions, solvent accessibility data and other parameters [27, 28]. In addition, Mistry et al.  have developed a set of strict rules that allows the transfer of experimentally validated active site information to other sequences within the same enzyme family. A disadvantage of most conserved residue approaches is the difficulty of using their data sets without major modifications for search applications in order to identify novel proteins containing these features. The more information rich motif and domain models are usually more effective in this regard. This is also facilitated by the availability of many efficient motif or domain search algorithms in this area.
Much of the information available in conserved sequence databases is the direct result of mining the available protein space with existing feature prediction tools. This includes very established databases on protein motif or domain information, such as PROSITE, InterPro and Pfam [2, 4, 18]. However, the annotation and curation process of the conserved features provided by these databases is still a very time consuming and largely manual curation processes by many experts in the field. Therefore, the development of additional functional feature prediction methods, that can facilitate the automation of various steps in this laborious annotation process, will be of great importance for the field.
Here we propose an automated method for identifying conserved protein motifs by creating sub-HMMs from custom or existing profile HMM data sets, such as Pfam. The method builds on existing profile HMM domain models and expands their utility spectrum to motif discovery. The approach has many applications for studying protein functions. First, it is useful for predicting the most highly conserved and functionally relevant sequence motifs in protein families. Second, it provides an effective alternative for profile-based similarity searches to detect sequences with short similarities in any order. Finally, it can be used for the characterization of domains of unknown function by associating them with sub-HMMs from functionally characterized domains.
The most closely related method for modeling protein families by a fragment-based approach was proposed by Plotz and Fink . Their goal was to minimize the number of parameters used by the model in order to improve its performance on small training sets. To achieve this, the authors started with a signal-like protein sequence representation  and trained a new model on this data set. Their model consisted of Sub-Protein Units (SPUs) that were concatenated in an order learned from the data set. Each SPU of this method is an HMM by itself. In contrast to this, our method uses pre-calculated profile HMMs to discover functionally relevant motifs in protein domains. In addition, our method has the ability to allow any combination of sub-HMMs to occur in any order. Another related method is Meta-MEME . This method also minimizes the number of model parameters. It accomplishes this by concatenating short PSSMs (Position Specific Scoring Matrix) instead of HMMs, which are generated by its sister tool MEME . This approach is similar to the BLOCKMAKER program , which also models conserved regions with un-gapped PSSMs. Our method differs from these approaches significantly by retaining full HMMs of the most highly conserved sub-regions within protein domain families. This allows us to model more complex consensus regions containing gaps. The method developed by Sun and Buhler  attempts to speed up searching with profile HMMs by extracting un-gapped subsections (blocks) of HMMs and then modifying the match distributions in each position to make each block as sensitive as possible. These blocks are then used as pre-filters to eliminate sequences which would not match the whole HMM well.
Our proposed protein sub-HMM method starts with a profile HMM that has been trained on the multiple sequence alignment of a protein family. We then extract the most conserved sub-HMMs from the original HMM. A robust scoring method is used to predict the presence of the sub-HMMs in any protein sequence of interest. The HMMs required for this approach can be easily generated from unaligned protein sequences of interest by aligning them with a multiple sequence alignment program and then generating an HMM for them with tools like HMMER [14, 35, 36] or SAM . Alternatively, one can use existing protein family HMMs from databases like Pfam . The latter approach is taken in this paper for benchmarking the proposed protein sub-HMM method.
In this equation, P back(S) is the probability of S, assuming each amino acid has been drawn independently from the background distribution, while P HMM(S, Z) is the probability that the HMM would generate the state sequence Z and the observed sequence S. A positive score means that S is more likely to be derived from the HMM than randomly generated from the background distribution. A more detailed description of profile HMMs can be found in .
Sub-HMMs were extracted from Pfam domain families using HMMER2 and HMMER3 models . Pfam 22.0 was used for all experiments, whereas Pfam 24.0 was mainly used in the performance comparisons with HMMER3. This is because Pfam has adopted HMMER3 models only very recently, and at this point many of its families have not been as rigorously tested and curated by experts in the field as in the earlier HMMER2-based releases.
Proteins in Pfam database
Domains in Pfam database
Pfam domains of known function
Pfam domains of unknown function
Sub-HMMs excised from Pfam domains
Sub-HMMs excised from DKFs
Sub-HMMs excised from DUFs
Subsequently, we performed several benchmark tests to determine the performance of the new sub-HMM method in identifying functionally relevant sequence features and searching for sequences sharing them. For this, we determined the presence of each Pfam HMM and our sub-HMMs in all protein sequences from the Pfam database by applying the scoring system described in the Materials section. We found that the processing time of our method is comparable to HMMER2. The slightly better time performance of our method by a factor 1.4 is most likely due to the lower complexity of its sub-HMM models. The sub-HMM method showed comparable time improvements when using it with the HMMER3 software.
Next, we determined how well the sub-HMM method performed in identifying known motifs that are likely to be of functional relevance. This was addressed by comparing the extracted sub-HMMs from the Pfam 22.0 database with the hand curated conserved protein motifs from the PROSITE database. If the sub-HMMs are enriched in functionally relevant candidates, then one would expect a high degree of overlap with the motifs from the PROSITE database. This should be the case because the PROSITE motifs are derived from a comparable protein knowledge space as the sub-HMMs generated by this study. The overlaps were determined by comparing the matching positions of the two fragment data sets in their corresponding protein family sequences. For counting overlaps, we used relatively conservative filtering criteria: the two fragment models had to have 50% of their matching protein sequences in common and the overlaps had to occur in least 95% of the common protein members. In addition, we consider a sub-HMM to match only if it has a score of 0 or higher. Furthermore, we compute the probability of this event happening by chance and require that it be less than 0.01.
A similar test was performed for the catalytic residue annotations from the Catalytic Site Atlas (CSA) . This is a database of active site residues from enzymes represented in the Protein Data Bank (PDB). Due to their functional importance, most of these residues are highly conserved within protein families. In our tests, we considered only those sites which are supported by the literature and also mapped to protein domain regions in the Pfam data set. This left us with 4147 sites mapping to 642 proteins. Subsequently, we counted how many sub-HMMs overlapped with these sites and found that 847 sub-HMMs overlapped with CSA residues. These corresponded to 2903 active sites from 546 proteins. Thus, our sub-HMM data set contained 70% of these active sites. The probability of observing ≥2903 overlaps among the two data sets just by chance is < 1.5 * 10-18. The complete result set of this analysis is available in Additional file 2:csa-comp.
The considerable agreement of our method with the PROSITE and CSA data sets indicates that the sub-HMM method is efficient in identifying many of the known functionally important residues in protein families. Therefore, it is reasonable to assume that the novel conserved regions, identified by this study, are a useful resource for characterizing the functional hotspots in protein sequences of known or unknown function in the future.
We also performed more rigorous comparisons of our method against HMMER2, HMMER3, SAM and PSI-BLAST . Additionally, we tested our sub-HMM method with HMMER3 profile HMMs. In this case the sub-HMMs where excised from HMMER3 models and the HMMER3 search tool was used to map and score the individual sub-HMMs to the sequences. We then combined the scores as described in the Methods section. In the following text of this section, the sub-HMM experiments performed with HMMER2 and HMMER3 are referred to sub-HMM-HMMER2 and sub-HMM-HMMER3, respectively. In all tests we trained the models ourselves by randomly selecting 20% of the members from each protein family, but the training data were not included in the test data sets. HMMER2, HMMER3 and SAM use a multiple sequence alignment for the model building step. Since it was not our goal to test the alignment quality, we used the curated domain alignments provided by Pfam as input to all methods. Although SAM can create its own alignments, we forced it to use the alignments we provided to make this method more comparable to HMMER2 and HMMER3. For PSI-BLAST, we first created multiple sequence alignments for all the training data sets using CLUSTALW. Subsequently, we built PSSMs to search the test data set with PSI-BLAST. For all methods, we compared how well they could recover the remaining 80% in each protein family from the combined set of all test sequences. Due to computational resource constraints, it was not possible to test these methods on all Pfam families. Instead we created two smaller subsets of families, one composed of smaller families and one composed of larger families. The small family set contained 933 families randomly selected from Pfam 22.0 with of 10 to 100 members, while the large set contained 1002 families with more than 100 members. In addition, we tested the different methods on the HMMER3-based Pfam 24.0 data set. To maximize the comparability of the results, we selected only families that were available in both Pfam releases and fell into the same size categories. For the small set, we found 899 families in Pfam 24.0 but only 491 of them had less than 100 members. For the large set, 988 families were also available in Pfam 24.0 and all of them contained more than 100 members.
Since our method is designed to find short sequence similarities, it is expected to have a lower selectivity (higher false positive rate) than the other methods when reassembling family relationships that are based on longer domain similarities. In fact, such a performance characteristics on known family data sets is required for discovering novel conserved fragments in sequences that do not necessarily belong to the same domain family. The latter is the main utility feature of the sub-HMM method.
Matches Among DUFs and DKFs.
sub-DKF → DKF
sub-DKF → DUF
sub-DUF → DKF
sub-DUF → DUF
The large number of sub-HMMs matching different Pfam domains indicates the usefulness of our sub-HMM approach for discovering short sequence features that are conserved among different protein domains. Due to their high conservation, an important functional role for many of these features can be expected. Many of the sub-DKFs will be useful for assigning potential functions to DUFs. A much more comprehensive study on applying our sub-HMM approach to biologically relevant questions will be published in an experimental journal.
We have developed a simple but effective method for identifying the most highly conserved residues in protein sequences in a fully automated manner. Its design strategy is highly practical and versatile by making efficient use of a well-established bioinformatic infrastructure, such as existing domain databases and profile HMM search tools. In addition, the conserved patterns, identified by this study, are useful for characterizing proteins of unknown function by associating them with those of known function by their common sub-HMMs. Furthermore, the sub-HMM search method appears to be a very effective tool for finding sequences that share only very short sequence similarities with a sensitivity performance similar to HMMER2. The possibility to ignore the order of different sub-HMM matches in sequences is another advantage, which will allow the identification of more complex similarity arrangements among otherwise unrelated sequences.
Once the consecutive regions of match states are identified from the original profile HMM, we convert each of them into a sub-HMM. Each sub-HMM has the same structure, transition probabilities, and observation distributions as the corresponding segments in the source HMM. As the original HMMs, the sub-HMMs begin and end with looped insertion states. Typically, a sub-HMM obtained from this process is identical to a profile HMM trained on the corresponding region of a multiple alignment that was used for generating the original profile HMM.
Here score y (S) is the score from Equation 1 for HMM y. The term |Y| log M arises from the uniform distribution over positions at which any sub-HMM might begin. We implemented our method in Java and used code from HMMEditor  to run the Viterbi algorithm. This score can be computed in time linear in M and the combined lengths of the sub-HMMs.
In the ROC performance tests, we scored sequences using sub-HMMs grouped by the Pfam families they were excised from. For all other tests, we scored individual sub-HMMs by using score y (S) as the final score.
In equation (9), we replace the sum from the previous equation with the sum over the possible sizes of R. For each size, the binomial term gives the number of sets of size k, and the last term gives the probability of a set of size k. However, this bound is often too loose in practice. This is because for large values of p ij*, the last term in equation (7) makes that term very small, whereas the corresponding term in our bound would still be large. Therefore, we adopt a method of removing extreme outliers to obtain a tighter bound.
where n' is the number of elements remaining in the intersection after the outliers have been removed. More details about this method are provided in Additional file 3:prosite_scoring.pdf.
We use the Hoeffding bound  to upper-bound the likelihood of finding a certain number of PROSITE or CSA overlaps with our sub-HMM data set by chance (that is, if the sub-HMM data set had instead been chosen at random). The Hoeffding bound states that if the random chance of any single test matching is p, then the probability of m or more matches in M tests is less than where .
For the PROSITE comparisons, matches are only considered if the prior probability is less than 0.01, therefore, p = 0.01. We found m = 1,055 overlaps out of a total set of M = 48, 535 sub-HMMs. This yields a p-value (by the Hoeffding bound) of less than 1.6 * 10-6 for the probability of our sub-HMMs matching PROSITE models at this level by chance.
For the CSA comparison, each site is only a single amino acid. We restrict the comparisons to only those sequences containing annotated CSA sites. There are M = 95, 076 amino acids matching our sub-HMMs, of which m = 2, 903 are annotated by CSA. There are a total of 261, 857 amino acids, of which 4, 147 are annotated by CSA. Therefore, , and we obtain (again with the Hoeffding bound), a p-value of less than 1.5 * 10-18 for the probability of our sub-HMMs overlapping these CSA-annotated amino acids by chance.
For the PSI-BLAST tests, the training sets were aligned with CLUSTALW  and then a PSSM was generated using blastpgp with just one round of searching. The test data was then scored by blastpgp using the trained PSSM as a starting point and running for up to 6 rounds. For each sequence, we recorded the maximum log-odds score from all the rounds. For the SAM tests, we extracted the aligned training data from the Pfam database and used them to train the models, forcing SAM to use the given alignments rather than create its own. These models were then used to classify the test data. In the case of HMMER2 and HMMER3, we trained models with hmmbuild and hmmcalibrate (HMMER2 only) using the same alignments as for the SAM tests. In all cases, HMMER2 tests were performed with HMMER2 models and HMMER3 tests with HMMER3 models. We then used these models to classify the test data with hmmsearch. If multiple domains were found in one sequence, the result from the best scoring one was used.
For the sub-HMM method, we used the aligned training data to build HMMER2 and HMMER3 models, and then extracted sub-HMMs from them. We then used our hmmsearch implementation to score each sequence according to our model. For all tests, the training sets consisted of a random selection of 20% of the sequences from each Pfam family, while the test database contained the union of the remaining sequences. The ROC curves where computed with the ROCR library  using the concatenation of all the scores for each method. Log-odds scores were used for all methods to obtain comparable results. In the case of SAM, we used reverse log-odds scores .
The sub-HMM software developed by this project is available for free download from our web page: http://subhmm.ucr.edu. The site also contains download options of the complete set of extracted sub-HMMs and data for the Pfam network analysis, as well as a searchable web interface.
We thank the community projects - Pfam, HMMER, SAM and R - for providing the excellent software and data resources that were used by this study. TG acknowledges support from the Bioinformatics Core Facility, the Center for Plant Cell Biology (CEPCEB) and the Institute for Integrative Genome Biology (IIGB) at UC Riverside.
This work was supported by the National Science Foundation grant numbers 2010-0420152 and 2010-0820842.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.