Predicting conserved protein motifs with Sub-HMMs
© Horan et al; licensee BioMed Central Ltd. 2010
Received: 17 December 2009
Accepted: 26 April 2010
Published: 26 April 2010
Profile HMMs (hidden Markov models) provide effective methods for modeling the conserved regions of protein families. A limitation of the resulting domain models is the difficulty to pinpoint their much shorter functional sub-features, such as catalytically relevant sequence motifs in enzymes or ligand binding signatures of receptor proteins.
To identify these conserved motifs efficiently, we propose a method for extracting the most information-rich regions in protein families from their profile HMMs. The method was used here to predict a comprehensive set of sub-HMMs from the Pfam domain database. Cross-validations with the PROSITE and CSA databases confirmed the efficiency of the method in predicting most of the known functionally relevant motifs and residues. At the same time, 46,768 novel conserved regions could be predicted. The data set also allowed us to link at least 461 Pfam domains of known and unknown function by their common sub-HMMs. Finally, the sub-HMM method showed very promising results as an alternative search method for identifying proteins that share only short sequence similarities.
Sub-HMMs extend the application spectrum of profile HMMs to motif discovery. Their most interesting utility is the identification of the functionally relevant residues in proteins of known and unknown function. Additionally, sub-HMMs can be used for highly localized sequence similarity searches that focus on shorter conserved features rather than entire domains or global similarities. The motif data generated by this study is a valuable knowledge resource for characterizing protein functions in the future.
The identification of functionally relevant features in protein sequences is an important task for gaining insight into their molecular and biological activities. Commonly used feature classifications systems focus on protein regions of different lengths ranging from single residues in active site representations and relatively short sequence motifs to much longer protein domains. The identification of these functional modules is often of immediate importance for guiding molecular and evolutionary studies of genes and genomes, such as experimental or computational discoveries of drug targets, catalytic residues and ligand binding sites [1–6]. Due to the greater evolutionary constraints, the functionally important regions in proteins tend to be more conserved among related sequences than their less relevant regions. As a result of this basic similarity-function principle, one can predict the functional features in proteins relatively reliably by identifying their conserved regions [7, 8]. The same information is often useful to predict differences of the catalytic and substrate specificities within subgroups of protein families by identifying their specificity determining residues [9, 10].
Profile hidden Markov models (profile HMMs) provide the basis of very efficient approaches for modeling longer conserved regions in protein families, which are referred to as protein domains [11–14]. These domain models usually co-align well with longer functional and structural units of proteins, such as protein folds [15, 16]. The genome regions coding for protein domains, rather than entire genes, are often considered the functional base units of protein evolution. Because domain models are relatively complex by covering longer conserved sequence areas, the identification of essential sub-features within protein domains can greatly facilitate their functional characterization. Well known examples are the conserved protein motifs from the PROSITE database [17, 18]. These much shorter patterns frequently map to residues within protein domains that are directly involved in the core functions of many proteins, such as the coordination of the catalytic centers of enzymes. The most specific and functionally insightful information about known or predicted active sites is provided by protein structure-based resources, such as the Catalytic Site Atlas (CSA), CASTp, ActSitePred, ConSurf and PDBSite [7, 19–23]. The utility spectrum of these structure-based resources is typically restricted to proteins that share sequence similarity with proteins of known 3D structure. This requirement of structure information makes these methods less suited for functional site predictions of many membrane proteins or other difficult to crystallize protein classes. Thus, it is important to develop additional tools that can be used for the prediction of functionally relevant features of all protein classes. Conservation analyses are widely used alternatives for this purpose [8, 24–26]. Typically, these methods aim to identify conserved residues in multiple sequence alignments of related proteins. Based on the above principle, these conserved sites tend to be functionally more important than more variable ones. More recently developed approaches incorporate additional information with conservation data, such as secondary structure predictions, solvent accessibility data and other parameters [27, 28]. In addition, Mistry et al.  have developed a set of strict rules that allows the transfer of experimentally validated active site information to other sequences within the same enzyme family. A disadvantage of most conserved residue approaches is the difficulty of using their data sets without major modifications for search applications in order to identify novel proteins containing these features. The more information rich motif and domain models are usually more effective in this regard. This is also facilitated by the availability of many efficient motif or domain search algorithms in this area.
Much of the information available in conserved sequence databases is the direct result of mining the available protein space with existing feature prediction tools. This includes very established databases on protein motif or domain information, such as PROSITE, InterPro and Pfam [2, 4, 18]. However, the annotation and curation process of the conserved features provided by these databases is still a very time consuming and largely manual curation processes by many experts in the field. Therefore, the development of additional functional feature prediction methods, that can facilitate the automation of various steps in this laborious annotation process, will be of great importance for the field.
Here we propose an automated method for identifying conserved protein motifs by creating sub-HMMs from custom or existing profile HMM data sets, such as Pfam. The method builds on existing profile HMM domain models and expands their utility spectrum to motif discovery. The approach has many applications for studying protein functions. First, it is useful for predicting the most highly conserved and functionally relevant sequence motifs in protein families. Second, it provides an effective alternative for profile-based similarity searches to detect sequences with short similarities in any order. Finally, it can be used for the characterization of domains of unknown function by associating them with sub-HMMs from functionally characterized domains.
The most closely related method for modeling protein families by a fragment-based approach was proposed by Plotz and Fink . Their goal was to minimize the number of parameters used by the model in order to improve its performance on small training sets. To achieve this, the authors started with a signal-like protein sequence representation  and trained a new model on this data set. Their model consisted of Sub-Protein Units (SPUs) that were concatenated in an order learned from the data set. Each SPU of this method is an HMM by itself. In contrast to this, our method uses pre-calculated profile HMMs to discover functionally relevant motifs in protein domains. In addition, our method has the ability to allow any combination of sub-HMMs to occur in any order. Another related method is Meta-MEME . This method also minimizes the number of model parameters. It accomplishes this by concatenating short PSSMs (Position Specific Scoring Matrix) instead of HMMs, which are generated by its sister tool MEME . This approach is similar to the BLOCKMAKER program , which also models conserved regions with un-gapped PSSMs. Our method differs from these approaches significantly by retaining full HMMs of the most highly conserved sub-regions within protein domain families. This allows us to model more complex consensus regions containing gaps. The method developed by Sun and Buhler  attempts to speed up searching with profile HMMs by extracting un-gapped subsections (blocks) of HMMs and then modifying the match distributions in each position to make each block as sensitive as possible. These blocks are then used as pre-filters to eliminate sequences which would not match the whole HMM well.
Our proposed protein sub-HMM method starts with a profile HMM that has been trained on the multiple sequence alignment of a protein family. We then extract the most conserved sub-HMMs from the original HMM. A robust scoring method is used to predict the presence of the sub-HMMs in any protein sequence of interest. The HMMs required for this approach can be easily generated from unaligned protein sequences of interest by aligning them with a multiple sequence alignment program and then generating an HMM for them with tools like HMMER [14, 35, 36] or SAM . Alternatively, one can use existing protein family HMMs from databases like Pfam . The latter approach is taken in this paper for benchmarking the proposed protein sub-HMM method.
Results and Discussion
In this equation, Pback(S) is the probability of S, assuming each amino acid has been drawn independently from the background distribution, while PHMM(S, Z) is the probability that the HMM would generate the state sequence Z and the observed sequence S. A positive score means that S is more likely to be derived from the HMM than randomly generated from the background distribution. A more detailed description of profile HMMs can be found in .
Extraction of Sub-HMMs
Properties of Sub-HMMs
Sub-HMMs were extracted from Pfam domain families using HMMER2 and HMMER3 models . Pfam 22.0 was used for all experiments, whereas Pfam 24.0 was mainly used in the performance comparisons with HMMER3. This is because Pfam has adopted HMMER3 models only very recently, and at this point many of its families have not been as rigorously tested and curated by experts in the field as in the earlier HMMER2-based releases.
Proteins in Pfam database
Domains in Pfam database
Pfam domains of known function
Pfam domains of unknown function
Sub-HMMs excised from Pfam domains
Sub-HMMs excised from DKFs
Sub-HMMs excised from DUFs
Subsequently, we performed several benchmark tests to determine the performance of the new sub-HMM method in identifying functionally relevant sequence features and searching for sequences sharing them. For this, we determined the presence of each Pfam HMM and our sub-HMMs in all protein sequences from the Pfam database by applying the scoring system described in the Materials section. We found that the processing time of our method is comparable to HMMER2. The slightly better time performance of our method by a factor 1.4 is most likely due to the lower complexity of its sub-HMM models. The sub-HMM method showed comparable time improvements when using it with the HMMER3 software.
Cross-Validation with PROSITE and CSA
Next, we determined how well the sub-HMM method performed in identifying known motifs that are likely to be of functional relevance. This was addressed by comparing the extracted sub-HMMs from the Pfam 22.0 database with the hand curated conserved protein motifs from the PROSITE database. If the sub-HMMs are enriched in functionally relevant candidates, then one would expect a high degree of overlap with the motifs from the PROSITE database. This should be the case because the PROSITE motifs are derived from a comparable protein knowledge space as the sub-HMMs generated by this study. The overlaps were determined by comparing the matching positions of the two fragment data sets in their corresponding protein family sequences. For counting overlaps, we used relatively conservative filtering criteria: the two fragment models had to have 50% of their matching protein sequences in common and the overlaps had to occur in least 95% of the common protein members. In addition, we consider a sub-HMM to match only if it has a score of 0 or higher. Furthermore, we compute the probability of this event happening by chance and require that it be less than 0.01.
A similar test was performed for the catalytic residue annotations from the Catalytic Site Atlas (CSA) . This is a database of active site residues from enzymes represented in the Protein Data Bank (PDB). Due to their functional importance, most of these residues are highly conserved within protein families. In our tests, we considered only those sites which are supported by the literature and also mapped to protein domain regions in the Pfam data set. This left us with 4147 sites mapping to 642 proteins. Subsequently, we counted how many sub-HMMs overlapped with these sites and found that 847 sub-HMMs overlapped with CSA residues. These corresponded to 2903 active sites from 546 proteins. Thus, our sub-HMM data set contained 70% of these active sites. The probability of observing ≥2903 overlaps among the two data sets just by chance is < 1.5 * 10-18. The complete result set of this analysis is available in Additional file 2:csa-comp.
The considerable agreement of our method with the PROSITE and CSA data sets indicates that the sub-HMM method is efficient in identifying many of the known functionally important residues in protein families. Therefore, it is reasonable to assume that the novel conserved regions, identified by this study, are a useful resource for characterizing the functional hotspots in protein sequences of known or unknown function in the future.
Search Performance Comparisons
We also performed more rigorous comparisons of our method against HMMER2, HMMER3, SAM and PSI-BLAST . Additionally, we tested our sub-HMM method with HMMER3 profile HMMs. In this case the sub-HMMs where excised from HMMER3 models and the HMMER3 search tool was used to map and score the individual sub-HMMs to the sequences. We then combined the scores as described in the Methods section. In the following text of this section, the sub-HMM experiments performed with HMMER2 and HMMER3 are referred to sub-HMM-HMMER2 and sub-HMM-HMMER3, respectively. In all tests we trained the models ourselves by randomly selecting 20% of the members from each protein family, but the training data were not included in the test data sets. HMMER2, HMMER3 and SAM use a multiple sequence alignment for the model building step. Since it was not our goal to test the alignment quality, we used the curated domain alignments provided by Pfam as input to all methods. Although SAM can create its own alignments, we forced it to use the alignments we provided to make this method more comparable to HMMER2 and HMMER3. For PSI-BLAST, we first created multiple sequence alignments for all the training data sets using CLUSTALW. Subsequently, we built PSSMs to search the test data set with PSI-BLAST. For all methods, we compared how well they could recover the remaining 80% in each protein family from the combined set of all test sequences. Due to computational resource constraints, it was not possible to test these methods on all Pfam families. Instead we created two smaller subsets of families, one composed of smaller families and one composed of larger families. The small family set contained 933 families randomly selected from Pfam 22.0 with of 10 to 100 members, while the large set contained 1002 families with more than 100 members. In addition, we tested the different methods on the HMMER3-based Pfam 24.0 data set. To maximize the comparability of the results, we selected only families that were available in both Pfam releases and fell into the same size categories. For the small set, we found 899 families in Pfam 24.0 but only 491 of them had less than 100 members. For the large set, 988 families were also available in Pfam 24.0 and all of them contained more than 100 members.
Since our method is designed to find short sequence similarities, it is expected to have a lower selectivity (higher false positive rate) than the other methods when reassembling family relationships that are based on longer domain similarities. In fact, such a performance characteristics on known family data sets is required for discovering novel conserved fragments in sequences that do not necessarily belong to the same domain family. The latter is the main utility feature of the sub-HMM method.
Discovery of Conserved Fragments in Protein Families with Sub-HMMs
Matches Among DUFs and DKFs.
sub-DKF → DKF
sub-DKF → DUF
sub-DUF → DKF
sub-DUF → DUF
The large number of sub-HMMs matching different Pfam domains indicates the usefulness of our sub-HMM approach for discovering short sequence features that are conserved among different protein domains. Due to their high conservation, an important functional role for many of these features can be expected. Many of the sub-DKFs will be useful for assigning potential functions to DUFs. A much more comprehensive study on applying our sub-HMM approach to biologically relevant questions will be published in an experimental journal.
We have developed a simple but effective method for identifying the most highly conserved residues in protein sequences in a fully automated manner. Its design strategy is highly practical and versatile by making efficient use of a well-established bioinformatic infrastructure, such as existing domain databases and profile HMM search tools. In addition, the conserved patterns, identified by this study, are useful for characterizing proteins of unknown function by associating them with those of known function by their common sub-HMMs. Furthermore, the sub-HMM search method appears to be a very effective tool for finding sequences that share only very short sequence similarities with a sensitivity performance similar to HMMER2. The possibility to ignore the order of different sub-HMM matches in sequences is another advantage, which will allow the identification of more complex similarity arrangements among otherwise unrelated sequences.
Extracting sub-HMMs from Profile HMMs
Once the consecutive regions of match states are identified from the original profile HMM, we convert each of them into a sub-HMM. Each sub-HMM has the same structure, transition probabilities, and observation distributions as the corresponding segments in the source HMM. As the original HMMs, the sub-HMMs begin and end with looped insertion states. Typically, a sub-HMM obtained from this process is identical to a profile HMM trained on the corresponding region of a multiple alignment that was used for generating the original profile HMM.
Scoring of Sub-HMM Matches
Here score y (S) is the score from Equation 1 for HMM y. The term |Y| log M arises from the uniform distribution over positions at which any sub-HMM might begin. We implemented our method in Java and used code from HMMEditor  to run the Viterbi algorithm. This score can be computed in time linear in M and the combined lengths of the sub-HMMs.
In the ROC performance tests, we scored sequences using sub-HMMs grouped by the Pfam families they were excised from. For all other tests, we scored individual sub-HMMs by using score y (S) as the final score.
PROSITE and CSA Comparisons
In equation (9), we replace the sum from the previous equation with the sum over the possible sizes of R. For each size, the binomial term gives the number of sets of size k, and the last term gives the probability of a set of size k. However, this bound is often too loose in practice. This is because for large values of pij*, the last term in equation (7) makes that term very small, whereas the corresponding term in our bound would still be large. Therefore, we adopt a method of removing extreme outliers to obtain a tighter bound.
where n' is the number of elements remaining in the intersection after the outliers have been removed. More details about this method are provided in Additional file 3:prosite_scoring.pdf.
We use the Hoeffding bound  to upper-bound the likelihood of finding a certain number of PROSITE or CSA overlaps with our sub-HMM data set by chance (that is, if the sub-HMM data set had instead been chosen at random). The Hoeffding bound states that if the random chance of any single test matching is p, then the probability of m or more matches in M tests is less than where .
For the PROSITE comparisons, matches are only considered if the prior probability is less than 0.01, therefore, p = 0.01. We found m = 1,055 overlaps out of a total set of M = 48, 535 sub-HMMs. This yields a p-value (by the Hoeffding bound) of less than 1.6 * 10-6 for the probability of our sub-HMMs matching PROSITE models at this level by chance.
For the CSA comparison, each site is only a single amino acid. We restrict the comparisons to only those sequences containing annotated CSA sites. There are M = 95, 076 amino acids matching our sub-HMMs, of which m = 2, 903 are annotated by CSA. There are a total of 261, 857 amino acids, of which 4, 147 are annotated by CSA. Therefore, , and we obtain (again with the Hoeffding bound), a p-value of less than 1.5 * 10-18 for the probability of our sub-HMMs overlapping these CSA-annotated amino acids by chance.
For the PSI-BLAST tests, the training sets were aligned with CLUSTALW  and then a PSSM was generated using blastpgp with just one round of searching. The test data was then scored by blastpgp using the trained PSSM as a starting point and running for up to 6 rounds. For each sequence, we recorded the maximum log-odds score from all the rounds. For the SAM tests, we extracted the aligned training data from the Pfam database and used them to train the models, forcing SAM to use the given alignments rather than create its own. These models were then used to classify the test data. In the case of HMMER2 and HMMER3, we trained models with hmmbuild and hmmcalibrate (HMMER2 only) using the same alignments as for the SAM tests. In all cases, HMMER2 tests were performed with HMMER2 models and HMMER3 tests with HMMER3 models. We then used these models to classify the test data with hmmsearch. If multiple domains were found in one sequence, the result from the best scoring one was used.
For the sub-HMM method, we used the aligned training data to build HMMER2 and HMMER3 models, and then extracted sub-HMMs from them. We then used our hmmsearch implementation to score each sequence according to our model. For all tests, the training sets consisted of a random selection of 20% of the sequences from each Pfam family, while the test database contained the union of the remaining sequences. The ROC curves where computed with the ROCR library  using the concatenation of all the scores for each method. Log-odds scores were used for all methods to obtain comparable results. In the case of SAM, we used reverse log-odds scores .
Availability of Software and Data Sets
The sub-HMM software developed by this project is available for free download from our web page: http://subhmm.ucr.edu. The site also contains download options of the complete set of extracted sub-HMMs and data for the Pfam network analysis, as well as a searchable web interface.
We thank the community projects - Pfam, HMMER, SAM and R - for providing the excellent software and data resources that were used by this study. TG acknowledges support from the Bioinformatics Core Facility, the Center for Plant Cell Biology (CEPCEB) and the Institute for Integrative Genome Biology (IIGB) at UC Riverside.
This work was supported by the National Science Foundation grant numbers 2010-0420152 and 2010-0820842.
- Jones S, Thornton JM: Searching for functional sites in protein structures. Curr Opin Chem Biol 2004, 8: 3–7. 10.1016/j.cbpa.2003.11.001View ArticlePubMed
- Bateman A, Coin L, Durbin R, Finn RD, Hollich V, Griffiths-Jones S, Khanna A, Marshall M, Moxon S, Sonnhammer EL, Studholme DJ, Yeats C, Eddy SR: The Pfam protein families database. Nucleic Acids Res 2004, (32 Database):138–141. 10.1093/nar/gkh121
- Finn RD, Mistry J, Schuster-Böckler B, Griffiths-Jones S, Hollich V, Lassmann T, Moxon S, Marshall M, Khanna A, Durbin R, Eddy SR, Sonnhammer EL, Bateman A: Pfam: clans, web tools and services. Nucleic Acids Res 2006, (34 Database):247–251. 10.1093/nar/gkj149
- Mulder NJ, Apweiler R, Attwood TK, Bairoch A, Bateman A, Binns D, Bork P, Buillard V, Cerutti L, Copley R, Courcelle E, Das U, Daugherty L, Dibley M, Finn R, Fleischmann W, Gough J, Haft D, Hulo N, Hunter S, Kahn D, Kanapin A, Kejariwal A, Labarga A, Langendijk-Genevaux PS, Lonsdale D, Lopez R, Letunic I, Madera M, Maslen J, McAnulla C, McDowall J, Mistry J, Mitchell A, Nikolskaya AN, Orchard S, Orengo C, Petryszak R, Selengut JD, Sigrist CJ, Thomas PD, Valentin F, Wilson D, Wu CH, Yeats C: New developments in the InterPro database. Nucleic Acids Res 2007, (35 Database):224–228. 10.1093/nar/gkl841
- López G, Valencia A, Tress ML: Firestar-prediction of functionally important residues using structural templates and alignment reliability. Nucleic Acids Res 2007, (35 Web Server):573–577. 10.1093/nar/gkm297
- López G, Valencia A, Tress M: FireDB-a database of functionally important residues from proteins of known structure. Nucleic Acids Res 2007, (35 Database):219–223. 10.1093/nar/gkl897
- Landau M, Mayrose I, Rosenberg Y, Glaser F, Martz E, Pupko T, Ben-Tal N: ConSurf 2005: the projection of evolutionary conservation scores of residues on protein structures. Nucleic Acids Res 2005, (33 Web Server):299–302. 10.1093/nar/gki370
- Capra JA, Singh M: Predicting functionally important residues from sequence conservation. Bioinformatics 2007, 23(15):1875–1882. 10.1093/bioinformatics/btm270View ArticlePubMed
- Mayer KM, Shanklin J: Identification of amino acid residues involved in substrate specificity of plant acyl-ACP thioesterases using a bioinformatics-guided approach. BMC Plant Biol 2007, 7: 1–1. 10.1186/1471-2229-7-1View ArticlePubMedPubMed Central
- Capra JA, Singh M: Characterization and prediction of residues determining protein functional specificity. Bioinformatics 2008, 24(13):1473–1480. 10.1093/bioinformatics/btn214View ArticlePubMedPubMed Central
- Krogh A, Brown M, Mian IS, Sjölander K, Haussler D: Hidden Markov Models in Computational Biology: Applications to Protein Modeling. Journal of Molecular Biology 1994, 235(5):1501–1531. 10.1006/jmbi.1994.1104View ArticlePubMed
- Baldi P, Chauvin Y, Hunkapiller T, McClure M: Hidden Markov Models of Biological Primary Sequence Information. Proceedings of the National Academy of Sciences 1994, 91(3):1059–1063. 10.1073/pnas.91.3.1059View Article
- Eddy S: Multiple alignment using hidden Markov models. Proc Int Conf Intell Syst Mol Biol 1995, 3: 114–20.PubMed
- Eddy SR: A probabilistic model of local sequence alignment that simplifies statistical significance estimation. PLoS Comput Biol 2008., 4(5): 10.1371/journal.pcbi.1000069
- Johannissen LO, Taylor WR: Protein fold comparison by the alignment of topological strings. Protein Eng 2003, 16(12):949–955. 10.1093/protein/gzg128View ArticlePubMed
- Redfern OC, Harrison A, Dallman T, Pearl FM, Orengo CA: CATHEDRAL: a fast and effective algorithm to predict folds and domain boundaries from multidomain protein structures. PLoS Comput Biol 2007., 3(11): 10.1371/journal.pcbi.0030232
- Hulo N, Bairoch A, Bulliard V, Cerutti L, De Castro E, Langendijk-Genevaux PS, Pagni M, Sigrist CJ: The PROSITE database. Nucleic Acids Res 2006, (34 Database):227–230. 10.1093/nar/gkj063
- Hulo N, Bairoch A, Bulliard V, Cerutti L, Cuche B, de Castro E, Lachaize C, Langendijk-Genevaux P, Sigrist C: The 20 years of PROSITE. Nucleic Acids Research 2008, (36 Database):D245.
- Porter CT, Bartlett GJ, Thornton JM: The Catalytic Site Atlas: a resource of catalytic sites and residues identified in enzymes using structural data. Nucl Acids Res 2004, 32(suppl_1):D129–133. 10.1093/nar/gkh028View ArticlePubMedPubMed Central
- Ivanisenko VA, Pintus SS, Grigorovich DA, Kolchanov NA: PDBSite: a database of the 3D structure of protein functional sites. Nucleic Acids Res 2005, (33 Database):183–187.
- Dundas J, Ouyang Z, Tseng J, Binkowski A, Turpaz Y, Liang J: CASTp: computed atlas of surface topography of proteins with structural and topographical mapping of functionally annotated residues. Nucleic Acids Res 2006, (34 Web Server):116–118. 10.1093/nar/gkl282
- Mistry J, Bateman A, Finn R: Predicting active site residue annotations in the Pfam database. BMC Bioinformatics 2007, 8: 298. 10.1186/1471-2105-8-298View ArticlePubMedPubMed Central
- Rawlings ND, Morton FR, Kok CY, Kong J, Barrett AJ: MEROPS: the peptidase database. Nucleic Acids Res 2008, (36 Database):320–325.
- Magliery TJ, Regan L: Sequence variation in ligand binding sites in proteins. BMC Bioinformatics 2005, 6: 240–240. 10.1186/1471-2105-6-240View ArticlePubMedPubMed Central
- Liang S, Zhang C, Liu S, Zhou Y: Protein binding site prediction using an empirical scoring function. Nucleic Acids Res 2006, 34(13):3698–3707. 10.1093/nar/gkl454View ArticlePubMedPubMed Central
- Sankararaman S, Sjölander K: INTREPID-INformation-theoretic TREe traversal for Protein functional site IDentification. Bioinformatics 2008, 24(21):2445–2452. 10.1093/bioinformatics/btn474View ArticlePubMedPubMed Central
- Fischer JD, Mayer CE, Soding J: Prediction of protein functional residues from sequence by probability density estimation. Bioinformatics 2008, 24(5):613–620. 10.1093/bioinformatics/btm626View ArticlePubMed
- Alterovitz R, Arvey A, Sankararaman S, Dallett C, Freund Y, Sjolander K: ResBoost: characterizing and predicting catalytic residues in enzymes. BMC Bioinformatics 2009, 10: 197–197. 10.1186/1471-2105-10-197View ArticlePubMedPubMed Central
- Plotz T, Fink G: A new approach for HMM based protein sequence family modeling and its application to remote homology classification. Statistical Signal Processing, 2005 IEEE/SP 13th Workshop on 2005, 1008–1013. full_textView Article
- Plotz T, Fink G: Feature extraction for improved Profile HMM based biological sequence analysis. Proc Int Conf on Pattern Recognition 2004.
- Grundy W, Bailey T, Elkan C, Baker M: Meta-MEME: motif-based hidden Markov models of biological sequences. Comput Appl Biosci 1997, 13: 397–406.PubMed
- Bailey T, Elkan C: Unsupervised Learning of Multiple Motifs in Biopolymers Using Expectation Maximization. Machine Learning 1995, 21: 51–80.
- Henikoff S, Henikoff J, Alford W, Pietrokovski S: Automated construction and graphical presentation of protein blocks from unaligned sequences. Gene 1995., 163(2): 10.1016/0378-1119(95)00486-P
- Sun Y, Buhler J: Designing Patterns and Profiles for Faster HMM Search. IEEE/ACM Trans Comput Biol Bioinformatics 2009, 6(2):232–243. 10.1109/TCBB.2008.14View Article
- Durbin R: Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press; 1998.View Article
- Eddy SR: Hidden Markov models. Curr Opin Struct Biol 1996, 6(3):361–365. 10.1016/S0959-440X(96)80056-XView ArticlePubMed
- Karplus K, Barrett C, Hughey R: Hidden Markov models for detecting remote protein homologies. Bioinformatics 1998, 14(10):846–856. 10.1093/bioinformatics/14.10.846View ArticlePubMed
- Sonnhammer E, Eddy S, Durbin R: Pfam: A comprehensive database of protein domain families based on seed alignments. Proteins Structure Function and Genetics 1997, 28(3):405–420. 10.1002/(SICI)1097-0134(199707)28:3<405::AID-PROT10>3.0.CO;2-LView Article
- Dempster A, Laird N, Rubin D, et al.: Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society 1977, 39: 1–38.
- Forney G Jr: The Viterbi algorithm. Proceedings of the IEEE 1973, 61(3):268–278. 10.1109/PROC.1973.9030View Article
- Rabiner LR: A tutorial on hidden Markov models and selected applications in speech recognition. 1990, 267–296.
- Kullback S, Leibler R: On information and sufficiency. Annals of Mathematical Statistics 1951, 22: 79–86. 10.1214/aoms/1177729694View Article
- Finn RD, Mistry J, Tate J, Coggill P, Heger A, Pollington JE, Gavin OL, Gunasekaran P, Ceric G, Forslund K, Holm L, Sonnhammer EL, Eddy SR, Bateman A: The Pfam protein families database. Nucleic Acids Res 2010, (38 Database):211–222. 10.1093/nar/gkp985
- Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 1997, 25(17):3389–3402. 10.1093/nar/25.17.3389View ArticlePubMedPubMed Central
- Meilă M: Comparing clusterings--an information based distance. J Multivar Anal 2007, 98(5):873–895. 10.1016/j.jmva.2006.11.013View Article
- Madera M: Profile Comparer: a program for scoring and aligning profile hidden Markov models. Bioinformatics 2008, 24(22):2630–2631. 10.1093/bioinformatics/btn504View ArticlePubMedPubMed Central
- Dai J, Cheng J: HMMEditor: a visual editing tool for profile hidden Markov model. BMC Genomics 2008, 9(Suppl 1):S8. 10.1186/1471-2164-9-S1-S8View ArticlePubMedPubMed Central
- Gattiker A, Gasteiger E, Bairoch A: ScanProsite: a reference implementation of a PROSITE scanning tool. Appl Bioinformatics 2002, 1(2):107–108.PubMed
- Hoeffding W: Probability Inequalities for Sums of Bounded Random Variables. Journal of the American Statistical Association 1963, 58(301):13–30. 10.2307/2282952View Article
- Thompson JD, Higgins DG, Gibson TJ: CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res 1994, 22(22):4673–4680. 10.1093/nar/22.22.4673View ArticlePubMedPubMed Central
- Sing T, Sander O, Beerenwinkel N, Lengauer T: ROCR: visualizing classifier performance in R. Bioinformatics 2005, 21(20):3940–3941. 10.1093/bioinformatics/bti623View ArticlePubMed
- Karplus K, Karchin R, Shackelford G, Hughey R: Calibrating E-values for hidden Markov models using reverse-sequence null models. Bioinformatics 2005, 21(22):4107–4115. 10.1093/bioinformatics/bti629View ArticlePubMed
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.