- Research article
- Open Access
Improving model construction of profile HMMs for remote homology detection through structural alignment
© Bernardes et al; licensee BioMed Central Ltd. 2007
- Received: 19 March 2007
- Accepted: 09 November 2007
- Published: 09 November 2007
Remote homology detection is a challenging problem in Bioinformatics. Arguably, profile Hidden Markov Models (pHMMs) are one of the most successful approaches to this important problem. pHMM packages have a relatively small computational cost and perform particularly well at recognizing remote homologies. This raises the question of whether structural alignments could affect the performance of pHMMs trained from proteins in the Twilight Zone, as structural alignments are often more accurate than sequence alignments at identifying motifs and functional residues. Here, we assess the impact of structural alignments on pHMM performance.
We used the SCOP database to perform our experiments. Structural alignments were obtained using the 3DCOFFEE and MAMMOTH-mult tools; sequence alignments were obtained using CLUSTALW, TCOFFEE, MAFFT and PROBCONS. We performed leave-one-family-out cross-validation over super-families. Performance was evaluated through ROC curves and paired two-tailed t-tests.
We observed that pHMMs derived from structural alignments performed significantly better than pHMMs derived from sequence alignment in low-identity regions, mainly below 20%. We believe this is because structural alignment tools are better at focusing on the important patterns that are more often conserved through evolution, resulting in higher quality pHMMs. On the other hand, sensitivity of these tools is still quite low for these low-identity regions. Our results suggest a number of possible directions for improvements in this area.
- Structural Alignment
- SCOP Database
- Profile Hidden Markov Model
- Remote Homology
- Twilight Zone
Hidden Markov models (HMMs) are probabilistic models used in pattern recognition problems. HMMs were initially used for speech recognition tasks. Nowadays, HMMs are applied successfully to several molecular biology problems, including gene finding [3, 4], multiple sequence alignment [5–7], protein structure prediction [8–10], and many others. One particularly important application of HMMs is remote homology detection between protein sequences. Remote homology detection is the problem of finding homology between sequences when the actual sequence identity is low (usually lower than 30%). An HMM is first trained to represent a group of homologous sequences, and a query sequence is then matched against this HMM. The HMMs used to represent groups of homologous sequences are called profile hidden Markov models (pHMMs) [11, 12]. Several studies have shown pHMMs to perform better than methods based on sequence similarity only [13, 14], such as BLAST and FASTA, and better than methods based on position-specific scoring matrices (PSSMs), such as PSI-BLAST.
A pHMM is therefore a probabilistic model built from a multiple alignment of related sequences. The two major programs that apply pHMMs to remote homology detection are HMMER and SAM. Both programs are widely used within the Bioinformatics community: HMMER was used to build the PFAM database, and SAM was used to build the SUPERFAMILY database. In these tools, an alignment is represented by creating a sequence of nodes, usually one node per alignment column. Each node is composed of three states: match (M), insert (I) and delete (D). Match states model conserved regions in the alignment. Insert and delete states model indel regions.
Profile HMMs assign probabilities to two kinds of events: a transition from one state to another, and the emission of a specific character by a specific state (say, a specific amino acid when comparing proteins). Only match and insert states generate characters. Delete states are quiet. Therefore, each match and insert state has an emission probability distribution. In the case of proteins, the distribution will have 20 entries, one per amino acid.
Transitions define the structure of the pHMM. Systems such as SAM allow transitions between all types of states, totaling 3 transitions per state and hence 9 per node. This is not always the case: the HMMER system relies on the Plan7 model, which disallows I → D and D → I transitions and therefore leaves 7 transitions per node.
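The transition counts mentioned above can be checked with a short enumeration. This is only an illustrative sketch: the function name `allowed_transitions` and the `plan7` flag are hypothetical and do not correspond to any interface of SAM or HMMER.

```python
from itertools import product

STATES = ("M", "I", "D")  # match, insert, delete


def allowed_transitions(plan7):
    """List the (from_state, to_state) pairs available within one node.

    With plan7=False every state may reach every state type (3 x 3 = 9).
    With plan7=True the I -> D and D -> I transitions are removed.
    """
    forbidden = {("I", "D"), ("D", "I")} if plan7 else set()
    return [pair for pair in product(STATES, repeat=2) if pair not in forbidden]


full_model = allowed_transitions(plan7=False)
plan7_model = allowed_transitions(plan7=True)
print(len(full_model), len(plan7_model))
```

The two lists differ only in the two insert/delete pairs, which is why the fully connected architecture has 9 transitions per node and Plan7 has 7.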
Performance of a pHMM critically depends on the quality of the estimated emission and transition probabilities. Emission probabilities are obtained by counting amino-acid frequencies at each match column. Unfortunately, the alignment will usually have too few sequences to estimate all the parameters with sufficient confidence. Priors, such as mixtures of Dirichlet components, are used to compensate for the small sample size and avoid over-fitting. A second major issue when estimating parameters is the relationship between the sequences themselves. Clearly, the information that a residue is conserved across a number of very different sequences should carry more weight than the information that the residue is conserved across a large number of very similar sequences. Most pHMMs thus include a sequence weighting step, which may be based on sequence trees, as in HMMER, or on entropy, as in SAM. In all cases, closer sequences carry less weight than more divergent sequences. Last, notice that the total weight of the sequences governs how much we trust the sequences versus the prior. Increasing the total weight of the sequence counts over the priors reinforces our trust in the sequence data, but may lead to over-fitting.
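The interplay between weighted counts and the prior can be sketched as follows. This is a deliberate simplification, not the procedure of either package: a single uniform pseudocount stands in for the Dirichlet mixture, and the weights are supplied by hand rather than derived from a tree or from entropy. The names `emission_probabilities`, `pseudocount` and the toy column are assumptions made for illustration.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def emission_probabilities(column, weights, pseudocount=1.0):
    """Estimate match-state emission probabilities for one alignment column.

    column      -- one character per sequence (a residue or a gap symbol)
    weights     -- one non-negative weight per sequence
    pseudocount -- uniform prior count added to each amino acid (assumption:
                   a stand-in for a Dirichlet mixture prior)
    """
    counts = dict.fromkeys(AMINO_ACIDS, 0.0)
    for residue, weight in zip(column, weights):
        if residue in counts:  # gaps and unknown symbols are skipped
            counts[residue] += weight
    total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
    return {aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS}


# Four near-identical sequences (L) are down-weighted; the divergent one (V) is not.
column = "LLLLV-"
weights = [0.25, 0.25, 0.25, 0.25, 1.0, 1.0]
probs = emission_probabilities(column, weights)
```

With these weights the four similar sequences together count as much as the single divergent one. Multiplying all weights by a constant greater than one moves the estimate toward the observed counts and away from the prior, which is the trade-off described in the paragraph above.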
To the best of our knowledge, Madera and Gough were the first to systematically compare the performance of the two systems. Their comparison studied the performance of the two tools over two protein families, globins and cupredoxins, using the nrdb90 database, and in an all-against-all experiment in the SCOP database. Several alignment strategies were used, including manual alignment on globins and cupredoxins, SAM-T99 seeded from a single protein, and a WU-BLAST search from the seed protein followed by CLUSTALW. The authors show that the initial multiple alignment can significantly affect performance, and that the T99 package generates good quality multiple alignments. Their results further suggested that SAM had better model quality than HMMER. Wistrand and Sonnhammer further evaluated the two systems. The experiments relied on SCOP for a high quality database of labeled hierarchies of protein domains. The authors explicitly avoided conditioning on the use of particular programs to perform the initial multiple alignment. Instead, they used the PFAM alignment database. The authors concluded that SAM's model estimation is superior, due to a better usage of priors, which avoids over-fitting. On the other hand, HMMER's model scoring is more accurate, probably due to a better null model.
Madera and Gough's work showed the importance of multiple alignment for HMMER performance. It has been observed that protein three-dimensional structures are remarkably stable with respect to amino acid sequences. This suggests that alignments derived from structural information should identify motifs and functional residues accurately. In this direction, Jones and Bateman assessed the performance of pHMMs derived from structural alignments versus sequence alignments. The benchmark was obtained from the PFAM and HOMSTRAD databases. HOMSTRAD is a curated database of structure-based alignments for homologous protein families, and PFAM is a large collection of multiple sequence alignments and hidden Markov models covering many common protein domains and families. To build a mapping between HOMSTRAD and PFAM families, the sequences of each HOMSTRAD family were searched against PFAM using HMMER. Each HOMSTRAD family was thus made to correspond to a single PFAM family. These PFAM memberships were considered the true positive data set. To provide sequence alignments, the sequences of each HOMSTRAD family were realigned using CLUSTALW and TCOFFEE. The authors concluded that the use of structure information to increase alignment accuracy does not aid homologue detection with pHMMs. However, their experiments considered sequences with different degrees of identity, from 20% up to 80%, and the authors did not apply their experiments to proteins in the Twilight Zone, where identity between amino acid sequences is a weaker indicator of the evolutionary relationship.
This study investigates the contribution of structural alignments to building pHMMs for remote homology detection. Our experiments therefore consider proteins with identity below 30%. We performed our studies by analyzing the performance of these tools on SCOP super-families. Under these conditions, we show that pHMMs derived from structural alignments perform significantly better than pHMMs derived from sequence alignments. We show that alignment accuracy is not directly related to alignment identity: although structural alignments often present lower identity than sequence alignments, the best quality alignments are generally considered to derive from structural superposition. We compare the performance of two pHMM packages, HMMER and SAM, when the two different kinds of alignments are used. Our results show that HMMER based on structural alignments outperforms SAM for such remote homologues.
We compare sequence-based and structure-based multiple alignment packages on the SCOP Protein Database. We experimentally evaluated the performance of the HMMER and SAM packages using alignments from four sequence and two structural multiple-alignment packages. All data sets and Perl scripts used in this study are freely available from the project web site.
Multiple Alignment Tools
We used the CLUSTALW, TCOFFEE, MAFFT, and PROBCONS packages to provide sequence alignments based on primary structure. CLUSTALW is one of the most widely used tools for multiple sequence alignment. TCOFFEE has been reported to achieve significantly better quality alignments than CLUSTALW. MAFFT is a series of five progressive alignment programs; we used L-INS-i, an algorithm based on progressive alignment with iterative refinement. We used the 3DCOFFEE and MAMMOTH-mult packages to provide structural alignments. 3DCOFFEE extends TCOFFEE with structural alignment information. MAMMOTH-mult is a package that seems to achieve good performance by focusing on structural information.
CLUSTALW is a progressive alignment algorithm. First, it derives a guide tree; it then uses a greedy search over aligned clusters of sequences. Although it runs faster and uses less memory than other programs, it is arguably less accurate and less scalable than more recent ones.
TCOFFEE also implements a progressive alignment algorithm. However, it tries to improve the quality of the initial pair-wise sequence alignment by considering the alignment between all the pairs as it executes every step in the progressive alignment algorithm. It achieves high accuracy at the cost of computation time and memory usage.
The MAFFT package includes five alignment programs. We used the recommended option, L-INS-i, which applies a progressive aligner followed by iterative refinement.
PROBCONS uses a combination of probabilistic modeling and consistency-based alignment techniques. It introduces a novel scoring function, probabilistic consistency, based on paired hidden Markov models. Alignments are still performed progressively but a post-processing refinement step may apply.
The 3DCOFFEE aligner is based on TCOFFEE, but it uses pairwise structure comparison to improve accuracy. Pairwise structure comparison is performed by SAP if both structures are known. If only one structure is known, 3DCOFFEE uses the Fugue threading method.
MAMMOTH is a progressive multiple alignment program that uses a sequence-independent heuristic to obtain a fully structural alignment. First, it starts from a Cα trace to obtain an alignment. Second, it finds an alignment of local structures based on computing a similarity score from the URMS metrics. Third, it finds similar local structures with their Cα close in Cartesian space.
We compare two arguably major profile Hidden Markov Model (pHMM) packages, HMMER and SAM.
The HMMER package was developed at Sean Eddy's lab, Washington University in St. Louis. It provides an open-source environment based on pHMMs for protein sequence analysis. Besides the PFAM database, HMMER is also at the heart of other databases, such as TIGRFAMs and SMART. In this work we used HMMER version 2.3.2, updated in 2003. HMMER requires at least two stages: model building and scoring. A third stage, model calibration, is optional but recommended; we used it in this study.
In model building, HMMER distinguishes match alignment columns from insert alignment columns. HMMER assigns columns to match or insert states so as to maximize the posterior probability of the aligned sequences, given the model. By default, HMMER uses a Dirichlet mixture with 9 components for priors. Scoring was performed using the Viterbi algorithm. We used the hmmbuild procedure to build HMMER models and hmmsearch to score sequences against them. In our experiments we used HMMER default parameters.
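The three HMMER stages (build, calibrate, score) can be sketched as a sequence of command lines. The sketch below only assembles argument lists and does not run anything; the file names are hypothetical, the positional argument order follows our understanding of the HMMER 2 command-line tools, and the explicit `-E` cutoff is shown only to make the e-value threshold visible (the study itself reports using default parameters).

```python
def hmmer2_commands(alignment_file, model_file, database_file, evalue_cutoff=10.0):
    """Return argv lists for building, calibrating and searching with HMMER 2.

    All file names are placeholders chosen by the caller.
    """
    return [
        # 1. Build a pHMM from a multiple alignment.
        ["hmmbuild", model_file, alignment_file],
        # 2. Calibrate the model so that e-values can be estimated.
        ["hmmcalibrate", model_file],
        # 3. Score every sequence of the database against the model.
        ["hmmsearch", "-E", str(evalue_cutoff), model_file, database_file],
    ]


commands = hmmer2_commands("superfamily.aln", "superfamily.hmm", "scop_test.fasta")
for argv in commands:
    print(" ".join(argv))
```

Where HMMER 2 is installed, each list could be passed to `subprocess.run(argv, check=True)`; that step is left out here so the sketch runs without the package.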
The SAM package was developed at the University of California, Santa Cruz; it is not open source but is free for academic use. One of the major differences between SAM and HMMER is the SAM-T2K script, an iterative procedure that generates multiple alignments and HMMs starting from a single sequence. Moreover, the SAM team has worked on improving SAM by using information on protein structure, and on prior probabilities. SAM uses a standard profile HMM architecture with 9 transitions. Each alignment column corresponds to a node (match, insert and delete). In other words, SAM does not distinguish between match and insert columns. SAM uses a Dirichlet mixture with 20 components for priors and by default scores using the forward algorithm. We used modelfromalign to build the models and hmmscore to score sequences against them. In our experiments we used SAM default parameters.
Our experiments require structure coordinates for protein sets with low sequence identity. Therefore we used the SCOP database, version 1.67, with 6600 protein sequences. SCOP is a manually inspected database of protein folds, and is particularly interesting for our study because it describes structural and evolutionary relationships between proteins, including all entries in the Protein Data Bank. SCOP is thus an excellent data set for evaluating the performance of remote homology detection methods, and it has been widely used for that purpose [31, 50–53]. SCOP classifies all protein domains of known structure into a hierarchy with four levels: class, fold, super-family, and family. In our study, we work at the super-family level, which groups families such that a common evolutionary origin is not obvious from sequence identity, but probable from an analysis of structure and from functional features. We believe that this level best represents remote homologies.
Note that in our experiments, none of the sequences in a test set had more than 30% sequence identity with any protein in the corresponding training set. Results were analyzed graphically by building ROC curves. We experimented with e-value thresholds between 10^-50 and 10 to obtain the curves. Finally, we used the paired two-tailed t-test to assess significance. We consider a result with p ≤ 0.02 (i.e. 98% confidence) to be significant.
As a first step, we categorize our alignment data set according to both the number of sequences and the average length of sequences within each SCOP super-family. In our data set, the number of sequences per super-family ranges from 3 in the smallest super-family up to 44 in the largest. On average, we worked with 23 sequences per alignment. Compared with previous work on aligning families, we observe that super-families give us many more training examples to construct the pHMMs. Regarding sequence length, the average sequence length within SCOP super-families is well distributed between short sequences with fewer than 50 residues and long sequences with up to 400 residues. On average, we worked with sequences of 193 residues.
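Sweeping the e-value threshold to obtain ROC points can be sketched as below. This is a minimal illustration under stated assumptions: each scored sequence is represented by a hypothetical `(evalue, is_homologue)` pair, a sequence counts as reported when its e-value is at or below the cutoff, and the curve is expressed as raw counts of false and true positives. The toy hits are invented and are not results from the study.

```python
def roc_points(hits, thresholds):
    """Count false and true positives at each e-value threshold.

    hits       -- list of (evalue, is_homologue) pairs, one per scored sequence
    thresholds -- e-value cutoffs; a hit is reported when evalue <= cutoff
    Returns a list of (cutoff, false_positives, true_positives) tuples.
    """
    points = []
    for cutoff in sorted(thresholds):
        reported = [positive for evalue, positive in hits if evalue <= cutoff]
        true_positives = sum(1 for positive in reported if positive)
        false_positives = len(reported) - true_positives
        points.append((cutoff, false_positives, true_positives))
    return points


# Invented example data spanning the range of cutoffs used in the text.
hits = [(1e-60, True), (1e-12, True), (1e-3, False), (0.5, True), (4.0, False)]
thresholds = [1e-50, 1e-20, 1e-10, 1e-2, 1.0, 10.0]
curve = roc_points(hits, thresholds)
```

Loosening the cutoff can only add reported hits, so both counts are non-decreasing along the curve; a better model reaches a given number of true positives with fewer false positives.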
Alignment Length and Gap Percentage by Alignment Tool
First, we assess average alignment length. CLUSTALW seems to generate the shortest alignments, with 318 residues on average. MAFFT and MAMMOTH generate longer alignments, around 400 residues on average. The longest alignments are generated by PROBCONS, followed by the COFFEE family. Notice that 3DCOFFEE generates somewhat longer alignments than TCOFFEE. Next, we measure the percentage of gaps within alignments. CLUSTALW introduced the smallest gap percentage. MAFFT produced alignments with fewer gaps than MAMMOTH. The COFFEE family and PROBCONS produce the longest alignments and have the highest percentage of gaps.
Gap introduction is clearly related to alignment length, and thus to identity. In general, sequence alignment tools need to introduce gaps to preserve identity across sequences. As a case in point, PROBCONS achieves the highest average identity, but gaps made up 60% of PROBCONS alignments. We observed a similar pattern in TCOFFEE alignments. MAFFT achieves lower identity but also introduces fewer gaps. Analogously, CLUSTALW presented the lowest average identity, and also introduced the smallest number of gaps in its alignments. Comparing the structural alignments, 3DCOFFEE achieved higher average identity than MAMMOTH, and also introduced more gaps than MAMMOTH. A Pearson test shows the correlation between alignment percentage and gaps to be indeed quite high, at 94%.
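The two quantities discussed above can be computed from an alignment as in the following sketch. It assumes that alignment length means the number of columns and that gap percentage means the share of gap symbols among all cells of the alignment; the source does not spell out either definition, and the function name and toy rows are hypothetical.

```python
def alignment_stats(rows, gap_chars="-."):
    """Return (alignment_length, gap_percentage) for equal-length aligned rows."""
    if not rows or not rows[0] or len({len(row) for row in rows}) != 1:
        raise ValueError("rows must be non-empty and all of the same length")
    length = len(rows[0])
    gaps = sum(row.count(symbol) for row in rows for symbol in gap_chars)
    return length, 100.0 * gaps / (length * len(rows))


# Toy alignment: 3 rows, 8 columns, 6 gap symbols out of 24 cells.
rows = ["MK-LV--A", "MKQLVG-A", "M--LVGSA"]
length, gap_pct = alignment_stats(rows)
print(length, gap_pct)
```

For the toy rows this reports 8 columns and 25% gaps.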
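The Pearson correlation used in this paragraph can be written in a few lines of standard-library Python. This is a generic sketch of the coefficient itself; the input vectors below are toy values chosen for illustration and are not the per-tool measurements of the study.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return covariance / sqrt(spread_x * spread_y)


# Toy values: a perfectly linear pair and a perfectly anti-linear pair.
r_up = pearson([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
r_down = pearson([1.0, 2.0, 3.0], [3.0, 2.0, 1.0])
```

A value close to 1, such as the 94% reported above, indicates that the two quantities rise together almost linearly across the compared alignments.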
We assessed HMMER performance using multiple alignments generated by CLUSTALW, TCOFFEE, MAFFT, PROBCONS, 3DCOFFEE, and MAMMOTH. For a super-family with N elements, the results indicate whether models trained on N - 1 families can predict the sequences in the remaining family. Please see the Methods section above for further discussion on the experimental methodology.
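The leave-one-family-out scheme described above can be sketched as a generator of training and test splits. The representation is an assumption made for illustration: a super-family is a dictionary from hypothetical family identifiers to lists of sequence identifiers, and only the members of the held-out family are returned as positives to be recognized.

```python
def leave_one_family_out(families):
    """Yield (held_out_family, training_sequences, test_sequences) splits.

    families -- dict mapping a family identifier to the list of its sequences,
                all belonging to the same super-family
    """
    for held_out in families:
        training = [
            sequence
            for name, members in families.items()
            if name != held_out
            for sequence in members
        ]
        yield held_out, training, list(families[held_out])


# Hypothetical super-family with three families.
superfamily = {"fam_a": ["a1", "a2"], "fam_b": ["b1"], "fam_c": ["c1", "c2", "c3"]}
splits = list(leave_one_family_out(superfamily))
for held_out, training, test in splits:
    print(held_out, training, test)
```

A super-family with N families yields N splits, and in each split the model is built only from sequences of the other N - 1 families.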
HMMER Significance Results
For better understanding, we further partition our results according to identity ranges. Given that our best results were obtained from HMMER-3DCOFFEE, we rely on 3DCOFFEE as our measure of sequence identity.
SAM Significance Results
The best results were achieved with SAM-3DCOFFEE, followed by SAM-MAMMOTH. The difference between the two was not statistically significant. The pHMMs derived from sequence alignments achieved worse results, but surprisingly SAM-CLUSTALW and SAM-MAFFT actually perform significantly better than SAM-PROBCONS. The difference between SAM-TCOFFEE and SAM-PROBCONS is not significant. On the other hand, there is a clear difference between SAM-CLUSTALW, SAM-PROBCONS and SAM-TCOFFEE.
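The significance statements in this section rest on a paired two-tailed t-test. The statistic itself can be sketched as follows, assuming the paired observations are matched measurements for two tool combinations (for example one value per super-family); the source does not spell out the pairing here, and the input numbers are toy values rather than results from the study.

```python
from math import sqrt


def paired_t_statistic(first, second):
    """Return (t, degrees_of_freedom) for two paired samples."""
    if len(first) != len(second) or len(first) < 2:
        raise ValueError("need two paired samples of equal length >= 2")
    differences = [x - y for x, y in zip(first, second)]
    n = len(differences)
    mean = sum(differences) / n
    variance = sum((d - mean) ** 2 for d in differences) / (n - 1)
    if variance == 0:
        raise ValueError("t is undefined when all differences are equal")
    return mean / sqrt(variance / n), n - 1


# Toy paired measurements.
t_value, dof = paired_t_statistic([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0])
print(round(t_value, 4), dof)
```

The two-tailed p-value then follows from the Student t distribution with the returned degrees of freedom; a library routine such as `scipy.stats.ttest_rel` computes both the statistic and the p-value in one call.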
HMMER and SAM Performance
HMMER-SAM Significance Results
HMMER-CLUSTALW × SAM-CLUSTALW
HMMER-TCOFFEE × SAM-TCOFFEE
HMMER-MAFFT × SAM-MAFFT
HMMER-PROBCONS × SAM-PROBCONS
HMMER-3DCOFFEE × SAM-3DCOFFEE
HMMER-MAMMOTH × SAM-MAMMOTH
Detecting remote homologues is an important but hard problem, as there is high divergence between training sequences. Several approaches have been proposed to improve pHMM performance in these conditions [56–58]. A natural approach is to use protein structural information to improve model quality [52, 59, 60]. In this work, we investigated whether one can leverage preexisting tools, such as SAM and HMMER, by applying multiple alignments based on structural information.
The major question we address is therefore whether pHMMs for remote homology detection will benefit from structural alignments. Previous work showed negative results on sequences with identity between 20% and 80%. To study whether similar results would apply to the Twilight Zone, we performed experiments comparing performance across SCOP super-families. We used the SCOP database, as this is the standard database with structural information used in most related studies. Throughout, we used leave-one-family-out cross-validation instead of leave-one-sequence-out, as we believe this most closely represents the problem of finding a novel remote homologue.
Our focus was on how HMMER and SAM can benefit from structural information. We therefore used the two tools with external alignments. SAM is often used together with the T99 aligner (which can use secondary but not tertiary structure information).
Our results show a clear benefit from using structural aligners. The benefit was noticeable for both SAM and HMMER. A detailed analysis shows that the improvement was obtained in the 10–20% identity range in both cases. Below 10%, identity is too low and the tools do not generate useful models. Above 20% identity, for both SAM and HMMER, alignments from the sequence-based tools start performing comparably to the structural aligners, a result consistent with the literature.
Studying the difference between TCOFFEE and 3DCOFFEE is particularly enlightening, as the two aligners mostly differ on the use of structural information. There is indeed a significant difference between the two tools in this study, and the difference applies both to SAM and HMMER models. Moreover, the difference stems from lower identity, in the 10–20% identity range, and disappears as sequences become more conserved.
We found no correlation between alignment size and model performance. PROBCONS consistently generates the longest alignments, but it does not outperform the other tools. MAMMOTH tends to generate relatively short alignments, and performs well in this study. This would suggest that the problem is not just finding conserved regions, but that the aligners might be reporting regions to be conserved when they are not.
Although our key results are similar for SAM and HMMER, we did observe a number of interesting differences. First, our studies indicate better sensitivity of HMMER-based models than of SAM-based models. Second, some aligners perform quite differently when they are used by SAM and by HMMER. Namely, PROBCONS-generated alignments perform particularly badly with SAM. In fact, SAM-CLUSTALW actually outperforms SAM-PROBCONS.
We believe that the explanation for both phenomena lies in the way that HMMER and SAM treat their input alignments. SAM is designed to be used together with the T99 aligner, and thus each column in the multiple alignment results in a state on the resulting pHMM. In contrast, HMMER is designed to be used with external aligners. Thus, it implements a MAP algorithm to estimate the actual number of states. Our results do show this MAP algorithm to significantly reduce the number of states for HMMER.
Finding remote homologues is a hard but important problem in molecular biology. We study the performance of two pHMM-based tools, SAM and HMMER, when provided with structural and sequence alignments. We reach two main conclusions. First, structural alignments are very important in low-identity regions, below 20%. Using structural information can significantly improve performance in this task. On the other hand, our results indicate that alignments are of low quality, even in the best case. Thus sensitivity is still quite low: we detected at most 200 of about 1000 sequences in our study.
We believe that there is still much open work in achieving the best performance in recognizing remotely related proteins. Our results suggest a number of possible directions for improvements in this area. The good results obtained by 3DCOFFEE, which performs quite well both when compared to a tool such as MAMMOTH-mult, designed from the beginning to perform structural alignments, and when compared with the corresponding sequence aligner, TCOFFEE, suggest that similar improvements could be considered for other sequence aligners. Our results also show that structural identity does provide a good prior on alignment quality. In current approaches, this prior is only used to generate the alignments. It would be interesting to go one step further and to integrate this information with the model construction process itself.
We are grateful to CNPq for financial support. Most of Vitor S Costa's contribution was given while Assistant Professor at UFRJ. However, he was partially supported by funds granted to LIACC through the Programa de Financiamento Plurianual, Fundação para a Ciência e Tecnologia and Programa POSC. We thank the referees for their insightful comments that very much contributed to improve our paper.
- Rabiner L: A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition. Proceedings of the IEEE 1989, 77: 257–286. 10.1109/5.18626
- Mendel M: A commercial large-vocabulary discrete speech recognition system: Dragon Dictate. Language Speech 1992, 35: 237–246.
- Majoros W, Pertea M, Salzberg S: Efficient implementation of a generalized pair hidden Markov model for comparative gene finding. Bioinformatics 2005, 21: 1782–1788. 10.1093/bioinformatics/bti297
- Brejova B, Brown D, Li M, Vinar T: ExonHunter: a comprehensive approach to gene finding. Bioinformatics 2005, 21: 57–65. 10.1093/bioinformatics/bti1040
- Mamitsuka H: Finding the biologically optimal alignment of multiple sequences. Artificial Intelligence in Medicine 2005, 35: 9–18. 10.1016/j.artmed.2005.01.007
- Edgar R, Sjolander K: COACH: profile-profile alignment of protein families using hidden Markov models. Bioinformatics 2004, 20: 1309–1318. 10.1093/bioinformatics/bth091
- Knudsen B, Miyamoto M: Sequence alignments and pair hidden Markov models using evolutionary history. Journal of Molecular Biology 2003, 333: 453–460. 10.1016/j.jmb.2003.08.015
- Bae K, Mallick B, Elsik C: Prediction of protein interdomain linker regions by a hidden Markov model. Bioinformatics 2005, 21: 2264–2270. 10.1093/bioinformatics/bti363
- Camproux AC, Tufféry P: Hidden Markov model-derived structural alphabet for proteins: the learning of protein local shapes captures sequence specificity. Biochim Biophys Acta 2005, 1724(3):394–403.
- Lin K, Simossis V, Taylor W, Heringa J: A simple and fast secondary structure prediction method using hidden neural networks. Bioinformatics 2005, 21: 152–159. 10.1093/bioinformatics/bth487
- Krogh A, Brown M, Mian I, Sjolander K, Haussler D: Hidden markov models in computational biology applications to protein modeling. Journal of Molecular Biology 1994, 235: 1501–1531. 10.1006/jmbi.1994.1104
- Hughey R, Krogh A: Hidden markov models for sequence analysis: extension and analysis of the basic method. Computer Applications in the Biosciences 1996, 12: 95–107.
- Gough J, Karplus K, Hughey R, Chothia C: Assignment of homology to genome sequences using a library of hidden Markov models that represent all proteins of known structure. Journal of Molecular Biology 2001, 313: 903–919. 10.1006/jmbi.2001.5080
- Park J, Karplus K, Barrett C, Hughey R, Haussler D, Hubbard T, Chothia C: Sequence comparisons using multiple sequences detect three times as many remote homologues as pairwise methods. Journal of Molecular Biology 1998, 284: 1201–1210. 10.1006/jmbi.1998.2221
- Altschul F, Gish W, Miller W, Myers E, Lipman D: A basic local alignment search tool. Journal of Molecular Biology 1990, 215: 403–410.View ArticlePubMedGoogle Scholar
- Pearson WR: Rapid and sensitive sequence comparison with FASTP and FASTA. Methods Enzymol 1985, 183: 63–98.View ArticleGoogle Scholar
- Gribskov M, McLachlan A, Eisenberg D: Profile analysis: detection of distantly related proteins. National Academy of Sciences 1987, 84: 4355–4358. 10.1073/pnas.84.13.4355View ArticleGoogle Scholar
- Altschul S, Madden T, Schaffer A, Zhang J, Zhang Z, Miller W, Lipman D: PSI-BLAST searches using hidden markov models of structural repeats: prediction of an unusual sliding DNA clamp and of beta-propellers in UV-damaged DNA-binding protein. Nucleic Acids Research 2000, 28: 3570–3580. 10.1093/nar/28.18.3570View ArticleGoogle Scholar
- Eddy S: Profile hidden Markov models. Bioinformatics 1998, 14: 755–763. 10.1093/bioinformatics/14.9.755View ArticlePubMedGoogle Scholar
- Hughey R, Krogh A: Hidden Markov models for sequence analysis: extension and analysis of the basic method. Computer Applications in the Biosciences 1996, 12: 95–107.PubMedGoogle Scholar
- Bateman A, Coin L, Durbin R, Finn R, Hollich V, Griffiths S, Khanna A, Marshall M, Moxon S, Sonnhammer E, Studholme D, Yeats C, Eddy S: The Pfam Protein Families Database. Nucleic Acids Research 2004, 32: 138–141. 10.1093/nar/gkh121View ArticleGoogle Scholar
- Sjolander K, Karplus K, Brown M, Hughey R, Krogh A, Mian I, Haussler D: Dirichlet mixtures: a method for improving detection of weak but significant protein sequence homology. Computer Applications in the Biosciences 1996, 12(4):327–345.PubMedGoogle Scholar
- Thompson J, Gibson T: Improved sensitivity of profile searches through the use of sequence weights and gap excision. Computer Applications in the Biosciences 1994, 10: 19–29.PubMedGoogle Scholar
- Krogh A, Mitchison G: Maximum entropy weighting of aligned sequences of proteins or DNA. Proc Int Conf Intell Syst Mol Biol 1995, 3: 215–221.PubMedGoogle Scholar
- Madera M, Gough J: A comparison of profile hidden Markov model procedure for remote homology detection. Nucleic Acids Research 2002, 30: 4321–4328. 10.1093/nar/gkf544PubMed CentralView ArticlePubMedGoogle Scholar
- Holm L, Sander C: Removing near-neighbour redundancy from large protein sequence collections. Bioinformatics 1998, 14: 423–429. 10.1093/bioinformatics/14.5.423View ArticlePubMedGoogle Scholar
- Andreeva A, Howorth D, Brenner S, Hubbard T, Chothia C, Murzin A: SCOP database in 2004: refinements integrate structure and sequence family data. Nucleic Acids Research 2004, 32: 226–229. 10.1093/nar/gkh039View ArticleGoogle Scholar
- Karplus K, Barrett C, Hughey R: Hidden Markov models for detecting remote protein homologies. Bioinformatics 1998, 14: 846–856. 10.1093/bioinformatics/14.10.846View ArticlePubMedGoogle Scholar
- Thompson JD, Higgins DG, Gibson TJ: CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res 1994, 22(22):4673–4680. 10.1093/nar/22.22.4673PubMed CentralView ArticlePubMedGoogle Scholar
- Wistrand M, Sonnhammer E: Improved profile HMM performance by assessment of critical algorithmic in SAM and HMMER. BMC Bioinformatics 2005, 6: 99–109. 10.1186/1471-2105-6-99PubMed CentralView ArticlePubMedGoogle Scholar
- Bourne P, Weissig H: Structural Bioinformatics. Sinauer Associates; 2003.View ArticleGoogle Scholar
- Jones S, Bateman A: The use of structure information to increase alignment accuracy does not aid homologue detection with profiles HMMs. Bioinformatics 2002, 18: 1243–1249. 10.1093/bioinformatics/18.9.1243View ArticleGoogle Scholar
- Mizuguchi K, Deane C, Blundell T, Overington J: HOMSTRAD: a database of protein structure alignments for homologous families. Protein Science 1998, 7: 2469–2471.PubMed CentralView ArticlePubMedGoogle Scholar
- Notredame C, Higgins D, Heringa J: T-coffee: a novel method for fast and accurate multiple sequence alignment. Computer Applications in the Biosciences 2000, 302: 205–217.Google Scholar
- Hmmer-struct BiowebDB[http://wiki.biowebdb.org/index.php/Hmmer-struct]
- Katoh K: MAFFT version 5: improvement in accuracy of multiple sequence alignment. Nucleic Acids Research 2005, 33: 511–518. 10.1093/nar/gki198
- Do C, Mahabhashyam M, Brudno M, Batzoglou S: ProbCons: Probabilistic consistency-based multiple sequence alignment. Genome Research 2005, 15: 330–340. 10.1101/gr.2821705
- Nuin P, Wang Z, Tillier E: The accuracy of several multiple sequence alignment programs for proteins. BMC Bioinformatics 2006, 7: 1–18. 10.1186/1471-2105-7-471
- Sullivan O, Suhre K, Abergel C, Higgins D, Notredame C: 3DCoffee: combining protein sequences and structures within multiple sequence alignments. Journal of Molecular Biology 2004, 340: 385–395. 10.1016/j.jmb.2004.04.058
- Attwood T, Bradley P, Flower D, Gaulton A, Maudling N, Mitchell A: A new progressive-iterative algorithm for multiple structure alignment. Bioinformatics 2005, 21: 3255–3263. 10.1093/bioinformatics/bti527
- Feng D, Doolittle R: Progressive sequence alignment as a prerequisite to correct phylogenetic trees. Journal of Molecular Evolution 1987, 25: 351–360. 10.1007/BF02603120
- Taylor W, Flores T, Orengo A: Multiple protein structure alignment. Protein Science 1994, 3: 1858–1870.
- Shi J, Blundell T, Mizuguchi K: FUGUE: sequence-structure homology recognition using environment-specific substitution tables and structure-dependent gap penalties. Journal of Molecular Biology 2001, 310: 243–257. 10.1006/jmbi.2001.4762
- Haft D, Selengut J, White O: The TIGRFAMs database of protein families. Nucleic Acids Research 2003, 31: 371–373. 10.1093/nar/gkg128
- Letunic I, Copley R, Schmidt S, Ciccarelli F, Doerks T, Schultz J, Ponting C, Bork P: SMART 4.0: towards genomic data integration. Nucleic Acids Research 2004, 32: 142–144. 10.1093/nar/gkh088
- Karchin R, Cline M, Gutfreund YM, Karplus K: Hidden Markov models that use predicted local structure for fold recognition: alphabets of backbone geometry. Proteins 2003, 51: 504–514. 10.1002/prot.10369
- Karplus K, Karchin R, Shackelford G, Hughey R: Calibrating E-values for hidden Markov models with reverse-sequence null models. Bioinformatics 2005, 6: 305–316.
- Helen M, Westbrook J, Feng Z, Gilliland G, Bhat T, Weissig H, Shindyalov I, Bourne P: The Protein Data Bank. Nucleic Acids Research 2000, 28: 235–242. 10.1093/nar/28.1.235
- Espadaler J: Detecting remote related proteins by their interactions and sequence similarity. PNAS 2005, 102: 7151–7156. 10.1073/pnas.0500831102
- Söding J: Protein Homology detection by HMM-HMM comparison. Bioinformatics 2005, 21: 951–960. 10.1093/bioinformatics/bti125
- Alexandrov V, Gerstein M: Using 3D Hidden Markov Models that explicitly represent spatial coordinates to model and compare protein structures. BMC Bioinformatics 2004, 5: 1–10. 10.1186/1471-2105-5-2
- Hou Y, Hsu W, Lee M, Bystroff C: Remote homology detection using local sequence-structure correlations. PROTEINS: Structure, Function and Bioinformatics 2004, 57: 518–530. 10.1002/prot.20221
- Mitchell T: Machine Learning. McGraw-Hill; 1997.
- Beck JR, Shultz EK: The use of relative operating characteristic (ROC) curves in test performance evaluation. Arch Pathol Lab Med 1986, 110(1):13–20.
- Qian B, Goldstein R: Performance of an iterated T-HMM for homology detection. Bioinformatics 2004, 20: 2175–2180. 10.1093/bioinformatics/bth181
- Bystroff C, Baker D: HMMSTR: A hidden Markov model for local sequence-structure correlation in proteins. Journal of Molecular Biology 2000, 301: 173–190. 10.1006/jmbi.2000.3837
- Wistrand M, Sonnhammer E: Improving Profile HMM Discrimination by Adapting Transition Probabilities. Journal of Molecular Biology 2004, 338: 847–854. 10.1016/j.jmb.2004.03.023
- Goyon F, Tufféry P: SA-Search: A web tool for protein structure mining based on structural alphabet. Nucleic Acids Research 2004, 32: 545–548. 10.1093/nar/gkh467
- Hou Y, Hsu W, Lee M, Bystroff C: Remote homolog detection using local sequence-structure correlations. Journal of Molecular Biology 2004, 340: 385–395. 10.1016/j.jmb.2004.04.058
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.