Application of protein structure alignments to iterated hidden Markov model protocols for structure prediction
© Scheeff and Bourne. 2006
Received: 18 April 2006
Accepted: 14 September 2006
Published: 14 September 2006
One of the most powerful methods for the prediction of protein structure from sequence information alone is the iterative construction of profile-type models. Because profiles are built from sequence alignments, the sequences included in the alignment and the method used to align them will be important to the sensitivity of the resulting profile. The inclusion of highly diverse sequences will presumably produce a more powerful profile, but distantly related sequences can be difficult to align accurately using only sequence information. Therefore, it would be expected that the use of protein structure alignments to improve the selection and alignment of diverse sequence homologs might yield improved profiles. However, the actual utility of such an approach has remained unclear.
We explored several iterative protocols for the generation of profile hidden Markov models. These protocols were tailored to allow the inclusion of protein structure alignments in the process, and were used for large-scale creation and benchmarking of structure alignment-enhanced models. We found that models using structure alignments did not provide an overall improvement over sequence-only models for superfamily-level structure predictions. However, the results also revealed that the structure alignment-enhanced models were complementary to the sequence-only models, particularly at the edge of the "twilight zone". When the two sets of models were combined, they provided improved results over sequence-only models alone. In addition, we found that the beneficial effects of the structure alignment-enhanced models could not be realized if the structure-based alignments were replaced with sequence-based alignments. Our experiments with different iterative protocols for sequence-only models also suggested that simple protocol modifications were unable to yield equivalent improvements to those provided by the structure alignment-enhanced models. Finally, we found that models using structure alignments provided fold-level structure assignments that were superior to those produced by sequence-only models.
When attempting to predict the structure of remote homologs, we advocate a combined approach in which both traditional models and models incorporating structure alignments are used.
The current stream of genome sequence data has led to a bottleneck between DNA sequencing and elucidation of protein function. Therefore, considerable effort has been expended to develop computational methods to suggest functions for putative genes. In particular, many researchers have focused on attempting to predict the structure of unknown proteins using only their sequences [3, 4]. The structure of a protein provides some of the richest information about its possible functions, as well as hints as to important residues and the location of functional sites [5–7].
Homology-based methods for structure prediction rely on the observation that proteins that share a common ancestor will usually have similar sequences, and that, in turn, similar protein sequences produce proteins with similar folds and (often) similar functions. However, a given protein structure can be formed by a highly diverse array of possible sequences, and over long evolutionary time scales protein sequence divergence can be extensive [9, 10]. Proteins that have similar structures may display such a paucity of sequence similarity that detection via current sequence homology search methodologies is not possible [11, 12].
Profile-type methods address this problem by incorporating the family-specific information inherent in multiple sequence alignments into sequence searches [13–15], and they can be generated automatically based on alignments built in an iterative fashion [16–18]. Hidden Markov models (HMMs) have been adapted for use as a particularly powerful profiling method [11, 19, 20], and have therefore been termed "profile hidden Markov models". As an alternative to iteration, profile HMMs can also be built based on carefully constructed seed alignments. Profile HMMs built using either method can also be used as a panel of models, against which unknown sequences may be tested for similarity to a known family or superfamily [21, 22].
Because profile HMMs model the information present in a sequence alignment, they are affected by the quality of the input alignment, but the accuracy of alignments based only on sequence can be limited in cases of distant homology [23–25]. However, sequences at this level of similarity are the most informative sequences with which to build a model, as they will clearly demonstrate which sections of the family are well conserved, and the exact nature of the conservation.
Given the limitations inherent in sequence alignments as inputs for profile HMMs, researchers have explored the use of sequence alignments derived directly from structural alignments of proteins, particularly in cases where superfamily-level assignments are desired [22, 26–31]. Structural alignments eliminate many of the problems with standard seed alignments. First, they can provide a highly diverse set of sequences from a variety of superfamily members, with more diversity than might be found even with an extensive iterative search. Second, because the structures are known, structure alignments can be used to provide an accurate alignment of the sequences.
One limitation in using structure alignments for profile HMMs is the weak level of structure representation in many superfamilies. In addition, coverage of a given superfamily is often strong in one area and weak in others. Hence, models made from alignments of these sequences, even if supplemented with additional homologous sequence, may not optimally describe the entire superfamily in question. It has also been suggested that there is an optimal range of sequence similarity for training of profile HMMs, and that using a very broad range of diverse sequence can lead to "profile dilution", reducing model quality [29, 32].
Accordingly, the literature record for the use of structure alignments for profile HMMs yields mixed results. Several researchers, using a variety of experimental arrangements, have reported that HMMs based on structure alignments do not provide a benefit over standard HMMs. Gough et al. and Sillitoe et al. have suggested that pools of models, each built iteratively from a single structural representative (referred to here as the "master" sequence), provide better performance at the superfamily level than models incorporating structure alignments of multiple masters [22, 30]. Sillitoe et al. also reported that combining the two types of models yielded essentially no benefit. Griffiths-Jones and Bateman compared HMMs built from seed alignments based on structure against HMMs built from seed alignments based only on sequence information, and concluded that there was no benefit to structure alignments for the production of profile HMMs. However, their analysis was done for family-level sequence targets, not the superfamily-level targets that represent a more difficult challenge for homology search. In addition, they did not assess the value of structural alignments to the augmentation of iterative alignment procedures, as we do here.
Other researchers have provided qualified support for the use of structure alignments in the construction of profiles (usually using PSI-BLAST [16, 17] instead of HMMs). Panchenko and Bryant reported a small improvement in structure prediction accuracy when profile seed alignments were created using structure-based, rather than sequence-based, alignments. However, though their method used PSI-BLAST, their protocol did not include any iteration. Kelley et al. used structure alignments to augment iterative profile construction, and reported improved results when attempting to predict very distant homologous relationships. However, some of the improvement reported was based on the combination of the structure alignments with additional structure-derived information (secondary structure and solvent accessibility). Other groups have tested structure alignments when applied to newer profile-to-profile approaches, as opposed to the usual profile-to-sequence approach. Tang et al. explored the inclusion of structure alignments in the construction of profiles in a profile-to-profile approach based on PSI-BLAST. Similar to the findings of Kelley et al., they found that profiles utilizing structure alignments provided some improvement over standard models, and that this increase in performance could be furthered through the addition of other structural data into the models. Casbon and Saqi tested a profile-to-profile approach relying on a hybrid PSI-BLAST/HMM protocol. They found that the structure-alignment models had similar performance to the standard models at a low error rate, and weaker performance at a higher error rate. However, they noted that in some superfamilies the structure-alignment-based models had a clear advantage over the standard models, and vice versa. This suggests that the two types of models might be complementary if used in combined searches.
We undertook a large-scale assessment of the utility of structure alignments for the generation of profile HMMs, using the traditional sequence-to-profile method. First, we tested several iterative protocols to determine which method generated the most sensitive profile HMMs with HMMER, using only sequence information. The aim was the generation of models that represent the sequence space around a single structural domain representative (the "single-master" models). Next, we developed a protocol for the production of merged sequence alignments, built by combining the sequence alignments from several domains together based on a multiple structure alignment. The method is designed such that the best aspects of iterative sequence search and structural seed alignments are combined. A maximum amount of sequence information is gathered for each superfamily through iterative search, but a structural seed alignment is used to combine the information accurately. Finally, the combined superfamily alignments were used to train another profile HMM with HMMER. We term such models based on combined alignments "structure-linked alignment hidden Markov models", or SLAHMMs (pronounced "slams"). We show that SLAHMMs provide an improvement upon sequence-only models built through iteration, when they are used together in a combined search. Our study supports the notion that structural information, in the form of structure-based alignments, provides a useful enhancement to standard profile HMM models.
In order to determine the best iterative methodology for building both single-master HMMs and SLAHMMs with HMMER, performance was compared for four different parameter sets (PS1-PS4, see methods for details). In all cases, an alignment was built through repetitive HMM searches against a sequence database (with re-training of the HMM after each cycle), but the parameter sets tested different cutoffs for sequence inclusion into the growing alignment, use of heuristics to improve the alignment, and the method for aligning sequences to the model to create the alignment. Test sequences (or "probes") were then searched against the resulting models and the correct or incorrect structure assignments recorded. The superfamily assignments provided in SCOP [33, 34] were used as a standard of truth for purposes of benchmarking the methods (with modifications in a few cases, see methods).
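The benchmarking just described (pool all probe-vs-model scores, sort by E-value, and track correct assignments against errors per query, or EPQ) can be sketched as follows. The data layout and function name are illustrative, not taken from the original software:

```python
def coverage_vs_epq(hits, true_sf, n_probes):
    """Build a coverage vs. errors-per-query (EPQ) curve.

    hits: iterable of (probe_id, predicted_superfamily, e_value);
    true_sf: mapping of probe_id -> correct SCOP superfamily;
    n_probes: total number of probe sequences searched.
    """
    curve = []
    errors = 0
    assigned = set()                       # count each probe at most once
    for probe, sf, _e in sorted(hits, key=lambda h: h[2]):
        if sf == true_sf[probe]:
            assigned.add(probe)            # correct superfamily assignment
        else:
            errors += 1                    # incorrect assignment accumulates
        curve.append((errors / n_probes, len(assigned)))
    return curve
```

Walking the sorted hit list from the most to the least confident score reproduces the coverage-versus-error curves used throughout the figures: a cutoff at a given EPQ simply truncates this curve.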
Creation of SLAHMMs based on the same sequence alignments (combined based on structure alignments, see methods) also resulted in relatively similar performance between the various iterative parameters (Figure 1). However, in this case there was a larger variability in performance, suggesting that SLAHMMs are more sensitive to the methodology used to create the sequence alignment inputs. The parameter sets that repetitively re-aligned all sequences to the model (PS2 and PS4) performed better than those that did not (PS1 and PS3). This result suggests that sequence alignment drift does not pose a severe threat to the quality of SLAHMMs, and that re-aligning all sequences to the model may help to sharpen sequence patterns in a way that improves the resulting HMM.
However, the results also suggest that the use of the heuristics (to select sequences for addition to the model) applied in PS1 and PS2 had a beneficial effect on the quality of the models. Both PS1 and PS2 outperformed counterparts that did not use the heuristics. In addition to the small performance improvement, the heuristics provided the practical benefit of smaller alignment sizes (because they are more selective with respect to added sequences), and therefore lower computational overhead.
It was possible to use both sets of models together in a combined search, simply by placing both in a single database and searching the probes against this database. When performance of the combined models was compared for each parameter set, the difference again narrowed, much as it did for the single-master HMMs (Figure 1). Although the coverage vs. error curves were similar, PS1 was alone in identifying additional homologs at an EPQ of 1. The practical benefits of the heuristics used in PS1 and PS2, coupled with the slight performance benefits seen in the SLAHMM and combined tests, argued for the use of these heuristics in the generation of future models. Because of these factors, we chose to use PS1 for the remainder of the analysis.
Probes correctly assigned with SLAHMMs that were not assigned with single-master HMMs, using a strict cutoff of 80 incorrect assignments (theoretical EPQ ~0.05). Columns: Probe SCOP ID; ND1: not detected by single-master HMMs even at a looser cutoff (EPQ = 1); ND2: not detected by single-master HMMs given a full nr60 search (EPQ ~0.05); ND3: not detected by single-master HMMs given a full nr60 search (EPQ = 1).
The unique assignments made by SLAHMMs occur at high E-values, near the threshold at which errors start to be made. Indeed, only 10 unique assignments (~24%) were made prior to the first incorrect assignment. The rest were made after errors had been recorded. In contrast, of the unique assignments made by the single-master HMMs, 91 (~41%) were made prior to the first recorded error. This suggests that, as expected, SLAHMMs are very generalized models that capture more difficult assignments made right at the edge of "the twilight zone", but can miss easier matches.
If the single master HMMs are allowed a looser cutoff of up to 1575 errors (a theoretical EPQ of 1), they will correctly assign 31 of the 42 sequences that SLAHMMs capture at a cutoff of 80 (Table 1). Thus, part of the benefit provided by SLAHMMs is produced by simply improving the scores of probes that could be assigned, albeit with much lower confidence, using single-master HMMs. However, 11 structures are still not correctly assigned by single-master HMMs, even at this less-stringent cutoff (Table 1, column ND1). Of course, if SLAHMMs are also allowed a looser cutoff of 1575 errors, they will correctly assign additional structures that single-master HMMs miss at the same threshold. Indeed, SLAHMMs correctly assign 35 structures missed by single-master HMMs if both are held to a threshold of an EPQ of 1 (single-master HMMs make 198 unique hits at this cutoff).
The differing behaviors of the two types of models suggested that they would behave synergistically when combined, with single-master HMMs capturing targets that were easier, and SLAHMMs capturing targets that were more challenging. When the combined models were tested, the SLAHMMs were able to provide increased assignment performance, improving upon the capability of single-master HMMs (Figure 2). The combined models made 35 additional correct assignments at an EPQ of ~0.05, an improvement of ~3.3% (Figure 3). This pattern held throughout the sampled cutoffs, with 31 additional correct matches at an EPQ of 1.
Although the improvement provided by the addition of SLAHMMs appears modest, the structure of the experiment was likely to make the degree of improvement appear substantially smaller than actually realized in cases of very distant sequence homology. Our test set of probes was filtered such that all were non-trivial to recognize with BLAST (see methods), but this does not mean that all of the probe assignments will be challenging ones, given the power of profile methodology. We did not attempt to remove these easier-to-assign domains, because this may reduce the overall coverage of sequence space by the resulting model pool. Therefore, it is expected that there will be a large "floor" of domains that are relatively easy to assign, which is consistent with the large degree of overlap between the results for single-master HMMs and SLAHMMs.
Interestingly, the single-master HMMs are able to make 20 assignments not made by the combined models at an EPQ of ~0.05 (Figure 3), reducing the net improvement to only 15 additional assignments (~1.4%). These assignments are missed because additional errors are also incurred by having both sets of models present, as a result of the added "noise". Such noise would be reduced in real-world use of SLAHMMs. The requirement for a set of test models with a sequence (and its accumulated hits) removed meant that a SLAHMM had to be made for each probe, yielding 1,575 models (see methods). These additional models doubled the size of the overall model pool, increasing the opportunity for random hits, especially at high E-values, where SLAHMMs seemed to provide the most benefit. However, in real-world use, only one model would be required per superfamily (242 models using the test set for this experiment), which would only produce a small relative increase in the number of models in the model pool.
We sought to test the notion that structural alignments were truly essential for the production of SLAHMMs. It was possible that SLAHMMs simply benefited from a wider sampling of sequence based on structural information (because they combined the HMM search results of several distantly related sequences) but did not actually require the explicit incorporation of structural information in the form of a structure-based alignment. Further, it has been suggested that highly accurate alignments are not essential to the production of useful profile HMMs.
To see if structural alignments were required, we aligned our test protein domain sequences with ClustalW. These sequence-based alignments were then used to build a unified superfamily alignment and SLAHMMs, in an otherwise identical fashion to the standard method. The results demonstrate that the benefits of SLAHMMs cannot be realized without the use of structural alignments (Figure 2). SLAHMMs built using a ClustalW alignment substantially underperform those built with a genuine structural alignment. Similarly, the ClustalW SLAHMMs do not provide any net improvement when combined with standard sequence-only models. At least in the case of SLAHMMs, our results indicate that highly accurate alignments are essential to the production of quality profile HMMs.
To make the iterative HMM searches computationally tractable during initial model building, BLAST was used to pre-filter a large pool of possible homologs from the sequence database for each SCOP master domain (see methods). Though it provides practical benefits, the BLAST pre-filter could also limit the possible sequence space available for training each single-master HMM. Thus, it could be argued that the primary benefit of SLAHMMs might be primarily practical rather than theoretical: they allow the HMM to sample a broader range of sequence space, without the computationally intensive requirement of iterating directly against the entire sequence database.
To help determine the practical vs. theoretical benefits of SLAHMMs, the 42 probes that SLAHMMs uniquely assigned at an EPQ ~0.05 (provided in Table 1) were used as the basis for a smaller-scale test of single-master HMMs vs. SLAHMMs. The superfamilies to which these probes belong were separated, and the single-master HMMs representing them rebuilt. The protocol used (PS1) was identical to that used in the initial experiments, except that in the last iteration, the model was searched against the entire, unfiltered, sequence database ("nr60", see methods), allowing it to potentially discover new homologs previously excluded by the BLAST pre-filter. The new models were then added as a supplement to the model database.
The results of this test suggest that the majority of the benefit of SLAHMMs is theoretical, though there is a substantial practical benefit as well. When given the advantage of a full nr60 search, the single-master HMMs could correctly assign 18 of the probes, but still were unable to assign the remaining 24 (Table 1, column ND2). When given the dual advantages of a full nr60 search and accumulation of additional errors out to a theoretical EPQ of 1, the single-master HMMs could still not assign 5 probes correctly assigned by SLAHMMs without these advantages (Table 1, column ND3).
The behavior of SLAHMMs suggested that they might show the strongest performance at the very edge of detectable sequence similarity. Therefore, we assessed their performance in assignment of correct fold-level SCOP similarity. These are cases where the SCOP authors have detected an overall similarity between structures, but there is no current basis to presume an evolutionary relationship.
Our study indicates that structural alignments can be used to improve the results of searches based on profile HMMs. This improvement occurs for the prediction of both superfamily- and fold-level relationships. In the case of superfamily-level assignments, SLAHMMs underperformed standard iterated HMMs, but provided a modest improvement in overall performance when used in a combined search, relative to standard HMMs alone. In the case of fold-level assignments, SLAHMMs substantially outperformed standard HMMs. Our experiments with modification of iterative parameters demonstrated that modest changes in the way HMMs are constructed have almost no effect on the performance of the final models. This result suggests that simple adjustments in the construction of sequence-only HMMs will be insufficient to replace the role of SLAHMMs. Further, our work indicates that accurate, structure-based alignments are essential for the production of high-quality SLAHMMs.
It has been suggested that the use of superfamily-level structure alignments will lead to "profile dilution", where the additional sequences in the alignment actually reduce the information content, and lead to weaker models [29, 32]. However, the success of SLAHMMs in improving the combined search results indicates that, in some superfamilies, there are underlying sequence patterns that are sufficient for the construction of effective models.
This notion is further supported by the strong results of SLAHMMs when tested against fold-level targets from SCOP. Sequences with similarities at the fold level usually do not share any functional motifs and are not presumed to have a homologous relationship, so the sequence match must be made solely based on residue patterns inherent in the production of similar structures. Improvement in HMMs for fold-level structure assignment has been reported through the inclusion of predicted local structure into multi-track HMMs. Our work indicates that SLAHMMs may provide another avenue for the improvement of HMMs for fold recognition.
The computational overhead for the inclusion of SLAHMM-type models into sequence annotation studies is modest, though non-trivial. Multiple structure alignments must be created for the structure representatives in each superfamily. In real-world conditions, only one SLAHMM per superfamily would be required (because structures would not have to be removed for a leave-one-out test, see methods), making the additional HMM training requirements minimal. Since the alignments used to build SLAHMMs are derived from sequence alignments already available from the generation of standard, single-master HMMs, they can be conveniently created as an adjunct set of models to improve overall performance.
The primary limitation of SLAHMMs is that they require at least two structural representatives per superfamily to be modeled, and some superfamilies still lack even a single structural representative. However, with the gradual growth in structural data likely to be generated from the various structural genomics projects [38, 39], multiple structure representatives should become available for many more superfamilies. This new structural data will allow SLAHMMs to be used to help provide structural assignments for an increasing percentage of the growing body of genomic sequence.
The SCOP structural classification database [33, 34] has become the most common standard used for the benchmarking of protein structure prediction methods [3, 11, 28, 40] and was used in this work. Domains were retrieved from the ASTRAL database (version 1.61 [41, 42]). The available ASTRAL pre-filtering was used to collect a set of domains in which no domain could be aligned to another with a BLAST E-value < 10^-3. The intent of the chosen E-value cutoff was to provide a balance between broad coverage of the resulting models (any sequence relationship with a BLAST E-value < 10^-3 can essentially be considered trivial to detect) and challenging potential assignments with which to test the methods. The chains included the ASTRAL "genetic domains", and were filtered to be at least 80 residues in length. Domains from only the first five classes in the SCOP hierarchy were used, as these represent the "typical" globular proteins to which this work aims to successfully assign structures.
These domains were used both for model building and benchmarking. The filtered ASTRAL set from above was additionally filtered to only retain superfamilies with three or more members, so that models could be constructed based on the structure alignment of two sequences, while still retaining a sequence to test the resulting model (leave-one-out test).
The selected SCOP/ASTRAL superfamily representatives are distant homologs, and it is difficult to obtain accurate alignments of them using sequence information alone [23–25]. Therefore, a multiple structure alignment was generated for each superfamily using a variant of the CE software designed to create multiple alignments. Multiple alignments were built through a progressive approach, using the CE Z-score to generate a guide tree via the UPGMA method. To reduce computational overhead, a single alignment was generated for each superfamily containing all of the domains in the test set for that superfamily. Construction of a single superfamily alignment meant that domains would have to be removed from the structure alignment during benchmarking, as will be described later.
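The guide-tree step can be illustrated with a minimal UPGMA implementation. Converting CE Z-scores (similarities) to distances via d = z_max − z is an assumption made here for illustration, as is the nested-tuple tree representation:

```python
import itertools

def zscores_to_distances(z):
    """Convert pairwise CE Z-scores (similarities) to distances.
    d = z_max - z is one simple, assumed conversion (not from the paper)."""
    z_max = max(z.values())
    return {pair: z_max - v for pair, v in z.items()}

def upgma(dist, labels):
    """Minimal UPGMA: dist maps frozenset({a, b}) -> distance.
    Returns the guide tree as nested tuples."""
    sizes = {label: 1 for label in labels}   # current cluster -> member count
    d = dict(dist)
    while len(sizes) > 1:
        # pick the closest pair of current clusters
        a, b = min(itertools.combinations(sizes, 2),
                   key=lambda p: d[frozenset(p)])
        merged = (a, b)
        na, nb = sizes.pop(a), sizes.pop(b)
        # size-weighted average distance from the merged cluster to the rest
        for c in sizes:
            d[frozenset((merged, c))] = (na * d[frozenset((a, c))] +
                                         nb * d[frozenset((b, c))]) / (na + nb)
        sizes[merged] = na + nb
    return next(iter(sizes))
```

A progressive structure aligner would then walk this tree, aligning the most similar pair of domains first and merging alignments upward toward the root.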
17 structures and one superfamily were removed from the set because of problems caused by inconsistencies between the ASTRAL sequence records and their corresponding PDB files. Also, our CE variant was still in early development, and failed to align 14 superfamilies; these were also removed from the test set. These removals reduced the set of initial structures from 257 superfamilies/1995 chains to 242 superfamilies/1575 chains. However, all five classes of SCOP were still well represented. A breakdown of the representation of SCOP fold classes in this sequence set is available in Additional file 1 of the supplementary material.
The Genbank non-redundant protein sequence database ("nrprot") was utilized for the collection of sequences homologous to the SCOP domains. Two releases were used: an older release (downloaded 10/18/02) for early development and testing of the method (results shown in Figure 1) and a current release (downloaded 6/9/06) for all final results (Figures 2, 3, 4). A comparison of the results using these two databases is provided in Figure 5. For purposes of homology searches, purging databases of similar sequences does not reduce search performance, and in some circumstances may improve performance. The nr databases used in this work were filtered down to a level of redundancy such that no sequence aligned to any other with greater than 60% identity (nr60), using the program cd-hit [47, 48].
The iterative protocol presented here used an initial search with BLAST to capture a set of homologs, which were then aligned with ClustalW to create a seed alignment. The BLAST results, which provide local alignments, were also used to provide the subsection of the sequence homolog that could be aligned to the SCOP domain of interest. Sequences were selected from the BLAST results for inclusion in the ClustalW alignment based on a set of filtration criteria. First, the sequence was required to match the initial SCOP domain with an E-value < 0.001 and sequence identity of >40%. The first measure was used to ensure the hit was likely to be a true homolog; the second was to be sure it could be aligned with high confidence [23, 24]. Second, the sequence hit was required to incorporate at least 75% of the SCOP domain sequence in its BLAST alignment. The domains in SCOP generally represent structures that are shared throughout a superfamily, so sequences homologous to a SCOP domain should be able to align to most of it. It was confirmed empirically that aligning all sequences without regard to domain coverage resulted in profile HMMs shortened such that conserved regions at the ends of the SCOP domains were not assigned match states. The coverage requirement does not disallow large insertions in the sequence hit relative to the domain.
The final filtration criterion was that the sequence fragments collected from BLAST could not be more than 90% identical to each other. Because of the domain structure of proteins, one domain shared between two proteins can have very high sequence identity, while other domains can still have very low sequence identity. The global filtering done by cd-hit will still retain both of these proteins in nr60, but the domain that is homologous to the SCOP domain could still be highly redundant in the two proteins. Therefore, the sequence fragments collected from BLAST were filtered using the nrdb90.pl program.
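The three hit-selection criteria and the 90% fragment-redundancy filter can be sketched as follows. Function names are hypothetical; the thresholds are those given in the text, and the identity measure is a deliberately crude stand-in for the pairwise comparison done by nrdb90.pl:

```python
def keep_blast_hit(e_value, pct_identity, aligned_domain_cols, domain_len):
    """Apply the three filtration criteria to one BLAST hit."""
    if not e_value < 0.001:            # likely a true homolog
        return False
    if not pct_identity > 40.0:        # confidently alignable
        return False
    if aligned_domain_cols / domain_len < 0.75:   # covers >=75% of the SCOP domain
        return False
    return True

def crude_identity(a, b):
    """Ungapped identity over the shorter fragment; the real protocol
    used nrdb90.pl, which performs a proper pairwise comparison."""
    n = min(len(a), len(b))
    return 100.0 * sum(x == y for x, y in zip(a, b)) / n if n else 0.0

def dedupe_fragments(fragments, cutoff=90.0):
    """Greedily drop fragments that are >=90% identical to a kept one."""
    kept = []
    for frag in fragments:
        if all(crude_identity(frag, k) < cutoff for k in kept):
            kept.append(frag)
    return kept
```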
Each SCOP domain and its homologs were aligned with ClustalW to generate a representative seed alignment of the sequence family. The creation of a seed alignment in this fashion allowed for insertions seen in homologous sequences, but not the SCOP domain, to still be incorporated into the initial model.
The BLAST search results were also used to provide a set of possible homologs for subsequent HMM iteration (described below). All BLAST hits to each SCOP domain with an E-value of < 500 were stored in a miniature database as possible homologs to that domain. HMM searches against the entire nr60 database were prohibitively slow, particularly for purposes of iteration, so pre-filtering the database was required. This technique was first described for the SAM-T98 protocol.
The HMMER package was used in this work. Release 2.2g was used for early development of the method (results in Figure 1) and the most recent release (2.3.2) was used for all final results (Figures 2, 3, 4). A comparison of the results for these two versions of HMMER (as well as the different versions of the nr database) is provided in Figure 5. HMMER was run with the default settings, except that 10,000 (rather than the default 5,000) random sequences were used to calibrate the final HMM with hmmcalibrate; this provided additional scoring accuracy. Reported E-values were based on the default theoretical background database size in HMMER (59,021 sequences, the size of a version of Swissprot). We found that this setting provided intuitive E-values (i.e. assignments with E > 1 were usually incorrect). The bioperl toolkit was used to assist in the collection of outputs from HMMER.
The ClustalW-created sequence alignment was used to train an initial HMM with HMMER. This model was then used to search the pre-filtered sequence database (matching the SCOP domain) for additional homologous sequences. This model was built as a global/local searching model, meaning that it would attempt to capture sequences that span the length of the model. In the first iteration, all of the collected sequences were aligned back to the model, including the SCOP domain. The domains collected from BLAST can be somewhat shorter than ideal for representing the sequence space around the SCOP domain (even with filtration for coverage), and the aim of this alignment step was to lengthen the model such that it encompassed all conserved sections of the SCOP domain.
After the first iteration was complete, the alignment produced was again used to build and calibrate another HMM using HMMER. The pre-filtered database was again searched and the sequences scoring below a set E-value threshold (determined by the parameter set as discussed below) were collected and aligned back to the model. The model was then re-built using this new alignment, and another iterative cycle initiated, etc.
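The iterative cycle described above can be summarized as a driver loop. In the sketch below, `build`, `search`, and `align` are hypothetical callables standing in for wrappers around hmmbuild/hmmcalibrate, hmmsearch over the pre-filtered mini-database, and hmmalign; the E-value schedule shown is the PS1 progression from Table 2. Following the PS1 protocol, only newly collected sequences are aligned to the model at each cycle, while the prior alignment is retained unchanged.

```python
# PS1 E-value cutoff progression (Table 2), one cutoff per cycle.
PS1_EVALUE = [1e-25, 1e-10, 1e-5, 1e-3, 1e-2, 1e-1]

def iterate_model(seed_aln, minidb, build, search, align,
                  thresholds=PS1_EVALUE):
    """Skeleton of the PS1-style iteration: build an HMM from the
    current alignment, search the pre-filtered database, and align
    only newly collected sequences back to the model."""
    aln = seed_aln
    seen = set(seed_aln)
    for cutoff in thresholds:
        model = build(aln)
        hits = [s for s, e in search(model, minidb) if e <= cutoff]
        new = [s for s in hits if s not in seen]
        aln = aln + align(model, new)   # prior alignment kept unchanged
        seen.update(new)
    return build(aln), aln
```

With mock callables, sequences enter the alignment only once their E-value clears the (progressively relaxed) cutoff, mirroring the staged admission of more distant homologs.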
Table 2. Parameters used in tests of different protocols for the iterative generation of HMMs. For each run parameter set (PS), the table gives the E-value cutoff progression (≤), the % coverage cutoff progression (>), the R_im cutoff progression (<), and whether all sequences were realigned to the model after each iteration.

PS1: E-value cutoffs 10^-25, 10^-10, 10^-5, 0.001, 0.01, 0.1; % coverage cutoffs 70, 70, 70, 50, 50, 50; R_im cutoffs 10, 3, 3, 3, 3, 3; realign all sequences after each iteration: no
PS2: E-value cutoffs 10^-25, 10^-10, 10^-5, 0.001, 0.01, 0.1; % coverage cutoffs 70, 70, 70, 50, 50, 50; R_im cutoffs 10, 3, 3, 3, 3, 3; realign all sequences after each iteration: yes
PS3: E-value cutoffs 10^-6, 10^-5, 10^-4, 0.001, 0.01, 0.1; % coverage cutoff 10 for all iterations; R_im cutoff 20 for all iterations; realign all sequences after each iteration: no
PS4: E-value cutoffs 10^-6, 10^-5, 10^-4, 0.001, 0.01, 0.1; % coverage cutoff 10 for all iterations; R_im cutoff 20 for all iterations; realign all sequences after each iteration: yes
From the second iteration onward in PS1, only the newly collected sequences were freshly aligned to the model, while the prior alignment was retained unchanged. This arrangement allowed sequences at each level of similarity to align to a model most suited to their relationship to the SCOP domain. It also guarded against iteration and alignment drift by anchoring both the alignment and the model to the initial SCOP domain.
The iteration was run for a total of six cycles. With each cycle, the E-value threshold was raised to allow more distant sequences into the alignment as detailed in Table 2.
In addition to the E-value criterion, sequences were filtered prior to addition to the alignment at each iteration based on heuristics that aimed to improve the quality and relevance of the alignments. First, sequences were filtered based on their coverage of the HMM, calculated as the percentage of the model's match states aligned to by the sequence. It was determined empirically that a reasonable cutoff level was 70% model coverage in the first three iterations and 50% model coverage in the last three iterations. Sequences that aligned to the model below these cutoffs were almost always short fragments or poor alignments.
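The coverage calculation can be sketched as follows, assuming the sequence is given as a row of the model alignment (with `-` for unoccupied positions) together with the indices of the columns that correspond to model match states; these representations are illustrative, not HMMER data structures.

```python
def model_coverage(aligned_seq, match_columns):
    """Percentage of the model's match states to which the sequence
    aligns a residue (i.e. a non-gap character)."""
    occupied = sum(1 for i in match_columns if aligned_seq[i] != '-')
    return 100.0 * occupied / len(match_columns)
```

A sequence would then be retained only if its coverage exceeded 70% (iterations 1 to 3) or 50% (iterations 4 to 6).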
In addition to the filtration for coverage, a heuristic was applied to screen out poor alignments. The measure used was the ratio:

R_im = i_s / m_s

where i_s is the number of insertion states in the model caused by the sequence and m_s is the number of match states the sequence aligned to. The point of this measure was to detect poor alignments in which a sequence matched only a few highly conserved match states but otherwise aligned to insert states in the model (a high value of R_im). These sorts of alignments were sometimes seen at the higher E-value (i.e. lower statistical significance) thresholds, and were presumably not helpful to the construction of a good model. Through the use of R_im, large insertions in the model were still possible, provided the sequence also aligned to a large number of match states. This prevented the exclusion of sequences simply because they produced a large but valid insertion in the model. R_im was required to be less than 10 for the initial iteration and less than 3 for all other iterations. The cutoff was set very high in the first iteration because, at the low E-value threshold used, alignments will almost always be very good without any heuristic.
The other parameter sets were arranged to test the validity of the settings used in PS1. PS2 used the same E-value progression, % coverage cutoff progression, and R_im progression, but continually re-aligned all sequences back to the new model after every iteration. PS3 and PS4 tested a much less controlled iteration, in which the heuristic parameters were essentially turned off, as well as a more rapid E-value progression (Table 2).
As the alignments were iteratively constructed, information about each sequence was stored for later retrieval. The aim of collecting this information was to maintain a record of how well a given sequence matched to a model at the time of its alignment. By extension, this provided information as to which SCOP domain was likely to provide the best alignment partner for a given nr60 sequence. This information was required for the rational combination of the sequence alignments based on the structure alignment of their corresponding SCOP master sequences (described below).
At the end of the iterative cycle each SCOP domain had a matching sequence alignment corresponding to the sequence space surrounding it, as well as a profile HMM that provided a model of the alignment for purposes of sequence searches. At this point, the model was stored as a single-master HMM for later use in the benchmarking experiment, while the sequence alignment was used in the creation of SLAHMMs.
Structure alignments were used to generate a high-quality multiple sequence alignment of the SCOP master domains for each superfamily. The sequence alignments to each SCOP master were then grafted onto the structure alignment scaffold using the one-to-one correspondence of the SCOP master sequence in both alignments. In most cases, the SCOP master did not participate in some columns of the sequence alignment; these columns were removed prior to the merging procedure. In large superfamilies, the resulting alignment could contain thousands of sequences.
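The grafting step is essentially a column mapping through the shared master sequence. The sketch below is a simplified illustration under the assumption that both alignments are plain gapped strings: columns of the seed alignment in which the master has a residue are mapped, in order, onto the master's residue positions in the structure alignment, and columns where the master is absent are dropped (as described above).

```python
def graft(structure_row, master_in_seed, seed_rows):
    """Map seed-alignment rows onto the structure-alignment scaffold
    via the master's residue positions in both alignments.
    structure_row: the master as placed in the structure alignment;
    master_in_seed: the master as placed in its seed alignment;
    seed_rows: the other sequences in the seed alignment."""
    # seed columns in which the master has a residue (kept columns)
    seed_cols = [i for i, c in enumerate(master_in_seed) if c != '-']
    out = []
    for row in seed_rows:
        grafted, k = [], 0
        for c in structure_row:
            if c != '-':
                grafted.append(row[seed_cols[k]])  # follow the master
                k += 1
            else:
                grafted.append('-')  # master gap in structure alignment
        out.append(''.join(grafted))
    return out
```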
The superfamily alignments generated by this merging process often had redundant copies of sequences from the nr60 database. These redundancies occurred because several SCOP masters detected and incorporated the same sequences as they iteratively built their alignments. This was expected and desirable, as it meant the HMMs were reaching far into the available sequence space. The redundant sequences were removed such that only the best instance of each sequence was retained.
The best instance of a sequence in an alignment was chosen based on its relationship to the SCOP master that retrieved it during the initial iterative alignment construction. The aim was to select the instance that was likely to be in the highest quality alignment with its SCOP master. This was determined using a set of cascading tests using the data stored during the iteration runs. In order, the tests were: iteration cycle at which the sequence was added to its alignment, the E-value it matched the model with at the time of its addition, and the length of the sequence fragment (a longer length implies a better SCOP master partner for the sequence). The sequence instance which scored best in these tests was retained in the superfamily alignment, and all other instances of the same sequence were removed. Once the redundancy had been removed from the alignment, it was suitable for submission to HMMER to generate a SLAHMM representative of the entire superfamily.
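The cascading tests map naturally onto a lexicographic sort key. The record fields below (`iteration`, `evalue`, `length`) are hypothetical names for the per-sequence data stored during the iteration runs:

```python
def best_instance(instances):
    """Pick the best copy of a redundant sequence: earliest iteration
    cycle at which it was added, then lowest E-value at addition,
    then longest fragment (negated so that longer sorts first)."""
    return min(instances,
               key=lambda r: (r['iteration'], r['evalue'], -r['length']))
```

Note that an instance added in an earlier cycle wins even if a later instance matched with a better E-value, consistent with the cascading order of the tests.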
In normal use, it is expected that a single SLAHMM model would be generated and stored. However, for purposes of benchmarking it was necessary to remove a SCOP master and all of the sequences it detected from the alignment, so that that domain could be tested against the (now uncontaminated) model in a leave-one-out test. A version of the alignment for each SCOP master was generated, which was missing that master and its detected sequences. This was done prior to the redundancy removal described above, so that the maximum number of sequences collected without the use of the removed domain could be retained. The net effect of the procedure was to create alignments that were constructed essentially as if the removed SCOP master had never been present. The SLAHMMs could then be directly compared with the single-master HMMs.
When producing SLAHMMs with HMMER, difficulties were encountered in a few superfamilies. The cause appeared to be the input alignments, as these problems were not encountered with the single-master alignments. In some superfamilies, the superfamily-wide alignments were very sparse, with large sections filled mostly with gaps, and areas of conserved structure with little or no recognizable sequence conservation. This type of alignment occasionally caused severe model shortening, in which HMMER would generate a model containing fewer than 25 match states despite the input of a large alignment. In such cases, the problem was addressed by turning off the more sophisticated maximum a posteriori method for match state generation and using the simple 50% column representation option to determine model architecture in hmmbuild. The use of 25 match states as a cutoff was arbitrary, but seemed reasonable based on reviews of the alignments.
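The fallback decision can be sketched as below. The match-state count is read from the LENG line of a HMMER 2.x save file, and we assume `--fast` is the hmmbuild flag that replaces the default maximum a posteriori construction with the simple column-frequency rule (the exact flag name is our assumption; consult the HMMER 2.x documentation).

```python
def model_length(hmm_text):
    """Number of match states, from the LENG line of a HMMER 2.x
    save file given as text."""
    for line in hmm_text.splitlines():
        if line.startswith("LENG"):
            return int(line.split()[1])
    return 0

def hmmbuild_args(aln_file, hmm_file, shortened, min_states=25):
    """Command line for hmmbuild: default MAP construction, or the
    simple 50%-column heuristic (assumed to be `--fast`) when a prior
    build came out shorter than min_states match states."""
    args = ["hmmbuild"]
    if shortened:
        args.append("--fast")
    return args + [hmm_file, aln_file]
```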
HMMER crashed when attempting to build a model with hmmbuild for three alignments in both PS3 and PS4. Therefore, in order to provide PS3 and PS4 with the same number of models as PS1 and PS2, the corresponding models from PS1 and PS2 were copied over to their closest partners (PS3 and PS4, respectively) in these cases. None of the copied models made any assignments in the subsequent benchmarking studies, so they should have no notable effect on the results for PS3 and PS4. The net result of the above construction steps was 1575 single-master HMMs and 1575 SLAHMMs (built with the superfamily alignments, but each missing a different SCOP master and its corresponding alignment) for each parameter set.
During collection of the final results (Figures 2, 3, 4), HMMER also crashed while attempting to build SLAHMMs for two alignments. These alignments were left out, producing a slight but inconsequential disadvantage for SLAHMMs in that their model set only contained 1573 models.
The experiments were run by using each SCOP master sequence as a probe against all of the models of each type, as well as a combined database containing all the models. Structure assignments were scored in a "probe-centric" fashion, in which each SCOP domain could only be correctly assigned once. This was done because for each of the 1575 domains, there were at least two correct superfamily-level hits available in the case of single-master HMMs, but only one possible correct SLAHMM hit (the SLAHMM that did not consider that domain and its corresponding sequence alignment). Probe-centric scoring allowed for direct comparison of the number of correct assignments between the two methods. All incorrect assignments for a given probe were counted when determining coverage vs. error [11, 40], with the exception that each probe could only be incorrectly assigned to a given superfamily once (this was done to suppress possible artifacts resulting from the use of SCOP for benchmarking, see below). Structural assignment of a probe was considered "correct" if the probe was assigned to the correct superfamily as defined by SCOP. Cases where a probe was assigned to the correct fold grouping but not the correct superfamily were ignored (not counted as correct or incorrect).
Several adjustments were made to the experiment for the benchmarking of fold recognition. Matches of a probe to a model in a different superfamily but the same fold were counted as correct hits, and matches of a probe to a model in the same superfamily were ignored (this is the opposite of the superfamily benchmarking above, and provided a measure of recognition of sequences with extremely distant similarities). The requirement for a cross-superfamily match meant that some of the probes from the superfamily tests could not possibly be correctly assigned in the experiments (because no models were available from another superfamily in the same fold); these were removed, leaving 675 probes (and possible correct assignments). To expand the size of the test set, we added additional probes from smaller superfamilies which had not been used to make any models because they did not have sufficient structural representatives to meet our initial superfamily filtration criteria. Inclusion of these additional test sequences increased the size of the test set substantially, from 675 sequences from 88 superfamilies to 933 sequences from 291 superfamilies. Thus, the reported theoretical EPQ values for fold recognition (Figure 4) are based on 933 possible correct assignments. For both single-master HMMs and SLAHMMs, all models were used in the benchmarking, including those for which no qualified probe sequence was available. This was done to maintain a comparable "noise background" for the different model sets, which correct matches would have to score above.
During initial testing, our benchmarking revealed that some aspects of the SCOP classification can lead to misleading results. As we have not seen these issues described in the literature, we provide an overview of them here. A number of probes were assigned with low E-values to structures that were incorrect according to their SCOP classification, but upon further consideration could arguably be seen to be valid profile matches. These "false positive" hits can be caused by two basic problems: possible SCOP underpredictions and HMM artifact generation. The discovery of these problems required both general and specific modifications to the experiment, because false positives that are in fact correct matches can make more sensitive methods appear to perform worse.
Underpredictions refer to cases where the SCOP authors did not group similar structures into the same fold grouping, even though structural comparison indicates a similar overall fold (and perhaps even an evolutionary relationship). Hence, the structures are grouped only into the same class in the SCOP hierarchy, a relationship that is not considered to be a correct match for a structure assignment technique. In order to suppress the negative effect of these matches on the results, it was necessary to treat them as fold level matches (i.e. these matches would be ignored in the tabulation of correct and incorrect results).
Examples of possible underpredictions detected in the SCOP database (model examples, built from the indicated domains; the total number of hits and best E-value for each model were also tabulated):

d1lqta2 et al. (except m1gtea4)
d1iow_1 et al. (except d1ehia1)
d3grs_2 et al. (except d1gpea1)
d1gsoa2 et al. (except d1ehia1)
All of the possible underpredictions our HMMs detected had some similarity to the "Rossmann-like folds". Most were α/β/α sandwiches, and all shared a minimal core formed by a parallel β-sheet with a strand order of 2134 (many of the folds had elaborations on this sheet that inserted strands on either or both ends). It seems likely that the regular α/β alternation in these domains produced a distinctive generalized sequence signal that our HMMs detected. It is unclear if these matches were simply the result of analogy or a true homologous relationship. To our knowledge, detectable sequence similarity between these folds has not been previously reported.
Artifact generation refers to cases where, because of a SCOP domain's architecture, models built using that SCOP domain incorporated sequences related to other SCOP domains. Then, when the experiment was run, these models correctly recognized these other SCOP domains, but this recognition was treated as an "error" because these domains were not part of the same SCOP superfamily as the domain used to initiate the model.
Specifically, such artifact generation occurred where a protein consisted of two interacting fragments that flank a central fragment. In some cases, SCOP split such a protein into two compact domains, with one domain consisting of the central fragment, and the other consisting of the two flanking regions joined together at the break points into a single "composite domain". When this flanking domain was used to build an initial ClustalW alignment in the protocol described above, the central fragment was incorporated via the homologous sequences brought into the initial alignment (it was also possible for the central fragment to be brought into the alignment during the HMM iteration phase). The resulting models then recognized the central fragment as a homolog.
Only two superfamily pairings displayed these types of errors for multiple domain and model matches. In one pair, the domains of b.92.1 (composite domain of metallo-dependent hydrolases) flank those of c.1.9 (metallo-dependent hydrolases). In the other, some (but not all) domains of c.4.1 (nucleotide binding domain) flank those of c.3.1 (FAD/NAD(P)-binding domain). These matches were therefore ignored in the collection of the results.
Cases involving other superfamilies were also detected, but in every one of these cases, the errors did not occur for multiple probes and models across the superfamily. They only occurred for a probe and its corresponding flanking domains with the same PDB ID and chain designation. Therefore, all of these cases could easily be dealt with by adding a simple prohibition to the collection of results: any errors where a probe and matching model shared both the same PDB ID and chain were ignored. This prohibition automatically removed most artifactual matches, while having a minimal effect on the collection of legitimate errors (identical PDB ID and chain designations are rare for SCOP domains).
Although the methods above removed all obvious cases where correct hits were mislabeled as errors, it was important to institute a generalized method to help suppress the effect of these sorts of errors, in the event that some were missed. This was done by limiting the number of errors counted for a given probe to one per incorrect superfamily match. This had the effect of allowing artifacts and underpredictions to add only one incorrect match to the total for each probe, as opposed to several. At the same time, it would be expected that limiting counted errors in this fashion would have only a small effect on the counting of legitimate errors, as these sorts of errors tend to be random and so come from multiple different superfamilies for a given probe. As described above, only the first correct match of a probe to its superfamily was counted. By counting each incorrect superfamily only once per probe, the experiment effectively asked the question: at a given score cutoff, has this probe been assigned to the correct superfamily, and how many incorrect superfamilies scored higher?
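The probe-centric tally described above can be summarized in a small scoring function. This sketch assumes the fold-level and artifact exclusions (same-fold matches, shared PDB ID and chain) have already been filtered out of the hit list:

```python
def score_probe(hits, true_sf, cutoff):
    """Probe-centric tally at a given E-value cutoff: at most one
    correct assignment (the probe's true superfamily) and at most one
    counted error per incorrect superfamily.
    hits: list of (superfamily, evalue) pairs for one probe."""
    correct = False
    wrong_sfs = set()
    for sf, evalue in hits:
        if evalue > cutoff:
            continue
        if sf == true_sf:
            correct = True          # counted once, however many hits
        else:
            wrong_sfs.add(sf)       # each wrong superfamily counted once
    return correct, len(wrong_sfs)
```

Sweeping the cutoff over the observed E-values would then trace out the coverage vs. error curves reported in the Results.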
We thank Ilya Shindyalov, Lynn Fink, Michael Gribskov, and Shankar Subramaniam for helpful discussions, and Gerard Manning for his critical reading of the manuscript. We also thank Ilya Shindyalov for assistance with the CE software, and Michael Gribskov for the provision of a Perl module used in this work. This work was supported in part by NIH grant GM63208.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.