PSI-BLAST-ISS: an intermediate sequence search tool for estimation of the position-specific alignment reliability
© Margelevičius and Venclovas; licensee BioMed Central Ltd. 2005
Received: 17 March 2005
Accepted: 21 July 2005
Published: 21 July 2005
Protein sequence alignments have become indispensable for virtually any evolutionary, structural or functional study involving proteins. Modern sequence search and comparison methods combined with rapidly increasing sequence data often can reliably match even distantly related proteins that share little sequence similarity. However, even highly significant matches generally may have incorrectly aligned regions. Therefore when exact residue correspondence is used to transfer biological information from one aligned sequence to another, it is critical to know which alignment regions are reliable and which may contain alignment errors.
PSI-BLAST-ISS is a standalone Unix-based tool designed to delineate reliable regions of sequence alignments as well as to suggest potential variants in unreliable regions. The region-specific reliability is assessed by producing multiple sequence alignments in different sequence contexts followed by the analysis of the consistency of alignment variants. The PSI-BLAST-ISS output enables the user to simultaneously analyze alignment reliability between query and multiple homologous sequences. In addition, PSI-BLAST-ISS can be used to detect distantly related homologous proteins. The software is freely available at: http://www.ibt.lt/bioinformatics/iss.
PSI-BLAST-ISS is an effective reliability assessment tool that can be useful in applications such as comparative modelling or analysis of individual sequence regions. It compares favorably with existing similar software in both performance and functional features.
Protein sequence alignments are at the heart of many biological applications, such as sequence database searches, annotation of new sequences, inference of functional regions and comparative protein modeling. Modern sequence comparison methods (e.g. PSI-BLAST [1]) can often reliably establish an evolutionary link between two proteins and align them even if they share little sequence similarity. However, the resulting significant match between these protein sequences may well include incorrectly aligned regions that are impossible to identify by straightforward inspection. Usually, the lower the sequence similarity, the more challenging it is to distinguish alignment regions that can be trusted from those that may contain errors. Yet such a distinction is very important if the exact correspondence of residue positions in sequence alignments is used to extrapolate biological information from one protein to another. Modeling protein structure by comparison (comparative modeling), identification of active site residues and selection of sites for point mutations are just a few examples where the reliability of aligned positions is critical.
The importance of delineating reliable alignment regions was recognized more than a decade ago; however, earlier studies focused on pairwise alignments [2–5]. Currently, due to abundant sequence data, most protein sequence comparisons are performed within the context of multiple homologs, and the importance of pairwise alignments has diminished. By including multiple homologous sequences, methods such as PSI-BLAST are able to reliably detect more distant evolutionary links and also produce more accurate alignments. Unfortunately, even the most advanced sequence alignment methods make mistakes, and the identification of reliable alignment regions remains an important problem. Estimation of position-specific alignment reliability is being addressed in some recent multiple sequence alignment methods (e.g. [6, 7]). However, in the multiple alignment case the position-specific reliability index estimates the overall proportion of correct pairwise matches in each alignment column without specifying the contribution of individual sequences. Yet in applications such as comparative modeling it is usually more important to know the position-specific alignment reliability for a given sequence pair than for the whole set of aligned sequences. Recently, a growing understanding of the importance of the problem has led to several studies aimed at identifying reliable alignment regions for a pair of sequences within the context of multiple homologs. For example, one of these studies found that a substantial number of misaligned positions could be removed using near-optimal alignment information [8]. Two other recent methods predict reliable alignment regions either directly from a generated sequence profile [9, 10] or using a consensus result of several alignment algorithms [11, 12]. Both of the latter methods are implemented as web-based servers, which makes them easily accessible and simple to use, but not without certain limitations.
For example, both servers require that one of the two sequences in the alignment would have a corresponding PDB structure, which in turn would have to be present in local databases used by these servers.
Here, we present the PSI-BLAST Intermediate Sequence Search tool (PSI-BLAST-ISS), which is primarily designed to help identify reliable regions of the alignment as well as to suggest potential alignment variants in unreliable regions. In comparative modeling, PSI-BLAST-ISS can also help identify the best-matching structural templates. In addition, PSI-BLAST-ISS can be used to detect remote homologs that cannot be identified by a straightforward single PSI-BLAST search. However, it should be noted that the detection of remote homologs, unlike in the original and subsequent implementations of the Intermediate Sequence Search (ISS) strategy [13–17], is not the main purpose of our tool.
Since PSI-BLAST-ISS might be most useful in comparative modeling, we refer to the sequence pair of interest as the target (query) and the template (reference) sequences throughout the article. However, it should be emphasized that the tool can be applied to any protein sequences that can be linked through common homologs, regardless of whether a three-dimensional structure is available for any of them.
The main idea of PSI-BLAST-ISS is to obtain a number of alignment variants for the sequence pair of interest (target and template) and analyze their consistency. This idea stems from a previous manual analysis of multiple PSI-BLAST alignment variants, which suggested that regions where the variants agree are likely to be aligned correctly and to display close structural similarity [18].
As input, PSI-BLAST-ISS takes the target sequence in FASTA format and a file containing a number of parameters that enable the user both to specify sequence databases and to control the execution of the whole ISS procedure at every step. The target sequence is initially searched against a sequence database to collect intermediate sequences (step 1). By default, the target is searched against the non-redundant sequence database. Intermediate sequences are collected from the user-specified PSI-BLAST iteration in the resulting output file using the expectation value (E-value) threshold provided as a parameter. A reduced representative sequence set is then constructed by filtering the initial set to a user-defined percentage of sequence similarity with CD-HIT (Li et al., 2001), a sequence clustering program (step 2). Optionally, the user may impose a strict limit on the number of sequences to be included in the representative set or even supply an independently pre-selected set of sequences. A PSI-BLAST-ISS user can also choose whether to collect intermediate sequences as complete protein sequences or just as the sequence fragments matching the target sequence. When the target sequence represents a domain that is also found in multidomain proteins, the ability to select only the homologous fragments of matching sequences may help to keep the ISS procedure from straying into the realm of unrelated sequences. Each of the intermediate sequences is used to generate a sequence profile in the form of a PSI-BLAST checkpoint file by running a user-defined number of PSI-BLAST iterations (step 3). The resulting checkpoint files are then used to restart PSI-BLAST searches in a second sequence database specified by the user (step 4). This database is expected to include the sequences of both proteins of interest (target and template).
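The collection of intermediate sequences in step 1 can be illustrated with a minimal Python sketch. It is not the PSI-BLAST-ISS code (the tool itself is written in Perl); the `Hit` record, the function name and its parameters are hypothetical, and both the parsing of PSI-BLAST output and the CD-HIT redundancy filtering of step 2 are deliberately left out. The sketch only shows how an E-value threshold, an optional size limit and the choice between complete sequences and matching fragments might interact.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Hit:
    """One database match to the target (hypothetical record layout)."""
    seq_id: str
    evalue: float
    full_sequence: str
    start: int  # 0-based start of the region matching the target
    end: int    # exclusive end of the region matching the target


def select_intermediates(hits: List[Hit],
                         evalue_threshold: float,
                         fragments_only: bool = False,
                         max_sequences: Optional[int] = None
                         ) -> List[Tuple[str, str]]:
    """Return (seq_id, sequence) pairs to be used as intermediate sequences."""
    # Keep only matches at or below the E-value threshold, best match first.
    kept = sorted((h for h in hits if h.evalue <= evalue_threshold),
                  key=lambda h: h.evalue)
    # Optionally cap the size of the set.
    if max_sequences is not None:
        kept = kept[:max_sequences]
    selected = []
    for h in kept:
        # Either the fragment matching the target or the complete sequence.
        seq = h.full_sequence[h.start:h.end] if fragments_only else h.full_sequence
        selected.append((h.seq_id, seq))
    return selected
```

Selecting fragments rather than complete sequences corresponds to the option described above for targets that represent a single domain of multidomain proteins.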
In a common situation, when the template represents a structural template intended for use in comparative modeling, such a database may be derived by simply appending the target sequence to the PDB sequence database. In this case there is no need to define the template(s) in advance since they are identified automatically. Searches against the second database generate corresponding multiple sequence alignments that contain a number of target-template alignment variants. The significance of the target-template alignment is then determined by counting the number of alignment variants that satisfy the expectation value threshold (step 5). Both the required number of variants and the E-value threshold can be specified by the user. The significant target-template alignment variants are extracted and merged into a single multiple sequence alignment, in which the target sequence is aligned with multiple instances of the template sequence according to the different alignment variants (step 6). Such an alignment immediately reveals the regions where most (or all) alignment variants are identical and thus might be considered reliable, as well as those regions where there is little agreement between alignment variants and which are therefore unreliable. It is often useful to analyze position-specific reliability for target alignments with multiple templates. However, it may be inconvenient to compare many multiple sequence alignments obtained by PSI-BLAST-ISS at once. To make this task easier we introduced a step (step 7) that reduces the template alignment variants to a consensus template sequence for each of the target-template alignments. The consensus sequence is generated by analyzing each column of the alignment. A residue is considered conserved in the consensus template sequence if its repetition count in the corresponding position exceeds the user-defined conservation threshold.
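The column-by-column consensus assignment of step 7 can be sketched in Python as follows. This is an illustration under stated assumptions rather than the actual implementation: the threshold is interpreted as a fraction of the alignment variants (as the values 0.8 and 0.9 used later suggest), "exceeds" is read as a strict comparison, gap characters are counted like any other symbol, and the placeholder `x` for non-conserved positions is our own choice.

```python
from collections import Counter
from typing import List


def consensus_template(variants: List[str],
                       threshold: float = 0.8,
                       unreliable: str = "x") -> str:
    """Collapse aligned template variants into a single consensus string."""
    if not variants:
        raise ValueError("at least one alignment variant is required")
    length = len(variants[0])
    if any(len(v) != length for v in variants):
        raise ValueError("all variants must have the same aligned length")
    consensus = []
    for column in zip(*variants):
        # Most frequent symbol (residue or gap) in this alignment column.
        symbol, count = Counter(column).most_common(1)[0]
        # The symbol is kept only if its share of the variants
        # strictly exceeds the conservation threshold (assumption).
        if count / len(variants) > threshold:
            consensus.append(symbol)
        else:
            consensus.append(unreliable)
    return "".join(consensus)
```

With five variants, a residue shared by four of them (a fraction of exactly 0.8) is not retained at a threshold of 0.8 under the strict comparison assumed here, whereas a residue shared by all five is.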
PSI-BLAST-ISS is currently implemented as a standalone UNIX-based tool meant to be installed and run locally. It consists of fairly independent modules linked together using Perl. Some of the sequence data processing tasks in PSI-BLAST-ISS are handled by a few modified SEALS scripts [19].
Results and Discussion
PSI-BLAST-ISS produces several types of results. Perhaps the most informative output file is the FASTA-formatted sequence alignment between the target and automatically detected multiple template sequences, each represented as a consensus sequence derived from multiple alignment variants. The definition line for each consensus template sequence indicates the strength of the consensus in the interval from 0 to 1 (0 – no consensus, 1 – complete agreement) and the number of significant target-template alignment variants that were used to produce the consensus. This output provides a possibility to simultaneously assess the alignment reliability between the target and multiple templates in a region-specific manner. In addition, the consensus strength and the number of significant target-template alignments may help in selecting templates that are structurally most consistent with the target. PSI-BLAST-ISS also produces individual FASTA-formatted multiple sequence alignment files for each target-template pair, where the target is aligned with multiple copies of the same template according to obtained multiple alignment variants. These alignment files provide a visual assessment of the region-specific alignment reliability as well as candidate alignment variants if further analysis of unreliably aligned stretches is needed. Finally, all the template sequences represented in the consensus alignment are collected together in a separate output file.
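As an illustration of how the consensus alignment might be post-processed, the following Python sketch lists the target residue ranges that are aligned to conserved consensus positions. The exact symbols used in the PSI-BLAST-ISS output files are not described here, so the non-conserved placeholder `x` and the gap character `-` are assumptions, as is the decision to end a region at any gap.

```python
from typing import List, Tuple


def reliable_regions(target_aln: str,
                     consensus_aln: str,
                     unreliable: str = "x",
                     gap: str = "-") -> List[Tuple[int, int]]:
    """Return 1-based inclusive target residue ranges with conserved matches."""
    if len(target_aln) != len(consensus_aln):
        raise ValueError("aligned sequences must have the same length")
    regions = []
    start = None  # first target residue of the region being extended
    pos = 0       # number of target residues seen so far
    for t, c in zip(target_aln, consensus_aln):
        if t != gap:
            pos += 1
        is_reliable = t != gap and c != gap and c != unreliable
        if is_reliable and start is None:
            start = pos
        elif not is_reliable and start is not None:
            # The region ended at the last reliable target residue.
            end = pos - 1 if t != gap else pos
            regions.append((start, end))
            start = None
    if start is not None:
        regions.append((start, pos))
    return regions
```

Ranges are reported in target residue numbering, which is the numbering a user would typically need when deciding which parts of a comparative model to trust.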
Performance of PSI-BLAST-ISS in the assessment of alignment reliability
As for any method, it is important to know how PSI-BLAST-ISS performs relative to other available methods. At the time of this study we were aware of only two publicly available servers that estimate the position-specific reliability of sequence alignment using information from multiple sequences: the Consensus server [12] and SQUARE [9]. Of these, PSI-BLAST-ISS could be directly compared only with the Consensus server, since SQUARE estimates reliability only for the supplied alignment and does not address the problem of alignment itself.
Table 1. Comparison of PSI-BLAST-ISS and the Consensus server performance
Seq id, %
PSI-BLAST-ISS (consensus, 0.8)
PSI-BLAST-ISS (consensus, 0.9)
154, 246, 281–285
154, 246, 281–285
113–119, 121–124, 128–132
5–6, 40, 42–43, 66, 158
6, 40, 42–43, 66, 158
136, 245, 325–327
11, 13–14, 16
11, 13–14, 16
207–208, 443, 478–479
10–11, 20–22, 24–26, 28–30, 65–66
76, 115–126, 128–129, 162–165, 261, 302
76, 162, 305
44, 126, 249
89–92, 95, 119–120
8, 43, 73
18, 39, 150–151
144–145, 202–204, 258
Average per target:
Average per target:
Average per target:
The data in Table 1 indicate that, with a consensus assignment threshold of 0.8, PSI-BLAST-ISS produces more extensive coverage than the Consensus server at a slightly higher rate of discrepancies with DaliLite structure-based alignments. Visual inspection of the superimposed structures revealed that most of these alignment discrepancies are minor. Some of them occur simply due to a difference in gap placement when, for example, one of the structures in the pair has a single-residue insertion or deletion. Other discrepancies are short stretches at the transition from a conserved secondary structure into a non-conserved loop and can hardly be considered alignment errors. Most of these minor discrepancies disappear once the consensus assignment stringency is increased to 0.9. While the coverage then becomes only slightly less extensive than that of the Consensus server, the discrepancy rate is almost two times lower. Thus, increasing the stringency of the PSI-BLAST-ISS consensus assignment lowers the chances of including both non-conserved structural motifs and alignment errors within regions assigned as reliable.
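The two quantities discussed above, coverage and discrepancies with a structure-based alignment, can be computed along the lines of the following Python sketch. The precise definitions used for Table 1 are not reproduced here, so this is only one plausible reading: an alignment is represented as a mapping from target residue number to template residue number, coverage is the number of target positions assigned as reliable, and a reliable position counts as a discrepancy whenever the reference alignment pairs that target residue with a different template residue or leaves it unaligned.

```python
from typing import Dict, Tuple


def coverage_and_discrepancies(reliable: Dict[int, int],
                               reference: Dict[int, int]
                               ) -> Tuple[int, int, float]:
    """Compare reliable aligned pairs with a reference alignment.

    reliable:  target position -> template position, reliable positions only
    reference: target position -> template position, e.g. structure-based
    Returns (coverage, discrepancies, discrepancy rate).
    """
    coverage = len(reliable)
    # A position disagrees if the reference pairs it differently
    # or does not align it at all (assumption of this sketch).
    discrepancies = sum(1 for t_pos, tmpl_pos in reliable.items()
                        if reference.get(t_pos) != tmpl_pos)
    rate = discrepancies / coverage if coverage else 0.0
    return coverage, discrepancies, rate
```

Running the function on the reliable positions obtained at two consensus thresholds would show the trade-off described above: a higher threshold is expected to give somewhat lower coverage and fewer discrepancies.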
Utility of multiple alignment variants
Selection of representative templates (homologs)
Often there is a need to choose a single or just a few best templates from a large number of distantly related target homologs. This becomes a challenge at low sequence similarity, when the sequence signal is no longer a good indicator of structural relatedness (for example, see Fig. 1 in ). The number of significant target-template variants retained by PSI-BLAST-ISS for generating the consensus template sequence might guide such a selection of the template(s). The higher the number of target-template alignment variants that are accepted as significant, the closer the structural relationship between target and template that might be expected. This number is directly available from the file containing the alignment between the target and the individual template and is also reported within the definition line for each template in the consensus alignment file.
Detection of distant evolutionary relationships (homologous folds)
Multiple initiation points in the PSI-BLAST-ISS procedure ensure that the space of homologous sequences is explored more exhaustively than in the case of a single query-based search. Owing to that, PSI-BLAST-ISS may uncover distant evolutionary relationships that are not seen if only a single query-initiated PSI-BLAST search is performed. In other words, PSI-BLAST-ISS may serve as a transitive PSI-BLAST tool for the detection of homologous folds. To test this PSI-BLAST-ISS capability we used CASP6 Homologous Fold Recognition targets (FR/H). These targets do have evolutionarily related structures in the PDB database, but these relationships could not be detected by PSI-BLAST searches initiated with the target sequence. For this test we required at least one significant match to a PDB structure (template) from all intermediate sequence searches. To make the comparison compatible with the CASP6 setting we only considered structural templates that were available from PDB at the time of the CASP6 experiment. We also excluded from consideration those FR/H CASP6 targets for which at least one domain could be matched to a PDB structure using a straightforward PSI-BLAST search. As a result, out of fourteen considered FR/H targets, PSI-BLAST-ISS was able to identify related structures for four (1rxx for T0203, 1pk6 and several others for T0206, 1jx7 for T0224, 1qpn and other structures for T0228). An interesting case is T0228. While a direct PSI-BLAST search initiated with the T0228 sequence failed to find any related structure, PSI-BLAST-ISS identified several structures producing over ten significant matches each (a default parameter). The latter result stresses the fact that sometimes the space of homologous sequences might be skewed in such a manner that a single sequence search may not be very effective in identifying important relationships.
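The acceptance criterion used in this test, a minimum number of significant matches to a template across all intermediate sequence searches, can be sketched in Python as follows. The data layout (one list of `(template_id, evalue)` pairs per intermediate search), the function name and the choice to count a template at most once per search are assumptions made for illustration; the template identifiers in any example input would be placeholders.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def detected_templates(search_results: Iterable[List[Tuple[str, float]]],
                       evalue_threshold: float,
                       min_matches: int = 1) -> Dict[str, int]:
    """Count significant matches per template over all intermediate searches."""
    counts: Counter = Counter()
    for hits in search_results:
        # Count each template at most once per intermediate search.
        significant = {tid for tid, evalue in hits if evalue <= evalue_threshold}
        counts.update(significant)
    # Keep templates supported by at least `min_matches` searches.
    return {tid: n for tid, n in counts.most_common() if n >= min_matches}
```

Setting `min_matches=1` mirrors the requirement of at least one significant match used for the fold recognition test, while a larger value corresponds to the stricter setting mentioned for T0228.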
We have described PSI-BLAST-ISS, a tool for delineating reliable alignment regions and suggesting possible alignment choices in unreliable yet structurally conserved regions. PSI-BLAST-ISS might be most useful in assessing target-template alignments in comparative modeling or in judging whether the interpolation of biological information directly from alignments is feasible for individual sequence regions. Unlike the two other recently published methods for predicting reliable alignment regions (SQUARE and the Consensus server), PSI-BLAST-ISS is not confined to reference (template) sequences with known three-dimensional structure. The performance of PSI-BLAST-ISS in alignment reliability estimation was directly compared with that of the Consensus server. We find that on a set of CASP6 targets PSI-BLAST-ISS is on average able to produce either more extensive coverage of confident alignment or fewer errors, depending on the selected consensus stringency. The functionality of PSI-BLAST-ISS also extends to the detection of non-apparent distant homologous relationships.
Availability and requirements
Project name: The PSI-BLAST intermediate sequence search tool (PSI-BLAST-ISS)
Project home page: http://www.ibt.lt/bioinformatics/iss
Operating systems: Unix-based platforms
Programming language: Perl
Other requirements: locally installed PSI-BLAST and CD-HIT (optional)
Any restriction to use by non-academics: None
This research project was supported in part by grants from Howard Hughes Medical Institute and the 6th European Community Framework Programme.
1. Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 1997, 25(17):3389–3402. 10.1093/nar/25.17.3389
2. Chao KM, Hardison RC, Miller W: Locating well-conserved regions within a pairwise alignment. Comput Appl Biosci 1993, 9(4):387–396.
3. Mevissen HT, Vingron M: Quantifying the local reliability of a sequence alignment. Protein Eng 1996, 9(2):127–132.
4. Schlosshauer M, Ohlsson M: A novel approach to local reliability of sequence alignments. Bioinformatics 2002, 18(6):847–854. 10.1093/bioinformatics/18.6.847
5. Vingron M, Argos P: Determination of reliable regions in protein sequence alignments. Protein Eng 1990, 3(7):565–569.
6. Do CB, Mahabhashyam MS, Brudno M, Batzoglou S: ProbCons: Probabilistic consistency-based multiple sequence alignment. Genome Res 2005, 15(2):330–340. 10.1101/gr.2821705
7. Poirot O, O'Toole E, Notredame C: Tcoffee@igs: A web server for computing, evaluating and combining multiple sequence alignments. Nucleic Acids Res 2003, 31(13):3503–3506. 10.1093/nar/gkg522
8. Cline M, Hughey R, Karplus K: Predicting reliable regions in protein sequence alignments. Bioinformatics 2002, 18(2):306–314. 10.1093/bioinformatics/18.2.306
9. Tress ML, Grana O, Valencia A: SQUARE--determining reliable regions in sequence alignments. Bioinformatics 2004, 20(6):974–975. 10.1093/bioinformatics/bth032
10. Tress ML, Jones D, Valencia A: Predicting reliable regions in protein alignments from sequence profiles. J Mol Biol 2003, 330(4):705–718. 10.1016/S0022-2836(03)00622-3
11. Prasad JC, Comeau SR, Vajda S, Camacho CJ: Consensus alignment for reliable framework prediction in homology modeling. Bioinformatics 2003, 19(13):1682–1691. 10.1093/bioinformatics/btg211
12. Prasad JC, Vajda S, Camacho CJ: Consensus alignment server for reliable comparative modeling with distant templates. Nucleic Acids Res 2004, 32(Web Server issue):W50–4.
13. Karplus K, Barrett C, Hughey R: Hidden Markov models for detecting remote protein homologies. Bioinformatics 1998, 14(10):846–856. 10.1093/bioinformatics/14.10.846
14. Li W, Pio F, Pawlowski K, Godzik A: Saturated BLAST: an automated multiple intermediate sequence search used to detect distant homology. Bioinformatics 2000, 16(12):1105–1110. 10.1093/bioinformatics/16.12.1105
15. Park J, Karplus K, Barrett C, Hughey R, Haussler D, Hubbard T, Chothia C: Sequence comparisons using multiple sequences detect three times as many remote homologues as pairwise methods. J Mol Biol 1998, 284(4):1201–1210. 10.1006/jmbi.1998.2221
16. Park J, Teichmann SA, Hubbard T, Chothia C: Intermediate sequences increase the detection of homology between sequences. J Mol Biol 1997, 273(1):349–354. 10.1006/jmbi.1997.1288
17. Salamov AA, Suwa M, Orengo CA, Swindells MB: Combining sensitive database searches with multiple intermediates to detect distant homologues. Protein Eng 1999, 12(2):95–100. 10.1093/protein/12.2.95
18. Venclovas Č: Comparative modeling of CASP4 target proteins: combining results of sequence search with three-dimensional structure assessment. Proteins 2001, Suppl 5: 47–54. 10.1002/prot.10008
19. Walker DR, Koonin EV: SEALS: a system for easy analysis of lots of sequences. Proc Int Conf Intell Syst Mol Biol 1997, 5: 333–339.
20. Cozzetto D, Di Matteo A, Tramontano A: Ten years of predictions ... and counting. Febs J 2005, 272(4):881–882.
21. Holm L, Park J: DaliLite workbench for protein structure comparison. Bioinformatics 2000, 16(6):566–567. 10.1093/bioinformatics/16.6.566
22. Venclovas Č: Comparative modeling in CASP5: progress is evident, but alignment errors remain a significant hindrance. Proteins 2003, 53 Suppl 6: 380–388. 10.1002/prot.10591
23. Venclovas Č, Zemla A, Fidelis K, Moult J: Assessment of progress over the CASP experiments. Proteins 2003, 53 Suppl 6: 585–595. 10.1002/prot.10530
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.