Analysis of superfamily specific profile-profile recognition accuracy
© Casbon and Saqi; licensee BioMed Central Ltd. 2004
Received: 09 August 2004
Accepted: 16 December 2004
Published: 16 December 2004
Annotation of sequences that share little similarity to sequences of known function remains a major obstacle in genome annotation. Some of the best methods of detecting remote relationships between protein sequences are based on matching sequence profiles. We analyse the superfamily specific performance of sequence profile-profile matching. Our benchmark consists of a set of 16 protein superfamilies that are highly diverse at the sequence level. We relate the performance to the number of sequences in the profiles, the profile diversity and the extent of structural conservation in the superfamily.
The performance varies greatly between superfamilies with the truncated receiver operating characteristic, ROC10, varying from 0.95 down to 0.01. These large differences persist even when the profiles are trimmed to approximately the same level of diversity.
Although the number of sequences in the profile (profile width) and the degree of sequence variation within positions in the profile (profile diversity) contribute to accurate detection, there are other superfamily-specific factors.
Currently, some of the best methods for detecting relationships between protein sequences below the so-called twilight zone of sequence similarity are offered by iterative search algorithms such as PSI-BLAST, which in effect compare sequences to a profile. More recently, profile-profile matching protocols [2–5] have been shown to offer considerable benefits over sequence-profile matching.
Here, we examine how the performance of remote homolog detection by profile-profile methods varies between particular superfamilies. Since superfamilies are believed to constitute sets of remote homologs, detection of same-superfamily relationships is an important task for bioinformatics, and with the increasing number of structures becoming available, improvement in this area will help build a complete structural map of sequence space. In this paper, we use a set of superfamilies that are very sequence diverse to benchmark profile-profile methods. By sequence diverse, we mean that the superfamily has many domains that show no detectable sequence similarity to each other; this lack of detectable sequence similarity means this set is a difficult benchmark for remote homolog detection methods.
Previous work has shown that the performance of profile-profile methods is chiefly determined by the width and diversity of the profiles. By profile width we mean the number of sequences in the profile (as distinct from profile length), and by diversity we mean the degree of sequence variation within positions in the profile. In particular, Panchenko suggested that there may be an optimum level of profile diversity, whilst Grishin suggested that the inclusion of as many related sequences as possible gives maximum performance.
We examine the performance of profile-profile matching with regard to specific superfamilies with both the full profiles generated from a PSI-BLAST search, and with profiles that are trimmed to similar width and diversity. Significant differences in recognition performance exist between superfamilies for both the full and trimmed profiles. This suggests that performance of profile-profile matching is not simply a function of profile width and diversity. We examine how the performance relates to the structural diversity of superfamilies and find that structurally conserved superfamilies are recognised more successfully than structurally diverse superfamilies.
Width and diversity of profiles
Profile width and Neff for dataset
Superfamily specific performance of remote homolog detection
For the full profiles, the alpha/beta-Hydrolases, Cytochrome c and S-adenosyl superfamilies perform well, all having ROC10 values ≥ 0.7. The fibronectin, thioredoxin-like, (trans)glycosidases, immunoglobulin and FAD/NAD(P)-binding superfamilies have ROC10 > 0.2, and the remaining 8 superfamilies all perform poorly, with ROC10 values below 0.1.
After trimming, although performance is reduced, the overall pattern of performance still remains. All the well recognised superfamilies (with the exception of the (trans)glycosidases and thioredoxin-like) still show ROC10 values greater than 0.2, while the rest are still less than 0.1.
The fact that the performance varies greatly between superfamilies despite the trimming of the profiles indicates that profile generation is not the only limiting step in the performance of profile-profile methods. One might have thought that, for instance, the poor recognition of 4-helical cytokines is due to the small number of homologs collected in the profile-building stage. Whilst this may still be true, it is not necessarily so: the Cytochrome c superfamily still shows a ROC10 of 0.7 when using trimmed profiles despite having, on average, fewer than 20 sequences in the profile.
Relation between structural diversity, sequence conservation and recognition performance
It may be the case that despite the absence of any discernible global sequence similarity within our dataset some local patterns of conservation do exist. These patterns may be present more strongly in some superfamilies than in others. In order to examine this possibility we constructed multiple structure based sequence alignments for each of the 16 superfamilies and then looked down the columns of the multiple sequence alignments to examine the extent of conservation at each position (see Methods section).
Our results suggest that profile-profile methods can detect remotely related sequences for some superfamilies significantly better than for others. In our dataset the sequence identity between domains in all the superfamilies is low (not greater than 10% as defined by ASTRAL). Although the mean width and diversity of the profiles vary across the superfamilies, this does not appear to be the only factor contributing to the differences in detection.
The effect of the trimming varied depending on the superfamily. For the best performing superfamily (alpha/beta hydrolases) the trimming reduced the performance by about 50% (from 0.95 to 0.43), but the effect on the rank was small, dropping from first place to second. Similarly, the trimming impacted significantly on the performance of the S-adenosyl methyl transferases, with ROC10 dropping from 0.70 to 0.22. However, trimming had no effect on performance for the FAD/NAD(P)-binding superfamily, and only resulted in a small reduction in performance for the immunoglobulin and cytochrome c superfamilies. Importantly, the membership of the top ranking superfamilies in terms of performance did not change after trimming.
Although the overall level of sequence similarity within our dataset is low (not more than 10% identity), the different superfamilies exhibit different levels of conservation at positions within the multiple structure based alignments. These conserved positions may facilitate recognition. The extent to which they constrain the structures, leading to less diverse alignments, is unclear. We recognise also that our measure of conservation and the use of RMSD as a measure of structural diversity both have their shortcomings. It would be interesting to identify and extract a conserved core and represent structural profiles as a combination of core profiles separated by regions of variable length.
There exist large superfamily-specific differences in the performance of profile-profile matching for the detection of remote sequence relationships. Some superfamilies can be detected far more successfully than others. The width and diversity of the profiles are important factors in successful recognition. However, these are not the only factors that contribute to these superfamily-specific differences.
Properties of the dataset
For each domain of each of the 16 superfamilies we executed a five round PSI-BLAST run against the non-redundant protein database nr (dated 5/2/04). We used the "-m 6" option to output a multiple alignment and the "-e 0.05" option to include only hits with e-values less than 0.05 in the alignment. Positions in the multiple alignment that correspond to gaps in the query were removed. We used the resulting multiple alignment as the profile for the query domain.
To produce trimmed profiles, we take the full profile and repeatedly remove the bottom sequence (corresponding to the most remote homolog) until a stopping criterion is reached. The stopping criterion is based on Neff, a statistic previously used for this task [1, 6, 7]. Neff is defined as the total number of different amino acids in a given column of a profile. Our stopping criterion was that Neff must be less than 8 in all non-gapped positions in the profile, where non-gapped positions are defined as those with a gap content of less than half.
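The trimming procedure described above can be sketched in a few lines of Python. This is a minimal illustration and not the code used in the study: it assumes the profile is a list of equal-length aligned strings with the query first and the most remote homolog last, that "-" marks a gap, and that gaps are ignored when counting distinct amino acids. The function names are hypothetical.

```python
GAP = "-"  # assumed gap character


def column_neff(column):
    """Neff as defined in the text: number of distinct amino acids in a column."""
    return len({residue for residue in column if residue != GAP})


def meets_stopping_criterion(alignment, max_neff=8):
    """True if every non-gapped column (gap content < 1/2) has Neff < max_neff."""
    for column in zip(*alignment):
        gap_fraction = column.count(GAP) / len(column)
        if gap_fraction < 0.5 and column_neff(column) >= max_neff:
            return False
    return True


def trim_profile(alignment, max_neff=8):
    """Remove the bottom sequence until the stopping criterion is reached."""
    rows = list(alignment)
    # Never remove the query itself (assumed to be the first row).
    while len(rows) > 1 and not meets_stopping_criterion(rows, max_neff):
        rows.pop()
    return rows
```

With a toy alignment and a lowered threshold, `trim_profile(["AC", "AD", "GE", "TF"], max_neff=3)` drops the last two rows, because only then does every column contain fewer than three distinct residues.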
We use the program COMPASS to perform the profile-profile matching. COMPASS performs a local alignment of a query profile to each member of a database of profiles. COMPASS uses a generalisation of PSI-BLAST profile-sequence scoring to score similarities between profiles and estimate the statistical significance of the score of the local alignment.
To assess the performance of profile-profile matching, each domain of each of the 16 superfamilies was used as a query and its sequence profile was matched against a library of sequence profiles representing the dataset. A profile database was created using the 543 profiles. When matching the profile of domain i of superfamily j, that profile itself was not included in the sequence profile library. This procedure was carried out twice: first with the full profiles, and then again with the trimmed profiles.
We use ROC10 as a statistic that describes the performance of the profiles for a particular superfamily. ROC_n is defined as $\mathrm{ROC}_n = \frac{1}{nT}\sum_{i=1}^{n} t_i$, where T is the total number of true hits possible and t_i is the number of true positives with a score better than the ith false hit. Variance in the ROC10 statistic was calculated using the method given in .
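As an illustration of the ROC_n definition, the following Python sketch computes the statistic from a ranked hit list. It is not the evaluation code used in the study. It assumes the hits are supplied best score first as booleans (True for a same-superfamily hit), and, as a further assumption, that when the list contains fewer than n false hits the missing false hits are treated as ranking below every listed hit. The function name is hypothetical.

```python
def roc_n(ranked_is_true, total_true, n=10):
    """Truncated ROC score: (1 / (n * T)) * sum of t_i over the first n false hits.

    ranked_is_true: booleans ordered from best to worst score (True = true hit).
    total_true:     T, the total number of true hits possible for this query.
    """
    if total_true <= 0:
        raise ValueError("total_true must be positive")
    true_seen = 0   # true positives ranked above the current position
    false_seen = 0  # false hits encountered so far
    area = 0        # running sum of t_i
    for is_true in ranked_is_true:
        if is_true:
            true_seen += 1
        else:
            false_seen += 1
            area += true_seen  # t_i for the i-th false hit
            if false_seen == n:
                break
    # Assumption: unobserved false hits rank below all listed hits.
    area += (n - false_seen) * true_seen
    return area / (n * total_true)
```

A perfect ranking, in which all true hits precede the first false hit, gives a score of 1, and a ranking in which n false hits precede every true hit gives 0.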
Structural diversity of superfamilies
To evaluate the structural diversity within each superfamily, each member of a superfamily was structurally compared to every other member using pairwise structural alignments produced by the program SAP. Since these domains do not share more than about 10% sequence identity, we would expect that they effectively capture the extent of structural variation within the superfamily. We obtain an average measure of structural similarity (root mean square deviation, RMSD) for each of the 16 superfamilies.
Structure based multiple alignments
To create a structure based multiple alignment of a superfamily, we first made all pairwise structural comparisons between all pairs within a superfamily using SAP [11, 12]. We then created a T-Coffee library for each pairwise comparison, where the score between two equivalenced residues i and j, at positions x_i and x_j in the superposition, is defined to be $((1 + \mathrm{RMSD})(1 + |x_i - x_j|))^{-1}$. A detailed explanation and analysis of this method is given in .
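The library score can be written as a small Python function. This is a sketch under stated assumptions rather than the authors' implementation: it takes x_i and x_j to be the 3-D coordinates of the two equivalenced residues after superposition and |x_i - x_j| to be the Euclidean distance between them, and it takes RMSD to be the value reported for the pairwise superposition. The function name is hypothetical.

```python
import math


def library_score(rmsd, x_i, x_j):
    """Score ((1 + RMSD) * (1 + |x_i - x_j|)) ** -1 for one residue pair.

    rmsd: RMSD of the pairwise superposition (assumed to be in angstroms).
    x_i, x_j: coordinates of the equivalenced residues (assumed 3-D points).
    """
    distance = math.dist(x_i, x_j)  # Euclidean distance, Python 3.8+
    return 1.0 / ((1.0 + rmsd) * (1.0 + distance))
```

The score is at most 1 (identical structures with coincident residues) and decreases as either the overall RMSD or the local separation of the residue pair grows.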
We used the Taylor Venn diagram to assign residues in a column of the multiple alignment to a given set. The sets are overlapping and they group together amino acids at differing levels of detail (e.g. the hydrophobic set includes the aromatic residues [FYWH] as a subset). However, we adopted a fairly general measure of conservation and marked a position (column) as conserved if 80% of the residues at that position could be assigned to any one set. The conservation measure for a superfamily was the number of conserved positions divided by the average length of domains in our dataset belonging to that superfamily. Only those columns in which at least 80% of positions were ungapped were considered.
JAC wishes to acknowledge the financial support from the Special Trustees of the Royal London Hospital.
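The conservation measure can be sketched as follows. The residue sets below are a small illustrative subset chosen for this example and are not the full set membership from the Taylor classification; only the aromatic set [FYWH] is taken from the text. The sketch also assumes that the 80% threshold is applied to the non-gap residues of a column, and that the average domain length can be approximated by the mean ungapped length of the aligned rows. All names are hypothetical.

```python
# Illustrative sets only; the real Taylor Venn diagram has more, overlapping sets.
EXAMPLE_SETS = {
    "aromatic": set("FYWH"),  # given in the text
    "aliphatic": set("ILV"),  # assumption for illustration
    "positive": set("KRH"),   # assumption for illustration
    "negative": set("DE"),    # assumption for illustration
}


def is_conserved(column, sets, threshold=0.8):
    """True if the column is sufficiently ungapped and fits any one set."""
    residues = [r for r in column if r != "-"]
    if len(residues) < threshold * len(column):
        return False  # fewer than 80% of positions ungapped: not considered
    needed = threshold * len(residues)
    return any(sum(r in members for r in residues) >= needed
               for members in sets.values())


def conservation_measure(alignment, sets, threshold=0.8):
    """Number of conserved columns divided by the mean ungapped row length."""
    lengths = [len(row.replace("-", "")) for row in alignment]
    mean_length = sum(lengths) / len(lengths)
    conserved = sum(is_conserved(col, sets, threshold) for col in zip(*alignment))
    return conserved / mean_length
```

For the toy alignment `["FKA", "YRC", "WHG", "HKT", "F-S"]`, the first column is entirely aromatic and the second is entirely positive among its four ungapped residues, so two of the three columns count as conserved.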
1. Altschul S, Madden T, Schaffer A, Zhang J, Zhang Z, Miller W, Lipman D: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 1997, 25(17):3389–402. doi:10.1093/nar/25.17.3389
2. Sadreyev R, Grishin N: COMPASS: a tool for comparison of multiple protein alignments with assessment of statistical significance. J Mol Biol 2003, 326:317–36. doi:10.1016/S0022-2836(02)01371-2
3. Sadreyev R, Baker D, Grishin N: Profile-profile comparisons by COMPASS predict intricate homologies between protein families. Protein Sci 2003, 12(10):2262–72. doi:10.1110/ps.03197403
4. Tang C, Xie L, Koh I, Posy S, Alexov E, Honig B: On the role of structural information in remote homology detection and sequence alignment: new methods using hybrid sequence profiles. J Mol Biol 2003, 334(5):1043–62. doi:10.1016/j.jmb.2003.10.025
5. Yona G, Levitt M: Within the twilight zone: a sensitive profile-profile comparison tool based on information theory. J Mol Biol 2002, 315(5):1257–75. doi:10.1006/jmbi.2001.5293
6. Panchenko A: Finding weak similarities between proteins by sequence profile comparison. Nucleic Acids Res 2003, 31(2):683–9. doi:10.1093/nar/gkg154
7. Sadreyev R, Grishin N: Quality of alignment comparison by COMPASS improves with inclusion of diverse confident homologs. Bioinformatics 2004, 20:818–28. doi:10.1093/bioinformatics/btg485
8. Chandonia J, Walker N, Lo Conte L, Koehl P, Levitt M, Brenner S: ASTRAL compendium enhancements. Nucleic Acids Res 2002, 30:260–3. doi:10.1093/nar/30.1.260
9. Murzin A, Brenner S, Hubbard T, Chothia C: SCOP: a structural classification of proteins database for the investigation of sequences and structures. J Mol Biol 1995, 247(4):536–40. doi:10.1006/jmbi.1995.0159
10. Schaffer A, Aravind L, Madden T, Shavirin S, Spouge J, Wolf Y, Koonin E, Altschul S: Improving the accuracy of PSI-BLAST protein database searches with composition-based statistics and other refinements. Nucleic Acids Res 2001, 29(14):2994–3005. doi:10.1093/nar/29.14.2994
11. Taylor W, Orengo C: Protein structure alignment. J Mol Biol 1989, 208:1–22. doi:10.1016/0022-2836(89)90084-3
12. Taylor W: Protein structure comparison using SAP. Methods Mol Biol 2000, 143:19–32.
13. Notredame C, Higgins D, Heringa J: T-Coffee: A novel method for fast and accurate multiple sequence alignment. J Mol Biol 2000, 302:205–17. doi:10.1006/jmbi.2000.4042
14. Casbon J, Saqi M: S4: Structure-based Sequence-alignments of Scop Superfamilies. To appear in Nucleic Acids Research Database Issue 2005.
15. Taylor W: The classification of amino acid conservation. J Theor Biol 1986, 119(2):205–18.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.