TPRpred: a tool for prediction of TPR-, PPR- and SEL1-like repeats from protein sequences
BMC Bioinformatics volume 8, Article number: 2 (2007)
Solenoid repeat proteins of the Tetratrico Peptide Repeat (TPR) family are involved as scaffolds in a broad range of protein-protein interactions. Several resources are available for the prediction of TPRs, however, they often fail to detect divergent repeat units.
We have developed TPRpred, a profile-based method which uses a P-value-dependent score offset to include divergent repeat units and which exploits the tendency of repeats to occur in tandem. TPRpred detects not only TPR-like repeats, but also the related Pentatrico Peptide Repeats (PPRs) and SEL1-like repeats. The corresponding profiles were generated through iterative searches, by varying the threshold parameters for inclusion of repeat units into the profiles, and the best profiles were selected based on their performance on proteins of known structure. We benchmarked the performance of TPRpred in detecting TPR-containing proteins and in delineating the individual repeats therein, against currently available resources.
TPRpred performs significantly better in detecting divergent repeats in TPR-containing proteins, and finds more individual repeats than the existing methods. The web server is available at http://tprpred.tuebingen.mpg.de, and the C++ and Perl sources of TPRpred along with the profiles can be downloaded from ftp://ftp.tuebingen.mpg.de/ebio/protevo/TPRpred/.
Solenoid repeat proteins have recently attracted interest because of their versatility as scaffolds for the engineering of protein-protein interactions . This class of proteins is characterized by homologous, repeating structural units, which stack together to form an open-ended superhelical structure. Such an arrangement is in contrast to the structure of most proteins, which fold into a compact shape . Solenoid structures adopt a variety of shapes, depending on the structural features of the repeating structural unit and the arrangement of individual units in the solenoid. The curvature created by the superhelical nature of these proteins predetermines the target proteins that can bind to them . The Tetratrico Peptide Repeats (TPRs) together with their related repeats, the Pentatrico Peptide Repeats (PPRs) and the SEL1-like repeats, form a large family within the solenoid repeat proteins. The repeating unit of TPRs, PPRs and SEL1-like repeats are formed of two or more stacked 34, 35 and 36-amino acid αα-hairpin repeat units, respectively [4–6]. These solenoid repeat proteins are involved in a diverse spectrum of cellular functions such as cell cycle control, transcription, splicing, protein import, regulatory phosphate turnover and protein folding, by virtue of their tendency to bind target proteins [5, 7, 8].
Homologous structural repeat units are often highly divergent at the sequence level, a feature that makes their prediction challenging. Currently, several web-based resources are available for the detection of TPRs, including Pfam , SMART , and REP . These resources use hidden Markov model (HMM) profiles or sequence profiles, which are constructed from the repeats trusted to belong to the family. However, the profiles used are constructed from closely homologous repeats; therefore, divergent repeat units often get a negative score and are not considered in computing the overall statistical significance, even though they are individually significant. For this reason Pfam, SMART, and REP perform with limited accuracy in detecting remote homologs of known TPR-containing proteins and in delineating the individual repeats within a protein [12, 13]. For example, TPR-like repeats from the central domain of MalT protein [E. coli;PDB:1HZ4] are not detected by these resources. MalT is the transcription regulator of the maltose regulon, which is responsible for the uptake and catabolism of malto-oligosaccharides in E. coli . In order to predict such divergent repeats, we have developed a specialized tool (TPRpred), which is able to predict TPR-, PPR- and SEL1-like repeats from protein sequences. The advantages of our method are the following:
We construct optimized profiles through iterative searches by varying the threshold for inclusion of repeats into the profiles.
We apply a score offset in such a way that repeats with P-value ≤ 0.01 will get a positive score. Therefore, even marginally significant repeats will contribute to the whole-protein P-value.
Putative repeat units located near an already identified repeat get a tight-fit reward in order to account for the tendency of repeats to occur in tandem.
Our tool reports not only P-values, based on the score distribution of true negatives derived from the known protein structures, but also computes a probability that a target sequence is a TPR protein.
Given a query sequence of length L and a sequence profile of length W representing a single repeat unit, TPRpred finds the best-scoring alignment of the sequence with an integer number of repeats, each of them aligned without internal gaps using standard log-odds scoring. Tandem repeats with a gap of ≤ K residues are rewarded with r bits, while gaps of > K residues are penalized with g bits (K = 10 and g = 0 in our benchmarks).
Since no internal gaps are allowed within repeats, the score distribution of the repeat profile with equal-length windows of unrelated sequences has an almost perfect Gaussian distribution. (The score is a sum of W independent random variables and therefore it approaches a Gaussian according to the central limit theorem.) The σ and μ parameters of this distribution are derived from a calibration search against a database of unrelated protein sequences from the SCOP database . The tails of a Gaussian distribution approach zero much faster than the tails of a Gumbel distribution (which would be obtained if internal gaps were allowed). Therefore, the same positive score of a true repeat unit will generally have a much higher significance in the case of a Gaussian as compared to a Gumbel distribution. Hence, the restriction of ungapped repeats increases the sensitivity of TPRpred for detecting ungapped repeat families such as TPR-, PPR- and, SEL1-like repeat proteins and others with duplicated helical hairpins.
If the reward r for closely spaced repeat units is set low (e.g. zero) then one will fail to detect many repeats if their score is below zero. This is the case for the HMMER software , where often repeat instances have scores below zero even though their P-values are significant (e.g. below 0.01). Since alignment algorithms find the alignment with maximum score, they will skip repeat instances that are assigned negative scores. On the other hand, if r is set high, many false positive repeat units will be found within K residues of an already ascertained repeat unit. We therefore set the reward r such that the probability of finding a false positive repeat instance within K residues of another repeat is p r = 0.01. In the appendix, it is shown that this requires to set the tight fit reward r to
Here erfc-1 is the inverse of the complementary error function, and σ and μ are derived from the calibration of the profile as explained before.
To further increase sensitivity, we add an offset c to the repeat unit match score in such a way that the probability for the observation of a repeat in an unrelated database protein of length L is equal to p c = 0.01. In the appendix it is shown that this requires to set the offset c to
This ensures that even repeat units with no neighbours within K residues will get detected, if their P-value is better than 0.01, independent of the original score baseline (which depends on a null model that is not appropriate for this purpose). At the same time, this global offset guarantees that only very rarely (with probability ≈ 10-4) TPRpred will find more than one false positive repeat unit in an unrelated protein. TPRpred not only computes P-values, which are solely based on the true negative score distribution, but is also able to report the probability that a target sequence is a true homolog, by making use of both the true positive and true negative score distributions. In addition, TPRpred is able to calculate more realistic (i.e. less optimistic) E-values, by calibrating with true negative sequences as opposed to random sequences. The algorithm has been implemented as a computer program "TPRpred", written in C++ (a Perl version is also available). The profiles used by TPRpred are generated by the program ppmake in the TPRpred software package. The Henikoff and Henikoff sequence weighting and pseudocounts are added in a way completely analogous to the procedure used in PSI-BLAST software package , except that the Gonnet matrix is used instead of BLOSUM62. The tool has been tested on a GNU/Linux platform with a i386 processor architecture.
Results and discussion
Definition of TPR-like and non-TPR-like proteins
We define the positive (i.e. the TPR-like) and the negative (i.e. non-TPR-like) set of protein sequences by reference to a set of 13 bona fide TPR-like domains. These are the domains contained in the "TPR-like" superfamily [a.118.8] of the SCOP database (version 1.69) , which consists of the TPR family and the MalT protein. (We use a SCOP version filtered to 70% maximum pairwise sequence identity, available from the ASTRAL server .) The SCOP classification of MalT as TPR-like is supported both by structural and sequence similarity: (1) A DALI search  with the MalT structure [PDB:1HZ4] for structural neighbors yields ten SCOP domains above Z-score of 10, all of them from the TPR family in SCOP (supplementary material, see the file "Additional File 1"). (2) Furthermore, a search with the remote homology prediction server HHpred [20, 21] through the SCOP database readily yields TPRs as closest relatives (supplementary material, see the file "Additional File 1"). To take into account more recent TPR structures not yet contained in SCOP v1.69, we used DALI to search the PDB database (version of December 2005) with the 13 bona fide TPR-like repeat domains as defined by SCOP. We included all structures into our true positve set that obtained a Z-score of at least 10 with one or more of the bona fide TPR-like repeat domains.
The true negative is defined conservatively to include all sequences in SCOP vl.69 (filtered to 70%) which have no Z-score better than 5 with any of the 13 bona fide TPR-like repeat domains (supplementary material, see the file "Additional File 2"). This ensures that marginal cases of proteins which can be neither classified safely as TPR-like nor as non-TPR-like will be ignored in the benchmark.
Profile generation and test set
The performance in the high-selectivity regime of sequence profiles depends on the number of close homologs, whereas the performance in the high-sensitivity regime depends on the number of remote homologs used in constructing the profiles. Relaxing the threshold value to include remote homologs often results in false positives. To optimize the trade-off between remote homologs and false positives, we have constructed a series of TPR profiles. These profiles were generated by iterative searches against the non-redundant (NR) database at NCBI, filtered to a maximum pairwise sequence identity of 70% (NR-70) by CD-HIT [22, 23]. Prior to the searches we broadly removed homologs of MalT [GI:16131294], which we intended to use as a test set, from the NR-70 database using three iterations of PSI-BLAST  at an E-value cutoff of 1.
Homologs of MalT contain divergent TPR units and therefore represent a challenging test set. These proteins belong to the STAND family of ATPases [24, 25], which themselves are part of the AAA+ superfamily . We extracted these sequences conservatively with PSI-BLAST (two iterations, E-value cutoff of 10-4) from NR-70, using the central domain of MalT [GI:17942835] as a query sequence. Using the defining characteristic of STAND proteins, namely an N-terminal P-loop NTPase domain, as a criterion we selected 56 proteins for the test set. The sequences of these proteins are given in the supplementary material (see the file "Additional File 3").
We performed iterative searches to convergence on NR-70 minus STAND proteins with various threshold parameters (whole-protein E-value, and single-repeat P-value). The initial searches were seeded with a manually prepared structure-based sequence alignment of known TPR protein structures (supplementary material, see the file "Additional File 4"). We tested the resulting profiles on the STAND family, TPR family, and the true negative set. The best profile was selected based on its performance on the STAND family, as illustrated in Figure 1.
Further, we built the PPR and the SEL1-like profiles by using the same procedure and cutoff value as for the TPR profile.
We benchmarked our method and the web-server against Pfam, SMART and REP.
Comparison of TPRpred and HMMER
To demonstrate the sensitivity/selectivity of TPRpred against HMMER (version 2.3, default parameters), which is the underlying method employed by the Pfam and SMART web-servers, we benchmarked the performance of both these methods, and the results are shown using the receiver operating characteristic (ROC) plot as illustrated in Figure 2. We could not benchmark against REP, because the stand-alone version is not available. The data sets for the benchmark were obtained using the same true positive and true negative sets which we defined in the profile generation section, but with a 25% maximum sequence identity. In order to enrich these data sets with reliable homologs, two iterations of PSI-BLAST searches were performed for each domain sequence. The first iteration was performed on the NR-90 database. The hits with an E-value ≤ 10-3 and ≥ 85% coverage to the query sequence were extracted into a multiple alignment, that was used to jump-start the second iteration against the NR-70 database. The same selection criteria as in the first iteration were applied in obtaining the homologs for the query. The resulting enriched data sets were simultaneously filtered to a 50% maximum sequence identity using CD-HIT to reduce the redundancy.
Both methods were used to perform searches through the true positive and true negative data sets, using their own TPR profiles or HMMs. The ROC plot shows that TPRpred detects more sequences with E-value better than the first false positive compared to HMMER. However, for lower selectivity TPRpred performance is comparable to HMMER.
Comparison of the web-servers using STAND family members
To assess the sensitivity of TPRpred in detecting divergent TPR units over Pfam (version 20.0 of May 2006), SMART (5.0), and REP (1.1), we evaluated the performance of the web-servers using the STAND family test set. Additionally, we also used 53 true negative sequences by selecting arbitrarily from the all-α class of the SCOP database. The sequences of these proteins are given in the supplementary material (see the file "Additional File 5"). The hits that were confidently predicted according to the web-servers for the STAND proteins are tabulated in Table 1. None of the servers detected false positives from the true negative sequences (data not shown). This shows that all the servers are unbiased to the α-helical proteins which are unrelated.
TPRpred performs significantly better in detecting the TPR units from the members of the STAND family, although sequences of the STAND family members were explicitly excluded from our TPR profile. For instance, the 8 TPR units present in MalT  were detected only by our server. Overall, TPRpred detected twice as many proteins as TPR-containing proteins and over 6 fold more individual repeats as the next best web-server, Pfam. This could be due to the more sensitive Gaussian scoring as well as the score base-line strategy employed by our tool.
Comparison of the web-servers using known protein structures
In order to assess the sensitivity of the web-servers in detecting the individual repeat units, we submitted the sequences of the TPR structure set, along with 2 SEL1-like repeat proteins classified under the HCP-like family [SCOP:a.118.18.1], to TPRpred, Pfam, SMART, and REP web-servers. The number of repeats detected confidently for each sequence are tabulated in Table 2 and the repeats detected only by TPRpred are shown in Figure 3. The TPR structure set contains both proteins that were present in the training databases of the individual methods (Table 2, top) and proteins whose structure became available subsequently (Table 2, bottom). All servers performed well on the former proteins, although TPRpred stood out with 100% detected individual repeats over the other servers, which only detected between 70% and 90%, but the real difference between servers became visible on the latter proteins. Here, TPRpred recognized all proteins as TPR-containing, whereas the other servers recognized less than half, and TPRpred detected 97% of individual repeats, whereas the other servers detected only about 54%.
Comparison of TPRpred, Pfam and SMART on the human proteome
To assess the global gain in the protein annotation of TPRpred over Pfam and SMART, we scanned a set of 37 444 sequences of the human proteome downloaded from Integr8 . The number of proteins and individual repeats detected confidently by TPRpred, Pfam and SMART are tabulated in Table 3. TPRpred detected more proteins as TPR-containing proteins and over 2 fold more individual repeats than Pfam and SMART.
TPRpred is a profile-sequence comparison method for predicting solenoid repeat proteins of TPRs, PPRs and SEL1-like repeats. It shows a marked improvement over existing methods, particularly in the detection of non-canonical, divergent repeats. We attribute this to the exploitation of simple traits such as the tendency of repeats to occur in tandem, robust statistical evaluations and the construction of profiles by iterative searches. The algorithmic improvements of the P-value-dependent score offset as well as the tight-fit reward are quite general and easily transferable to other repeat detection approaches.
Availability and requirements
Project name: TPRpred.
Project home page: http://tprpred.tuebingen.mpg.de/
Sources: The C++ and Perl source codes for TPRpred along with the profiles are freely available by anonymous ftp to ftp://ftp.tuebingen.mpg.de/ebio/protevo/TPRpred/
Operating systems: Linux, Unix.
Programming language: C++ and PERL.
Other requirements: The Perl script requires Perl interpreter version 5.8.5 or higher.
License: GNU GENERAL PUBLIC LICENSE http://www.gnu.org/licenses/gpl.txt
Any restrictions to use by non-academics: None.
First we show that, if the tight-fit reward r is calculated according to equation 1, the P-value to observe a second repeat unit within K residues from an existing one will be p r . To start, note that the P-value for observing a score S > s between the profile and an unrelated equal-length sequence window is
where erfc() is the complementary error function. Because the alignment between the profile and equal-length sequence windows is gap-free, the scores of neighbouring sequence windows can be assumed to be independent from each other. Hence, by elementary probability theory, the probability to obtain a score S i + r larger than zero at any of K start positions (i = 1,...,K) is
We now set this expression to p r , the P-value for observing a spurious second repeat within K residues of an already detected one. Solving for r yields equation 1.
Equation 2 can be proved analogously. A database protein of length L contains L - W + 1 windows of length W. The score between the profile and the i'th window is written as S i + c, which already includes the score offset c that needs to be determined. The probality that at least one of the scores is larger than zero is the same as in the previous equation when r is replaced by c, and K by L - W + 1. Setting the right-hand expression equal to p c and solving for c then yields equation 2.
Main ERG, Lowe AR, Mochrie SGJ, Jackson SE, Regan L: A recurring theme in protein engineering: the design, stability and folding of repeat proteins. Curr Opin Struct Biol 2005, 15: 464–471. 10.1016/j.sbi.2005.07.003
Groves MR, Barford D: Topological characteristics of helical repeat proteins. Curr Opin Struct Biol 1999, 9: 383–389. 10.1016/S0959-440X(99)80052-9
Kobe B, Kajava AV: When protein folding is simplified to protein coiling: the continuum of solenoid protein structures. Trends Biochem Sci 2000, 25: 509–515. 10.1016/S0968-0004(00)01667-4
D'Andrea LD, Regan L: TPR proteins: the versatile helix. Trends Biochem Sci 2003, 28: 655–662. 10.1016/j.tibs.2003.10.007
Small ID, Peeters N: The PPR motif – a TPR-related motif prevalent in plant organellar proteins. Trends Biochem Sci 2000, 25: 46–47. 10.1016/S0968-0004(99)01520-0
Grant B, Greenwald I: The Caenorhabditis elegans sel-1 gene, a negative regulator of lin-12 and glp-1, encodes a predicted extracellular protein. Genetics 1996, 143: 237–247.
Kotera E, Tasaka M, Shikanai T: A pentatricopeptide repeat protein is essential for RNA editing in chloroplasts. Nature 2005, 433: 326–330. 10.1038/nature03229
Scheufler C, Brinker A, Bourenkov G, Pegoraro S, Moroder L, Bartunik H, Hartl FU, Moarefi I: Structure of TPR domain-peptide complexes: critical elements in the assembly of the Hsp70-Hsp90 multichaperone machine. Cell 2000, 101: 199–210. 10.1016/S0092-8674(00)80830-2
Sonnhammer EL, Eddy SR, Durbin R: Pfam: a comprehensive database of protein domain families based on seed alignments. Proteins 1997, 28: 405–420. 10.1002/(SICI)1097-0134(199707)28:3<405::AID-PROT10>3.0.CO;2-L
Schultz J, Milpetz F, Bork P, Ponting CP: SMART, a simple modular architecture research tool: identification of signaling domains. Proc Natl Acad Sci USA 1998, 95: 5857–5864. 10.1073/pnas.95.11.5857
Andrade MA, Ponting CP, Gibson TJ, Bork P: Homology-based method for identification of protein repeats using statistical significance estimates. J Mol Biol 2000, 298: 521–537. 10.1006/jmbi.2000.3684
Steegborn C, Danot O, Huber R, Clausen T: Crystal structure of transcription factor MalT domain III: a novel helix repeat fold implicated in regulated oligomerization. Structure (Camb) 2001, 9: 1051–1060. 10.1016/S0969-2126(01)00665-7
Dohm JA, Lee SJ, Hardwick JM, Hill RB, Gittis AG: Cytosolic domain of the human mitochondrial fission protein fis1 adopts a TPR fold. Proteins 2004, 54: 153–156. 10.1002/prot.10524
Boos W, Shuman H: Maltose/maltodextrin system of Escherichia coli: transport, metabolism, and regulation. Microbiol Mol Biol Rev 1998, 62: 204–229.
Murzin AG, Brenner SE, Hubbard T, Chothia C: SCOP: a structural classification of proteins database for the investigation of sequences and structures. J Mol Biol 1995, 247: 536–540. 10.1006/jmbi.1995.0159
HMMER: profile HMMs for protein sequence analysis[http://hmmer.wustl.edu/]
Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 1997, 25: 3389–3402. 10.1093/nar/25.17.3389
Chandonia JM, Hon G, Walker NS, Lo Conte L, Koehl P, Levitt M, Brenner SE: The ASTRAL Compendium in 2004. Nucleic Acids Res 2004, 32: 189–192. 10.1093/nar/gkh034
Holm L, Sander C: Protein structure comparison by alignment of distance matrices. J Mol Biol 1993, 233: 123–138. 10.1006/jmbi.1993.1489
Soding J: Protein homology detection by HMM-HMM comparison. Bioinformatics 2005, 21: 951–960. 10.1093/bioinformatics/bti125
Soding J, Biegert A, Lupas AN: The HHpred interactive server for protein homology detection and structure prediction. Nucleic Acids Res 2005, 33: 244–248. 10.1093/nar/gki408
Li W, Jaroszewski L, Godzik A: Clustering of highly homologous sequences to reduce the size of large protein databases. Bioinformatics 2001, 17: 282–283. 10.1093/bioinformatics/17.3.282
Li W, Jaroszewski L, Godzik A: Tolerating some redundancy significantly speeds up clustering of large protein databases. Bioinformatics 2002, 18: 77–82. 10.1093/bioinformatics/18.1.77
De Schrijver A, De Mot R: A subfamily of MalT-related ATP-dependent regulators in the LuxR family. Microbiology 1999, 145(Pt 6):1287–1288.
Leipe DD, Koonin EV, Aravind L: STAND, a class of P-loop NTPases including animal and plant regulators of programmed cell death: multiple, complex domain architectures, unusual phyletic patterns, and evolution by horizontal gene transfer. J Mol Biol 2004, 343: 1–28. 10.1016/j.jmb.2004.08.023
Ammelburg M, Frickey T, Lupas AN: Classification of AAA+ proteins. J Struct Biol 2006, 156: 2–11.
Kersey P, Bower L, Morris L, Horne A, Petryszak R, Kanz C, Kanapin A, Das U, Michoud K, Phan I, Gattiker A, Kulikova T, Faruque N, Duggan K, Mclaren P, Reimholz B, Duret L, Penel S, Reuter I, Apweiler R: Integr8 and Genome Reviews: integrated views of complete genomes and proteomes. Nucleic Acids Res 2005, 33: 297–302. 10.1093/nar/gki039
Kraulis PJ: MOLSCRIPT: a program to produce both detailed and schematic plots of protein structures. J App Cryst 1991, 24: 946–950. 10.1107/S0021889891004399
Merritt EA, Murphy ME: Raster3D Version 2.0. A program for photorealistic molecular graphics. Acta Crystallogr D Biol Crystallogr 1994, 50: 869–873. 10.1107/S0907444994006396
We thank Christian Mayer for re-coding the Perl script in C++ and Andreas Biegert for integrating the program into the MPI Bioinformatics Toolkit. Funding by the Max Planck-society is gratefully acknowledged.
JS developed the algorithm and programmed the Perl version. MRK was involved in the analysis and interpretation of the data, wrote the wrapper program for the web-interface and drafted the manuscript. ANL supervised the overall work. ANL and JS critically revised the manuscript. All authors read and approved the final manuscript.
Electronic supplementary material
Additional File 1: Relatives of MalT by structure and sequence comparison. DALI and HHpred search results for MalT [PDB:1HZ4] (PDF 22 KB)
Additional File 2: Structural neighbours of TPRs. Structural neighbours of known TPRs according to the DALI structure comparison server. The structures with Z scores ≥ 5 are tabulated. The PDB codes were mapped on to the SCOP domain database. (PDF 11 KB)
Additional File 4: Structure-based sequence alignments. Structure-based sequence alignments for TPR and SEL1-like repeat families. (PDF 11 KB)
Additional File 5: True negative data set used in servers benchmarking. Arbitrarily selected 53 true negative sequences from the all-α class of the SCOP database. (PDF 25 KB)
About this article
Cite this article
Karpenahalli, M.R., Lupas, A.N. & Söding, J. TPRpred: a tool for prediction of TPR-, PPR- and SEL1-like repeats from protein sequences. BMC Bioinformatics 8, 2 (2007). https://doi.org/10.1186/1471-2105-8-2
- Repeat Unit
- Scop Database
- Iterative Search
- Remote Homolog
- Pentatrico Peptide Repeat