- Open Access
BioShell-Threading: versatile Monte Carlo package for protein 3D threading
© Gniewek et al.; licensee BioMed Central Ltd. 2014
- Received: 25 November 2012
- Accepted: 18 November 2013
- Published: 20 January 2014
The comparative modeling approach to protein structure prediction inherently relies on a template structure. Before a model can be built, such a template protein has to be found and aligned with the query sequence. Any error made at this stage may dramatically affect the quality of the result. There is a need, therefore, to develop accurate and sensitive alignment protocols.
BioShell threading software is a versatile tool for aligning protein structures, protein sequences or sequence profiles, and query sequences to template structures. The software is also capable of generating sub-optimal alignments. It can be executed as an application from the UNIX command line, or as a set of Java classes called from a script or a Java application. The implemented Monte Carlo search engine greatly facilitates the development and benchmarking of new alignment scoring schemes, even when optimizing the scoring function is an NP-hard problem.
Numerical experiments indicate that the new threading application offers template detection abilities and provides much better alignments than other methods. The package along with documentation and examples is available at: http://bioshell.pl/threading3d.
- Monte Carlo
- Structural Alignment
- Template Structure
- Protein Structure Prediction
- Sequence Profile
Protein structure prediction has become one of the key tasks in computational biology of the post-genomic era. Due to the growing size of structural databases, the most important and widely used method is homology modeling. This methodology relies on the existence of structures of homologous protein(s) in databases. The major parts of this procedure are i) recognition of homology between two proteins and ii) correct alignment of the two proteins for which homology was recognized. Here we focus on the latter, still challenging problem. Accurate alignment is essential for many state-of-the-art 3D protein structure prediction algorithms [1–4]. The development of novel threading algorithms is, however, hindered by i) the lack of a general consensus on scoring schemes and ii) a plethora of different variants of the same scoring function that are described in the literature but not available as ready-to-use software.
Our contribution presents a versatile tool for the fast and extensive alignment of two proteins with each other. The alignment can be based on i) the two sequences, ii) one sequence and one structure or iii) the two structures. The first case, corresponding to pairwise sequence alignment, is trivial and can be solved by dynamic programming. However, the other two (corresponding to 3D threading and structure alignment, respectively) are NP-hard problems. Our novel object-oriented application, incorporated within the BioShell package [6, 7], is an integrated framework to heuristically tackle these protein-to-protein alignment problems. The application is written in Java, which facilitates its use on various systems and architectures. The advantages and novelties of the software over the existing and downloadable ones [8–12] are:
it employs Monte Carlo (MC) to sample the alignment space so an approximate solution to NP-hard 3D threading problem can be found,
each scoring term type is a separate object that can be easily switched on/off and fully customized with user-provided data, e.g. in a single run, several secondary structure similarity scores may be used, each of them based on a different secondary structure prediction,
new potentials and scoring schemes can be easily implemented by users,
as a result, the user obtains the best scoring solution and a number of suboptimal alignments, ranked by their score; the alignments can be written to a file in the Modeller format and easily used to build final model structures,
it can be used as structure alignment software, also capable of producing suboptimal structure alignments,
it can read in and score any arbitrary alignment provided by the user, which can be very helpful in the manual refinement of alignments or in threading force field development.
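The idea of scoring terms as separate, switchable objects can be illustrated with a minimal sketch. The interface `ScoreTerm`, the class `WeightedScore` and the alignment encoding (an array giving, for every query position, the aligned template index or -1 for a gap) are hypothetical names and conventions chosen for illustration; they are not the BioShell API.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical interface: one scoring term evaluated on a whole alignment. */
interface ScoreTerm {
    /** alignment[q] is the template index aligned to query position q, or -1 for a gap. */
    double evaluate(int[] alignment);
}

/** Weighted sum of scoring terms that can be switched on and off independently. */
class WeightedScore {
    private static final class Entry {
        final ScoreTerm term;
        final double weight;
        boolean enabled = true;

        Entry(ScoreTerm term, double weight) {
            this.term = term;
            this.weight = weight;
        }
    }

    private final List<Entry> entries = new ArrayList<>();

    /** Registers a term with its weight and returns an id used to toggle it later. */
    int add(ScoreTerm term, double weight) {
        entries.add(new Entry(term, weight));
        return entries.size() - 1;
    }

    void setEnabled(int id, boolean on) {
        entries.get(id).enabled = on;
    }

    /** Total score: sum of weight * term over all enabled terms. */
    double total(int[] alignment) {
        double sum = 0.0;
        for (Entry e : entries) {
            if (e.enabled) {
                sum += e.weight * e.term.evaluate(alignment);
            }
        }
        return sum;
    }
}
```

Under this assumed design, two secondary structure terms built from different predictions would simply be two `ScoreTerm` instances registered with separate weights.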
The project website provides extensive documentation of the library (API) and numerous examples which show how to run the executable threading application and how to interact with the software library.
The source code is divided into four main blocks: i) encoding an alignment as system coordinates, ii) moves (alignment modifications), iii) scoring and iv) gathering results. Each of these components forms a separate sub-package in the source code tree: jbcl.simulations.threading, jbcl.simulations.threading.movers, jbcl.simulations.threading.ff and jbcl.simulations.threading.observers, respectively. These routines are supported by other generic BioShell components such as Monte Carlo sampling and I/O operations. For the user’s convenience, we also provide a stand-alone application. To run calculations, the user specifies: i) input data, ii) the modification scheme and the parameters of MC sampling and iii) the scoring function (force field).
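How the four blocks interact can be sketched as a single Metropolis loop. This is a simplified illustration, not the package's implementation: the class `MetropolisThreadingSketch`, the alignment encoding (template index per query position, -1 for a gap) and the single-position move are assumptions made for the example.

```java
import java.util.Random;
import java.util.function.ToDoubleFunction;

/** Minimal Metropolis search over alignments (illustrative names and move set). */
class MetropolisThreadingSketch {

    /** An alignment is valid if aligned template indices strictly increase along the query. */
    static boolean isValid(int[] alignment) {
        int last = -1;
        for (int t : alignment) {
            if (t < 0) continue;         // gap in the template
            if (t <= last) return false; // residue order violated
            last = t;
        }
        return true;
    }

    /** Runs the sampling and returns the lowest-scoring alignment that was visited. */
    static int[] sample(int[] start, int templateLength, ToDoubleFunction<int[]> score,
                        double temperature, int steps, long seed) {
        Random rng = new Random(seed);
        int[] current = start.clone();            // i) system coordinates
        double eCurrent = score.applyAsDouble(current);
        int[] best = current.clone();
        double eBest = eCurrent;
        for (int s = 0; s < steps; s++) {
            // ii) move: re-assign one query position to a random template index or a gap
            int q = rng.nextInt(current.length);
            int old = current[q];
            current[q] = rng.nextInt(templateLength + 1) - 1; // range -1 .. templateLength-1
            if (!isValid(current)) {
                current[q] = old;                 // reject moves that break residue order
                continue;
            }
            // iii) scoring and the Metropolis acceptance test
            double eNew = score.applyAsDouble(current);
            double dE = eNew - eCurrent;
            if (dE <= 0 || rng.nextDouble() < Math.exp(-dE / temperature)) {
                eCurrent = eNew;
                if (eNew < eBest) {               // iv) "observer": remember the best alignment
                    eBest = eNew;
                    best = current.clone();
                }
            } else {
                current[q] = old;                 // restore the previous alignment
            }
        }
        return best;
    }
}
```

The sketch re-evaluates the full score after every move for clarity; a production code would typically update only the terms affected by the move whenever the scoring function allows it.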
Potentials implemented within BioShell-threading
Base class for all alignment scores
Score depends solely on a single position in a target and the aligned position in a template
Knows about target and template atomic coordinates
Pairwise per-position energy may be pre-calculated and stored in a 2D array
Energy that depends on a template contact map
Contact based energy that is pairwise-decomposable, but can’t be precalculated (otherwise it would result in BigMatrixEnergy score)
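The difference between a score that can be pre-calculated into a 2D array and a contact-based score that cannot may be easier to see in code. The sketch below is an assumption-laden illustration: the class `PairScoreSketch`, its method names and the residue-type encoding are hypothetical and do not reproduce the BioShell classes listed above.

```java
/** Illustrative stand-ins for a pre-calculated score and a contact-based score. */
class PairScoreSketch {
    /** matrix[q][t]: score of placing query position q on template position t. */
    private final double[][] matrix;

    PairScoreSketch(double[][] matrix) {
        this.matrix = matrix;
    }

    /** Per-position score: a plain sum of table look-ups along the alignment. */
    double evaluate(int[] alignment) {
        double sum = 0.0;
        for (int q = 0; q < alignment.length; q++) {
            int t = alignment[q];
            if (t >= 0) sum += matrix[q][t]; // -1 marks a gap and contributes nothing
        }
        return sum;
    }

    /** Score change when query position q moves from oldT to newT, computed in O(1). */
    double delta(int q, int oldT, int newT) {
        double before = oldT >= 0 ? matrix[q][oldT] : 0.0;
        double after = newT >= 0 ? matrix[q][newT] : 0.0;
        return after - before;
    }

    /**
     * Contact-based score: every term depends on TWO aligned query positions,
     * so it cannot be tabulated per (query, template) position pair in advance.
     */
    static double contactEnergy(int[] alignment, boolean[][] templateContacts,
                                int[] queryTypes, double[][] pairEnergy) {
        double sum = 0.0;
        for (int q1 = 0; q1 < alignment.length; q1++) {
            if (alignment[q1] < 0) continue;
            for (int q2 = q1 + 1; q2 < alignment.length; q2++) {
                if (alignment[q2] < 0) continue;
                if (templateContacts[alignment[q1]][alignment[q2]]) {
                    sum += pairEnergy[queryTypes[q1]][queryTypes[q2]];
                }
            }
        }
        return sum;
    }
}
```

The per-position score is the kind of function that dynamic programming can optimize exactly, whereas the contact term couples distant alignment columns and calls for a heuristic search.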
Here we use the application in two real-life examples to demonstrate the robustness and possible applications of the software. The scripts used in the experiments, together with the relevant input data, are published on the project website.
Threading as a structural alignment algorithm
Quality of query sequence-template structure alignments
To test the Threading algorithm on more realistic problems, the MALIDUP benchmark has been used. Results are shown in Figure 5B. The MALIDUP benchmark comprises 241 protein pairs of diverged duplicated domains. It was chosen because the evolutionary relation between the domains under consideration is reasonably well established and not biased by sequence similarity. The 3D-Threading algorithm was compared with four other methods: global sequence alignment with the BLOSUM62 matrix (optimized in ), profile-to-profile alignment (optimized in ) with the PICASSO3 scoring function, Threading1D [38, 41] and the widely used, state-of-the-art method HHAlign [10, 42, 43]. For the profile-to-profile, threading 1D and threading 3D runs, sequence profiles were generated with five PsiBlast iterations against the NR90 database with an e-value threshold below 0.00001. For the HHAlign algorithm, multiple sequence alignments (MSAs) were created in the local searching mode with hhblits on the NR20 database created on January 10, 2011. Subsequently, these MSAs were used to align query and template sequences with hhalign in the global alignment mode.
The following scoring terms were used: EnvScore, ProbabilisticSecondaryScore, PICASSO3, GoLikeScore, TwoBodyContact with Miyazawa-Jernigan contact scoring and StrcGapPenalty. The respective weights, 0.1, 0.25, 0.5, 1.0, 0.4 and 0.15, were optimized on the ProSup dataset. This test assessed the quality of the calculated alignments, i.e. the average TM-score and the alignment overlap with manually curated alignments. The latter was measured as the percent of correctly aligned positions (AL0P) and the fraction of aligned positions predicted with an error of at most four positions (AL4P). For the MALIDUP set it can be observed that the threading algorithm, which incorporates 3D information from the template structure (column ‘R’ in Figure 5B), performs better than the other tested algorithms, both in terms of average TM-score and overlap with manual alignments (as assessed by the AL0P and AL4P scores). The profile-based aligners, Threading 1D and HHAlign, perform comparably on these benchmarks, whereas BioShell-Threading performs much better. In particular, when compared to HHAlign (column ‘H’), it achieves approximately 0.09 higher AL0P and 0.24 higher AL4P. This is partially due to the fact that, in the case of unrecognized homology, HHAlign returns a null alignment (which affects the averaged score value).
There are several possible applications of this result. It can be used to generate alignment boundaries for protein modeling algorithms that can use such information [32, 46]. In this case the alignment boundary is, for every query residue, the range of template positions to which it can align. It is also possible, using sub-optimal alignments, to create more diverse spatial constraints for algorithms such as Modeller.
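The overlap measures AL0P and AL4P described above can be sketched as a single routine parameterized by the allowed shift. The normalization used here (the number of positions aligned in the reference alignment) and the array encoding of an alignment are assumptions made for illustration; the class `AlignmentOverlap` is a hypothetical name and the benchmark scripts may define the measures differently.

```java
/** Sketch of an alignment-overlap measure with a tolerated shift (0 for AL0, 4 for AL4). */
class AlignmentOverlap {

    /**
     * predicted[q] and reference[q] hold the template index aligned to query
     * position q, or -1 for a gap. Returns the fraction of reference-aligned
     * positions that the prediction places within maxShift template positions.
     */
    static double fractionWithinShift(int[] predicted, int[] reference, int maxShift) {
        if (predicted.length != reference.length) {
            throw new IllegalArgumentException("alignments must cover the same query");
        }
        int referenceAligned = 0;
        int hits = 0;
        for (int q = 0; q < reference.length; q++) {
            if (reference[q] < 0) continue;   // position not aligned in the reference
            referenceAligned++;
            if (predicted[q] >= 0 && Math.abs(predicted[q] - reference[q]) <= maxShift) {
                hits++;
            }
        }
        return referenceAligned == 0 ? 0.0 : (double) hits / referenceAligned;
    }
}
```

Calling `fractionWithinShift(predicted, reference, 0)` would give an AL0-like value and a shift of 4 an AL4-like value, both in the range from 0 to 1.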
The computational approach used in this contribution is an example of a stochastic simulation rather than a typical alignment method. The user has to define a number of parameters to control this process, such as the number of Monte Carlo replicas and the respective set of their temperatures. Fortunately, several methods have been devised for selecting REMC parameters, e.g. [32, 47]. In general, these parameters depend on both the query and the template protein and should be optimized separately for each case. However, for the sake of simplicity, for every benchmark calculation presented in this contribution we used the same set of ten replicas as described above. This temperature set is wide enough to obtain good results for all the test cases but inevitably increases the computational effort. Optimization of these parameters might also occasionally lead to better alignments. However, even for the optimal set of parameters it takes from several minutes to more than an hour to reliably sample the low energy area of the alignment space. The three-dimensional threading Monte Carlo simulation will always be at least an order of magnitude slower than a dynamic programming calculation, but it is usually faster than RAPTOR, another three-dimensional threading method, for which calculations even for short sequences take more than an hour. The RAPTOR method, however, employs a branch-and-bound approach and, unlike a stochastic simulation, is guaranteed to reach the global optimum of the scoring function. Other parameters the user should optimize are the scoring function weights and the probabilities of particular alignment modifications (i.e. moves). An extensive study of this parameter space is beyond the scope of this paper and will be described elsewhere. In this work, to avoid overtraining, we optimized the scoring function and the set of movers on the ProSup data set, which has not been used for benchmarking purposes.
The 3D threading application can also be used as a structural alignment method. In the presented benchmark, it has been compared with TMAlign and yielded nearly the same results. The CPU time required by TMAlign was, however, about two orders of magnitude shorter (minutes to an hour for the threading vs. seconds for TMAlign). This result is a direct consequence of the number of times each of the two programs calls the TM-score evaluation routine. In order to test the convergence of the threading, calculations were started from a random alignment. During the simulation the TM-score has to be evaluated at every Monte Carlo move and, unlike scores derived from ByAtomEnergy, it cannot be recalculated locally just for the moved block. TMAlign, on the other hand, starts from an alignment computed by a dynamic programming procedure and evaluates the TM-score a few times until convergence is reached. The threading simulation, however, provides a number of different suboptimal alignments which all fall within a 0.01 range in TM-score units.
BioShell-Threading implements a three-dimensional protein threading algorithm based on a Monte Carlo search scheme. The code has been written in Java, which makes it virtually machine independent. It implements numerous scoring (energy) functions. Some of them can be applied in regular dynamic programming. For others, the optimization becomes an NP-hard problem and demands more time-consuming methods (e.g. MC). The package provides a ready-to-use command-line application and a Java software library. This makes BioShell-Threading a component that can be very easily incorporated into larger protein structure prediction pipelines. By providing suboptimal alignments, the package can increase the accuracy of widely used protein folding software and protein structure classification methods. However, the main goal of BioShell-Threading is the refinement of query-to-template alignments. At a time when fold recognition methods are fast and quite accurate, alignment accuracy is the limiting factor. Thus, using fast algorithms [10, 41, 44] to search whole protein databases and then refining the top hits with a more sophisticated scoring function seems to be of great value to the protein modeling community.
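The role of the replica temperatures can be illustrated with the standard replica exchange acceptance rule, in which two replicas at temperatures T_i and T_j with energies E_i and E_j swap their configurations with probability min(1, exp[(1/T_i - 1/T_j)(E_i - E_j)]). The sketch below is a generic textbook formulation, assuming energies expressed in units where the Boltzmann constant is 1; the class `ReplicaExchangeSketch` is hypothetical and the temperature values used in the benchmarks are not reproduced here.

```java
import java.util.Random;

/** Generic replica exchange (parallel tempering) swap step; names are illustrative. */
class ReplicaExchangeSketch {

    /** Probability of exchanging configurations between replicas i and j. */
    static double swapProbability(double energyI, double temperatureI,
                                  double energyJ, double temperatureJ) {
        double delta = (1.0 / temperatureI - 1.0 / temperatureJ) * (energyI - energyJ);
        return Math.min(1.0, Math.exp(delta));
    }

    /**
     * Attempts one exchange between each pair of neighbouring replicas.
     * Only the energies are swapped here; a real simulation would swap the
     * alignments (or, equivalently, the temperatures) as well.
     * Returns the number of accepted exchanges.
     */
    static int attemptSwaps(double[] energies, double[] temperatures, Random rng) {
        int accepted = 0;
        for (int k = 0; k + 1 < energies.length; k++) {
            double p = swapProbability(energies[k], temperatures[k],
                                       energies[k + 1], temperatures[k + 1]);
            if (rng.nextDouble() < p) {
                double tmp = energies[k];
                energies[k] = energies[k + 1];
                energies[k + 1] = tmp;
                accepted++;
            }
        }
        return accepted;
    }
}
```

If neighbouring temperatures are too far apart, the swap probability becomes very small and the replicas stop exchanging information, which is why the temperature set is a parameter worth tuning.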
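Why the TM-score resists local updates becomes clearer from its definition. For a fixed superposition it is a sum over aligned residue pairs of 1/(1 + (d_i/d0)^2), normalized by the target length L, with d0 = 1.24 (L - 15)^(1/3) - 1.8; the reported TM-score is the maximum of this sum over superpositions. The sketch below evaluates only the fixed-superposition sum. The class `TmScoreSketch` is hypothetical, and the lower bound of 0.5 on d0 for very short chains is an assumption of this example.

```java
/** TM-score sum for one given superposition; the search over superpositions is omitted. */
class TmScoreSketch {

    /** Length-dependent distance scale d0, clamped from below for short chains (assumption). */
    static double d0(int targetLength) {
        double d = 1.24 * Math.cbrt(targetLength - 15.0) - 1.8;
        return Math.max(d, 0.5);
    }

    /**
     * distances[i] is the distance between the i-th pair of aligned residues
     * after superposition; unaligned residues simply contribute nothing.
     */
    static double tmScoreForSuperposition(double[] distances, int targetLength) {
        double scale = d0(targetLength);
        double sum = 0.0;
        for (double d : distances) {
            double ratio = d / scale;
            sum += 1.0 / (1.0 + ratio * ratio);
        }
        return sum / targetLength;
    }
}
```

Because a change in any alignment column can shift the optimal superposition, every distance d_i may change after a single move, so the whole sum has to be re-evaluated rather than only the moved block.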
Project name: Bioshell Threading 3D
Project home page: http://www.bioshell.pl/threading3d
Operating system(s): Platform independent
Programming language: Java
Other requirements: Java 1.4 or higher
License: Creative Commons by-nc-nd
Any restrictions to use by non-academics: licence required
We would like to acknowledge support from NIH Grants R01GM072014, R01GM073095, R01GM081680 and R01GM081680-S1. A. Kolinski acknowledges support from the Foundation for Polish Science TEAM project (TEAM/2011-7/6) co-financed by the European Regional Development Fund operated within the Innovative Economy Operational Program. D. Gront was supported by the Polish National Science Centre (NCN), grant no. DEC-2011/01/D/NZ2/07683. The computational part of this work was done using the computer cluster at the Computing Center of the Department of Chemistry, University of Warsaw.
- Sali A, Blundell TL: Comparative protein modelling by satisfaction of spatial restraints. J Mol Biol. 1993, 234: 779-815.
- Kolinski A: Protein modeling and structure prediction with a reduced representation. Acta Biochimica Polonica. 2004, 51: 349-371.
- Zhang Y: I-TASSER server for protein 3D structure prediction. BMC Bioinformatics. 2008, 9: 40.
- Kallberg M, Wang H, Wang S, Peng J, Wang Z, Lu H, Xu J: Template-based protein structure modeling using the RaptorX web server. Nat Protocols. 2012, 7: 1511-1522.
- Lathrop RH: The protein threading problem with sequence amino acid interaction preferences is NP-complete. Protein Eng. 1994, 7: 1059-68.
- Gront D, Kolinski A: BioShell - a package of tools for structural biology computations. Bioinformatics. 2006, 22: 621-622.
- Gront D, Kolinski A: Utility library for structural bioinformatics. Bioinformatics. 2008, 24: 584-585.
- Marti-Renom MA, Madhusudhan MS, Sali A: Alignment of protein sequences by their profiles. Protein Sci. 2004, 13: 1071-87.
- Zhou H, Zhou Y: Fold recognition by combining sequence profiles derived from evolution and from depth-dependent structural alignment of fragments. Proteins. 2005, 58: 321-328.
- Soding J: Protein homology detection by HMM–HMM comparison. Bioinformatics. 2005, 21: 951-960.
- Lobley A, Sadowski MI, Jones DT: pGenTHREADER and pDomTHREADER: new methods for improved protein fold recognition and superfamily discrimination. Bioinformatics. 2009, 25: 1761-1767.
- Chen H, Kihara D: Effect of using suboptimal alignments in template-based protein structure prediction. Proteins. 2010, 79: 315-34.
- Mirny LA, Shakhnovich EI: Protein structure prediction by threading. Why it works and why it does not. J Mol Biol. 1998, 283: 507-526.
- Metropolis N, Rosenbluth AW, Rosenbluth MN, Teller AH, Teller E: Equations of state calculations by fast computing machines. J Chem Phys. 1953, 21: 1087-1092.
- Kirkpatrick S, Gelatt CD, Vecchi MP: Optimization by simulated annealing. Science. 1983, 220: 671-680.
- Swendsen RH, Wang JS: Nonuniversal critical dynamics in Monte Carlo simulations. Phys Rev Lett. 1987, 58: 86-88.
- Zhang Y, Skolnick J: TM-align: a protein structure alignment algorithm based on the TM-score. Nuc Acids Res. 2005, 33: 2302-2309.
- Miyazawa S, Jernigan RL: Estimation of effective interresidue contact energies from protein crystal structures: quasi-chemical approximation. Macromolecules. 1985, 18: 534-552.
- Miyazawa S, Jernigan RL: Residue-residue potentials with a favorable contact pair term and an unfavorable high packing density term, for simulation and threading. J Mol Biol. 1996, 256: 623-644.
- Holm L, Sander C: Protein structure comparison by alignment of distance matrices. J Mol Biol. 1993, 233: 123-38.
- Kabsch W: A solution of the best rotation to relate two sets of vectors. Acta Crystallogr. 1976, 32: 922-923.
- Taketomi H, Ueda Y, Go N: Studies on protein folding, unfolding and fluctuations by computer simulation. Int J Pept Prot Res. 1975, 7: 445-459.
- Tegge A, Wang Z, Eickholt J, Cheng J: NNcon: Improved protein contact map prediction using 2D-recursive neural networks. Nucl Acids Res. 2009, 37: W515-W518.
- Godzik A, Kolinski A, Skolnick J: Are proteins ideal mixtures of amino acids? Analysis of energy parameter sets. Protein Sci. 1995, 4: 2107-17.
- Skolnick J, Jaroszewski L, Kolinski A, Godzik A: Derivation and testing of pair potentials for protein folding: when is the quasichemical approximation correct?. Protein Sci. 1997, 6: 676-688.
- Skolnick J, Kolinski A, Ortiz A: Derivation of protein-specific pair potentials based on weak sequence fragment similarity. Proteins. 2000, 38: 3-16.
- Vendruscolo M, Domany E: Pairwise contact potentials are unsuitable for protein folding. J Chem Phys. 1998, 109: 11101-11108.
- Eyal E, Frenkel-Morgenstern M, Sobolev YV, Pietrokovski S: A pair-to-pair amino acids substitution matrix and its applications for protein structure prediction. Proteins. 2007, 67: 142-53.
- Miyazawa S, Jernigan RL: Identifying sequence–structure pairs undetected by sequence alignments. Protein Eng. 2000, 13: 459-475.
- Bioshell’s Documentation website. [http://www.bioshell.pl/~git/biosimulations.doc/html]
- Chang I, Cieplak M, Dima R, Maritan A, Banavar JR: Protein threading by learning. Proc Natl Acad Sci. 2001, 98: 14350-14355.
- Gront D, Kolinski A: Efficient scheme for optimization of parallel tempering Monte Carlo method. J Phys: Condens Matter. 2007, 19 (3): 036225-036234. [http://dx.doi.org/10.1088/0953-8984/19/3/036225]
- Wang G, Dunbrack RL: Scoring profile-to-profile sequence alignments. Protein Sci. 2004, 13: 1612-1626.
- Mittelman D, Sadreyev R, Grishin NV: Probabilistic scoring measures for profile-profile comparison yield more accurate short seed alignments. Bioinformatics. 2003, 19: 1531-1539.
- Yona G, Levitt M: Within the twilight zone: a sensitive profile-profile comparison tool based on information theory. J Mol Biol. 2002, 315: 1257-7.
- Henikoff S, Henikoff JG: Amino acid substitution matrices from protein blocks. Proc Nat Ac Sci. 1992, 89: 10915-10919.
- Dayhoff MO, Schwartz RM: Chapter 22: A model of evolutionary change in proteins. Atlas of Protein Sequence and Structure. 1978.
- Gniewek P, Kolinski A, Gront D: Optimization of profile-to-profile alignment parameters for one-dimensional threading. J Comp Biol. 2012, 19: 879-886.
- Pandit SBB, Skolnick J: Fr-TM-align: a new protein structural alignment method based on fragment alignments and the TM-score. BMC Bioinformatics. 2008, 9: 531.
- Cheng H, Bong-Hyun K, Grishin NV: MALIDUP: A database of manually constructed structure alignments for duplicated domain pairs. Proteins. 2008, 70: 1162-6.
- Gront D, Blaszczyk M, Wojciechowski P, Kolinski A: Bioshell Threader: protein homology detection based on sequence profiles and secondary structure profiles. Nucl Acids Res. 2012, 23: 2522-2527.
- Farrar M: Smith-Waterman speeds database searches six times over other SIMD implementations. Bioinformatics. 2007, 23: 156-61.
- Remmert M, Biegert A, Hauser A, Soding J: HHblits: Lightning-fast iterative protein sequence searching by HMM-HMM alignment. Nat Methods. 2011, 25: 173-5.
- Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997, 25: 3389-3402.
- Lackner P, Koppensteiner WA, Sippl MJ, Domingues FS: ProSup: a refined tool for protein structure alignment. Protein Eng. 2000, 13: 745-752.
- Trojanowski S, Rutkowska A, Kolinski A: TRACER. A new approach to comparative modeling that combines threading with free-space conformational sampling. Act Bioch Pol. 2010, 57: 125-133.
- Trebst S, Troyer M, Hansmann UHEH: Optimized parallel tempering simulations of proteins. J Chem Phys. 2006, 124 (17): 174903-174908. [http://dx.doi.org/10.1063/1.2186639]
- Xu J, Li M, Kim D, Xu Y: RAPTOR: optimal protein threading by linear programming. J Bioinform Comput Biol. 2003, 1: 95-117. [http://view.ncbi.nlm.nih.gov/pubmed/15290783]
- Domingues FS, Lackner P, Andreeva A, Sippl MJ: Structure-based evaluation of sequence comparison and fold recognition alignment accuracy. J Mol Biol. 2000, 297 (4): 1003-1013.
- Kmiecik S, Jamroz M, Zwolinska A, Gniewek P, Kolinski A: Designing an automatic pipeline for protein structure prediction. NIC Series. 2008, 40: 105-108.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.