Three-dimensional protein structures provide invaluable insights into the molecular basis of protein function, and such insights are essential for the rational design of molecules that regulate these functions. In an increasing number of cases, it has become possible to model protein structures with acceptable accuracy and with much less effort than experimental methods require. Progress in computational protein structure prediction has been driven by methodological improvements in template-based modeling (TBM), which uses experimental structures of homologous proteins as templates. As biological sequence and structure databases continue to expand, TBM is expected to become an even more promising tool for practical problems in molecular biology, pharmaceutical chemistry, and protein engineering.
Template-based modeling, also called homology modeling or comparative modeling, generally consists of the following steps [1, 2]: (1) identification of homologous proteins with known structures to be used as templates; (2) alignment of the sequences of the target and templates; (3) creation of model structures from the alignment; and (4) refinement of the models. Contemporary methods usually treat each stage separately, and the full TBM procedure can therefore be established by combining methods for each of the above stages.
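The four stages listed above can be pictured as independent functions chained into one pipeline. The following is a minimal sketch of that composition only; the stage functions, the `state` dictionary, and its keys are hypothetical placeholders introduced for illustration and do not correspond to any published tool.

```python
from typing import Callable, Dict, List

# A stage takes the current modeling state and returns an updated state.
Stage = Callable[[Dict], Dict]


def run_tbm(target_sequence: str, stages: List[Stage]) -> Dict:
    """Run each stage in order, passing the state from one to the next."""
    state: Dict = {"target": target_sequence, "completed": []}
    for stage in stages:
        state = stage(state)
    return state


def make_placeholder_stage(name: str) -> Stage:
    """Build a stand-in stage that only records that it ran (no real modeling)."""
    def stage(state: Dict) -> Dict:
        updated = dict(state)
        updated["completed"] = state["completed"] + [name]
        return updated
    return stage


# Hypothetical stage names mirroring steps (1) to (4) in the text.
PIPELINE = [
    make_placeholder_stage("template_identification"),
    make_placeholder_stage("sequence_alignment"),
    make_placeholder_stage("model_building"),
    make_placeholder_stage("refinement"),
]

result = run_tbm("MKTAYIAKQR", PIPELINE)
```

Because each stage shares one interface, a method for any single stage could in principle be swapped without touching the others, which is the modularity the paragraph describes.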
Despite recent progress, challenges remain at each of these stages. One important challenge is how to optimally combine information from multiple templates to build a single model when experimental structures of multiple homologues are available. Using multiple templates rather than a single template offers several obvious benefits: the chance of including a better template increases, and the fraction of the target sequence covered by templates is extended [3–5]. In addition, different regions of the template structures may be combined to produce a more accurate model. In practice, however, combining information from multiple templates optimally is complicated. Because the average quality of multiple templates is bound to be worse than that of the single best template, using multiple templates carries a substantial risk of contaminating the reliable information from the best template. To overcome this problem, various approaches have been proposed [1, 7, 8]. Most of them rely heavily on a single top template, while additional templates are used to fill the gaps not covered by the top template [3, 9].
Another challenge is to model regions that are structurally variable among templates or not covered by any template, which we call unreliable local regions (ULRs). Unless the target sequence is quite similar to those of the templates (for example, with sequence identity > 30%), the expected quality of template-based models can be limited by such regions. Moreover, such poorly conserved regions, where sequence insertions or deletions occur, may fall outside the scope of typical TBM. Despite previous efforts, progress in modeling such regions appears to be rather limited. Since high-resolution models are required for practical applications, better ULR modeling is clearly essential.
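To make the "not covered by any template" part of this definition concrete, the sketch below scans aligned template rows and reports stretches of target positions that have no aligned template residue. It is an illustration under stated assumptions, not the ULR detection procedure of this work: the function name `uncovered_regions` is hypothetical, each column is assumed to correspond to one target residue, `-` is assumed to mark a gap, and structural variability among templates is not considered at all.

```python
from typing import List, Tuple


def uncovered_regions(template_rows: List[str], gap: str = "-") -> List[Tuple[int, int]]:
    """Return half-open (start, end) ranges of target positions that no template covers.

    Assumes every row has one character per target residue (hypothetical input format).
    """
    if not template_rows:
        return []
    length = len(template_rows[0])
    if any(len(row) != length for row in template_rows):
        raise ValueError("all template rows must have the same length")

    regions: List[Tuple[int, int]] = []
    start = None  # start index of the uncovered stretch currently being scanned
    for i in range(length):
        covered = any(row[i] != gap for row in template_rows)
        if not covered and start is None:
            start = i                      # an uncovered stretch begins here
        elif covered and start is not None:
            regions.append((start, i))     # the stretch ended at the previous column
            start = None
    if start is not None:
        regions.append((start, length))    # stretch runs to the end of the target
    return regions


# Two toy template rows aligned to an 8-residue target.
example = uncovered_regions(["AC--GT--", "A---GTA-"])
```

In this toy input, columns 2 to 3 and column 7 are gaps in both rows, so those positions would have to be modeled without template information.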
We recognize that the two challenges above are not independent of each other. For example, the performance of ULR modeling can be limited by the quality of the framework structure constructed from multiple templates [10, 11]. We therefore propose a strategy in which the initial TBM and the subsequent ULR modeling benefit from each other. In the initial TBM, we focus on accurate modeling of the regions that are more conserved among multiple templates, without needing to consider potentially unreliable regions, since these are handled in the ULR modeling stage. In the ULR modeling stage, we fix the more reliable core structure so that the potentially less reliable ab initio ULR modeling does not degrade the overall model quality. As a result, ULRs can be modeled within a more accurate framework structure, and the conformational search space for ULR modeling is effectively reduced to the local regions. Related approaches that construct a reliable core and refine unreliable regions have been proposed previously [12, 13]. Our approach differs from these in that we place more emphasis on the “accuracy” (rather than the “coverage”) of the core structure in the initial TBM stage (see METHODS for details).
We call this new method GalaxyTBM, as it is based on the GALAXY molecular modeling package [11, 14–16]. GalaxyTBM employs a multiple-template method designed to produce reliable core structures by rescoring HHsearch results for multiple-template selection and by core sequence alignment using PROMALS3D. Model building from the alignment and subsequent ULR modeling are performed using optimization modules in GALAXY [11, 16].
All components of the prediction pipeline were tested in the 9th Critical Assessment of Techniques for Protein Structure Prediction (CASP9) as a predictor named “Seok-server.” According to the official results of CASP9, Seok-server was recognized as one of the top six servers. Because the method was immature at the beginning of CASP9 and the prediction strategy for Seok-server had to be modified a few times during the experiment, the most recent version, GalaxyTBM, is presented here. When GalaxyTBM was tested on 68 single-domain CASP9 TBM targets, with the structure database fixed at the version used by Seok-server during CASP9, it reproduced the performance of Seok-server (average GDT-TS of 68.1, compared to 68.4 for Seok-server). The performance of the TBM pipeline was evaluated by analyzing the improvements achieved at each stage. The merits of the new components in the pipeline over other TBM methods are also discussed.
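For readers unfamiliar with the GDT-TS values quoted above, the score is conventionally the average, over distance cutoffs of 1, 2, 4, and 8 Å, of the percentage of model Cα atoms lying within that cutoff of their positions in the experimental structure. The sketch below computes this average from a precomputed list of per-residue distances. It is a simplified illustration: the function name is hypothetical, a single fixed superposition is assumed (the standard evaluation searches for the best superposition at each cutoff), and the inclusive `<=` comparison is a choice made here.

```python
from typing import Sequence


def gdt_ts_fixed_superposition(
    distances: Sequence[float],
    cutoffs: Sequence[float] = (1.0, 2.0, 4.0, 8.0),
) -> float:
    """Average percentage of residues within each cutoff (in Å) for one fixed superposition."""
    if not distances:
        raise ValueError("at least one per-residue distance is required")
    n = len(distances)
    # Fraction of residues whose C-alpha deviation is within each cutoff.
    fractions = [sum(d <= c for d in distances) / n for c in cutoffs]
    return 100.0 * sum(fractions) / len(cutoffs)


# Toy example: four residues with C-alpha deviations of 0.5, 1.5, 3.0, and 9.0 Å.
score = gdt_ts_fixed_superposition([0.5, 1.5, 3.0, 9.0])
```

For the toy distances, the fractions within 1, 2, 4, and 8 Å are 1/4, 2/4, 3/4, and 3/4, giving a score of 56.25 on the 0 to 100 scale used for the averages reported above.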