TransCent: Computational enzyme design by transferring active sites and considering constraints relevant for catalysis
© Fischer et al; licensee BioMed Central Ltd. 2009
Received: 02 October 2008
Accepted: 10 February 2009
Published: 10 February 2009
Computational enzyme design is far from being applicable for the general case. Due to computational complexity and limited knowledge of the structure-function interplay, heuristic methods have to be used.
We have developed TransCent, a computational enzyme design method supporting the transfer of active sites from one enzyme to an alternative scaffold. In an optimization process, it balances requirements originating from four constraints. These are 1) protein stability, 2) ligand binding, 3) pKa values of active site residues, and 4) structural features of the active site. Each constraint is handled by an individual software module. Modules processing the first three constraints are based on state-of-the-art concepts, i.e. RosettaDesign, DrugScore, and PROPKA. To account for the fourth constraint, knowledge-based potentials are utilized. The contribution of modules to the performance of TransCent was evaluated by means of a recapitulation test. The redesign of oxidoreductase cytochrome P450 was analyzed in detail. As a first application, we present and discuss models for the transfer of active sites in enzymes sharing the frequently encountered triosephosphate isomerase fold.
A recapitulation test on native enzymes showed that TransCent proposes active sites that resemble the native enzyme more than those generated by RosettaDesign alone. Additional tests demonstrated that each module contributes to the overall performance in a statistically significant manner.
Enzymes are highly specific and efficient biocatalysts. It is of great scientific and practical interest to alter the function and stability of enzymes, or even generate them de novo from first principles . Due to our limited understanding of the structure-function interplay, most of the successful enzyme design examples have been achieved by "directed evolution", i.e. by performing several rounds of random mutagenesis in combination with efficient screening or selection systems to isolate beneficial variants . However, the complexity of the problem calls for computational methods that are aimed at guiding enzyme design experiments, and during the last years we have seen significant progress along these lines. The first computer-based approaches described were Dezymer  and ORBIT . Since then, novel program suites have been developed [5–7], and a few successful experimental enzyme designs have been based on computational methods [8, 9]. Only recently, the successful design of Kemp elimination catalysts  and of retro-aldol enzymes accommodating a multistep reaction  have been reported. Despite this progress, computational enzyme design is far from being applicable to any arbitrary problem.
Why is this the case? The structural basis of enzyme catalysis is often not well understood, which makes it difficult to define the critical optimization criteria in a specific enzyme design problem . Factors that might be relevant and often mutually interdepend are binding of the substrate and transition state of a given reaction, release of the product, conformational flexibility or dynamics, and pKa values of catalytic residues. Considering protein dynamics is still beyond the scope of current optimization methods. However, ligand-binding  and adjusting pKa values of active site residues  can be integrated into the optimization process. Additionally, a statistical analysis of homologous proteins might help to incorporate factors that cannot be considered explicitly. Multiple sequence alignments (MSAs) have already been used for consensus design approaches . Ensemble-based scoring functions  and knowledge-based potentials  turned out to be valuable approaches for deducing additional characteristics from larger datasets.
Along these lines, we designed the computational enzyme design program TransCent. In order to focus on the most relevant determinants, we restrict the design goal to a well defined task: This is the transfer of an active center from one enzyme (the template) to a second protein (the scaffold) whose backbone remains fixed. The novel approach implemented with TransCent is the concurrent consideration of four constraints during modeling. These are protein stability, ligand binding, pKa values of active site residues, and structural features of the active site described by knowledge-based potentials. The program comprises four modules, each of which processes one constraint. We show that each module contributes to the quality of the 3D model, and that the combination of all modules performs best. As a first application, we present and discuss models transferring active sites of (βα)8-barrels, which form a frequently encountered and catalytically versatile enzyme family [18, 19].
Prerequisites and conventions
TransCent supports the transfer of an active site from one enzyme (the template) to another protein backbone (the scaffold). We focused on design problems that fulfill the following prerequisites: 1) The 3D structure of both the template and the scaffold are available with adequate quality (resolution < 2.5 Å, position of all atoms in the active site resolved, no loops missing). 2) The pose of the ligand-template complex is known. 3) The active sites of the scaffold and the template can easily be superimposed. 4) For the template, the sequences of at least 80 homologous proteins are available, allowing the inference of a well populated MSA.
Moreover, in order to reduce computational complexity of the algorithm, the following assumptions and restrictions are also effective: 5) The backbone of the scaffold and the position of the ligand are kept fixed in 3D space during optimization. 6) Only side chain conformations of a backbone dependent rotamer library  are considered. 7) The pose of the ligand as observed in the template is assumed to be the relevant one for catalysis (i.e. represents the active binding mode).
Note that the above limitations are either related to the precision required for reliable predictions (conditions 1 and 2) or could be circumvented by combining TransCent with existing programs like RosettaMatch  (condition 3) or by computing transition states  (condition 7). Computational complexity imposes conditions 5 and 6; for many proteins condition 4 can easily be accomplished due to the abundance of completely sequences genomes.
During the design process, residues are grouped with respect to their distance from the ligand. We name residues, which have to the ligand a distance of at most 7 Å, the active center ACT_CENT. ACT_CENT is surrounded by a shell SHELL_1 consisting of residues having a distance between 7 Å and 15 Å. All other residues belong to SHELL_2. The central shells are larger than those used elsewhere  in order not to miss relevant residues. In the design phase, the backbone of the scaffold has to be decorated with side-chains. Residues belonging to ACT_CENT will be redesigned; residues of SHELL_1 will be repacked to flexibly embed the amino acids constituting ACT_CENT. For SHELL_2, TransCent keeps residues and side chain conformations as found in the scaffold. Altogether, this selection of constraints is a compromise of speed and precision by allowing a flexible modeling of the active site and the conservation of remote residues, which are presumably less relevant for catalysis.
A framework for enzyme design
A typical protein design program consists of three elements . These are a modeling unit, which generates the atomic details of a protein model, a unit that evaluates the quality of a model via an energy function, and an optimization unit that directs the design process to find low energy configurations. A state-of-the-art program for protein modeling is RosettaDesign , which predicts an optimal sequence for a given backbone and comprises the above mentioned three elements. We utilized Rosetta as a framework for TransCent, i.e. we did not alter the modeling and optimization unit. However, we extended its energy function to include features relevant for enzyme design. The novel energy function consists of four terms, which are computed by separate modules. These are related to protein stability (ST-module), ligand binding (LB-module), knowledge-based potentials (KP-module), and pKa-values (PK-module). In the following, these modules are described in detail.
Protein stability: ST-module
E ST is the sum of all self-energies E(res i ) and pairwise energies E(res i , res j ) for all residues i and all residue pairs i, j of a 3D model . In order to increase speed, Rosetta stores pre-calculated partial results of this energy term in tabular form. The ST-module accesses this table during the optimization process. As an alternative to Rosetta's approach, a different energy function like the one implemented by EGAD  might be utilized for the ST-module.
Ligand binding: LB-module
E LB is the sum of DrugScore energies determined for the interaction of ligand atoms with residues of the model.
Knowledge-based potentials: KP-module
For modeling enzyme function, it is crucial to parameterize all relevant aspects of the active site. However, frequently it is unclear, which details of a protein structure are relevant for catalysis. In such cases, the usage of knowledge-based potentials is a proper method of including information given implicitly by protein structures or sequences . We utilized this approach to determine type and location of catalytic active residues and those residues that interact with the ligand by analyzing residue conservation and networks of hydrogen bonding (see Methods).
For each residue of an active site to be modeled, a set of atoms PUT_HB_SET i may participate in hydrogen bonds with the ligand; see Methods. For an assignment of individual PUT_HB atoms to KBP s, we used the Hungarian Method . In each case, the outcome of an assignment are three sets: HB_ KBP consists of those PUT_HB atoms and KBP s assigned to each other. HB_UNLINKED contains all PUT_HB atoms that were not allocated to a KBP, and KBP_UNLINKED subsumes all KBP s, that were not occupied by a PUT_HB atom. Based on this assignment, E KP is computed as:
E KP = E HB_KBP + E HB_UNLINKED + E KBP_UNLINKED (3)
The three terms score those residues contributing to the related sets; see Eq. (13).
Optimizing pKa values: PK-module
Here, is the predicted pKa value for residue i of the template used as a reference and is the predicted pKa value of the corresponding residue in the model. The factor λ pKa was determined together with other parameters (see below). λ pKa = 2.0 turned out to be adequate, which is in agreement with previous recommendations . The weight cons(res i ) is related to residue conservation deduced from the amino acid frequency distribution of an MSA of homologous proteins; see Methods, Eq. (7). It is 1.0 for strictly conserved residues and decreases for less conserved ones. The assignment routine of the KP-module was utilized to deduce from the set HB_ KBP those m residues that were considered for pKa optimization.
A combined energy function
Based on the outcome of the four modules, TransCent computes the energy E TransCent :
E TransCent = 1.0·E ST + w LB ·E LB +w KP ·EKP·+ w PK ·E PK (5)
which is a combination of the above introduced terms. We utilized the outcome of the ST-module as reference. Therefore, w ST is 1.0. The remaining three weight factors and λpKa were determined by analyzing a training set (see below). TransCent performed best with w LB = 0.15·10-3, w PK = 0.5, and w KP = 1.0. E TransCent was embedded into Rosetta as an alternative energy function. Note that individual energy terms of E TransCent can be eliminated by setting the respective weight to 0.0. In the following, we will designate a non-standard combination of modules by enumerating active modules as for example in TransCent(ST, LB). In this case, the ST- and the LB-modules are enabled and the KP- and the PK-modules are disabled. The term TransCent(*) is equivalent to TransCent(ST, LB, KP, PK).
Training the weights
A common method for in silico training and for benchmarking design algorithms is the recapitulation of native proteins [3, 6, 22, 33], where – for a given set of examples – the concordance of calculated models and the wild-type is evaluated. We selected a set of 128 enzymes (see Materials), which we named ENZ_TEST. For training and evaluation, we used two different similarity measures, both of which assess the correspondence of protein sequences of a given model and the native protein. The first one [IDENT_RES] was the percentage of identical residues; the second one [BLOSUM_SCORE, see Eq. (6)] was deduced from amino acid similarity. Both values were determined by analyzing residues belonging to the active centers ACT_CENT. Among the proteins of ENZ_TEST, the size of active centers varied from 10 to 60 residues; the mean was 28 residues. Due to the simulated annealing protocol  used during optimization, final models may differ for individual experiments. Therefore, we computed for all entries of ENZ_TEST 10 individual designs each and determined the mean for both similarity measures. In the following, these mean results will be reported.
At first, this approach was used to determine optimal weights required for Eq. (5). Different combinations of TransCent's modules were used and weights were varied within an appropriate range of values. Those weights were identified that gave the highest BLOSUM_SCOREs. The outcome of these computations was as follows:
In order to assess the improvement gained by combining TransCent's modules, we first determined the performance of the ST-module on ENZ_TEST. The IDENT_RES values varied between 9% and 56%, the mean was 29.5%; the BLOSUM_SCOREs were between -0.4 and 3.4, the mean was 1.1. The rank correlation of IDENT_RES values and BLOSUM_SCORE values was high with statistical significance (r s = 0.92, p << 0.001), indicating that both similarity measures are equally well suited to evaluate model quality.
For the determination of w LB , the LB-module was combined with the ST-module. IDENT_RES and BLOSUM_SCORE values had an optimum at w LB = 0.15·10-3. In this case, the mean IDENT_RES value was 37% and the mean BLOSUM_SCORE was 1.5.
A prerequisite for using this module is the existence of a sufficiently large set of homologous sequences, which was the case for the 27 elements of the subset named ENZ_TEST hom . To determine an optimal weight w KB , TransCent(ST, KP) was used. In this case, both the IDENT_RES values and the BLOSUM_SCORE values did not show a distinct optimum, but ascended a plateau. In order to avoid an overvaluation of the associated potential, we selected w KP = 1.0 which is the smallest value gaining the plateau. In this case, the mean IDENT_RES value was 48% and the BLOSUM_SCORE value was 2.2.
This module depends on the identification of a specific set of residues accomplished by the KP-module (see Methods). Therefore, ENZ_TEST hom was analyzed using the combination TransCent(ST, KP, PK). As w KP was set to 0.0, the contribution of E KP was disabled. Performance was maximal for w PK = 0.5. In this case IDENT_RES was 39.5% and the BLOSUM_SCORE was 1.7.
Assessing TransCent's performance
One might argue that training and test data were not separated for the above evaluation. However, as we trained only four parameters (w LB , w KP , w PK , λ pKa ), the program cannot memorize specific arrangements of active sites possessing hundreds of degrees of freedom each. In order to confirm this argument and to show that our training allows an unbiased determination of the above weights, we performed a leave-one-out test. Each of the 27 enzymes belonging to ENZ_TEST hom was used for validation, while deducing individual weights from the remaining 26 enzymes taken for training as described above. By conducting a grid search, optimal weight factors were determined. These factors were then used to measure the performance of each module in combination with the ST-module. Results were added to Figure 2. As can be seen, there is no significant performance difference between related tests. This result indicates that the training set allows a robust determination of the above parameters and confirms that the parameter set generalizes well.
Recapitulating the oxidoreductase cytochrome P450
A case study: converting (βα)8-barrels that bind ribulose-phosphate
As a first application, in silico transfer experiments were performed with five enzymes belonging to the SCOP  superfamily of ribulose-phosphate binding proteins, which belong to the frequently encountered and catalytically versatile enzyme class of (βα)8-barrels [18, 19]. These were the enzymes phosphoribosyl-5-amino-1-phosphoribosyl-4-imidazolecarboxamide isomerase (HisA), the cyclase subunit of imidazole glycerol phosphate synthase (HisF), phosphoribosylanthranilate isomerase (TrpF), indole-3-glycerol-phosphate synthase (TrpC), and the α subunit of tryptophan synthase (TrpA). HisA and HisF catalyze two successive steps within histidine biosynthesis, whereas TrpF, TrpC, and TrpA catalyze three consecutive reactions within tryptophan biosynthesis . For HisA and HisF, the ligand is N 1-[(5'-phosphoribosyl)formimino]-5-aminoimidazol-4-carboxamide ribonucleotide (PRFAR). Reduced 1-(o-carboxyphenylamino)-1-deoxyribulose 5-phosphate (rCdRP) is bound to TrpF and TrpC; imidazole glycerol phosphate (IGP) is bound to TrpA. Note that the ligand of HisA and HisF, which contains two sugar phosphate moieties, is twice as large as the ligands of TrpF, TrpC and TrpA, which contain only a single sugar phosphate group. Although the general topology of these five enzymes is similar, the mean RMSD value of related Cα-atoms is as high as 3.0 Å, as determined by TM-align . For each of the experiments described below, 10 models were generated and mean similarity values were calculated. Based on the five enzymes, 25 models can be generated; among these are five recapitulation experiments. For the latter cases, the mean number of identical residues in active sites was between 46% and 54%, which is in agreement with the mean performance of the program (see Figure 2). For the transfer experiments, the mean sequence identity value was 27%. In the following, several models resulting from recapitulation and transfer experiments are described.
Recapitulating HisA is hindered by insufficient data, as no X-ray structure of HisA in complex with a bound ligand is known. Alternatively, we used for the design process HisA with modeled PRFAR. The complex of HisA and PRFAR was minimized employing the force field MAB as implemented in the Program Moloc . An inspection of the HisA recapitulation and the HisA → HisF design showed that TransCent introduced the mutations D8K during recapitulation and D11K in the transfer experiment, thus replacing one of the negatively charged catalytic aspartates by a positively charged lysine residue. The Rosetta energy score E ST indicates a stabilization of about -1 kcal/mol caused by this exchange. In addition, the catalytically important Thr 164 was replaced by leucine during recapitulation. In the modeled binding mode the distances of residues Asp 8 and Thr 164 to the ligand are above TransCent's global cutoff (4.0 Å); consequently no knowledge-based potential was determined. Therefore, amino acids were selected upon stability criteria for these positions. This is why the HisA → HisF transfer sequence contains the exchanges D11K and T171A, although the native HisF residues would match the catalytic residues Asp 8 and Thr 164 in HisA. The positively charged lysine side chain of D11K is placed near the negatively charged D176E on the opposite side of the active site, enabling a stabilizing electrostatic interaction. This example indicates that the visual inspection of the models is crucial to detect stabilizing mutations, which may be disadvantageous for catalysis and could be due to parameter settings or limitations of our approach.
TrpF catalyses the isomerization of phosphoribosylanthranilate (PRA) to carboxyphenylamino-1'-deoxyribulose-5'-phosphate (CdRP). The X-ray structure of TrpF (1lbm) from T. maritima in complex with the product analog rCdRP is known; the catalytic residues are Cys 7 and Asp 126 . TransCent's redesign of TrpF recovered the complete environment around the ligand, including the catalytic residues (Figure 6C). For the 41 residues making up ACT_CENT, the program proposed 17 exchanges; 11 are located in loops at the surface. The remaining 6 replacements Q81N, A103G, L124T, V155I, S135T, and V179A, which are located inside the barrel, are conservative ones. Energy minimization of the crystal structure and the model resulted in similar stability and ligand binding energy.
When judging the quality of a model with respect to the successful recapitulation of catalytic residues and calculated ligand binding energy, then the best design of the TrpF template was accomplished on the HisF scaffold. It has been shown experimentally that moderate TrpF activity can be established on the HisF-scaffold by mutating Asp 130 to a non-negatively charged residue . In our TrpF → HisF transfer, TransCent proposed the D130H exchange, reconstituting His 83 of the TrpF template. The essential catalytic residues Cys 7 and Asp 126 of the template were introduced in the scaffold by TransCent at the correct locations near the ligand by exchanges V48C and T171D (Figure 6C, D). Interestingly, Asp 126 and T171D are both located on β-sheet 6 and perfectly superimpose, whereas Cys 7 is located on β-sheet 1, but V48C on β-sheet 2. This finding demonstrates that TransCent can position an essential residue in a different secondary structure element of the scaffold than utilized in the template. This capability is further demonstrated by an arginine interacting with the anthranilic acid moiety of the ligand: In the TrpF template, Arg 36 is located in the long loop after β-sheet 2 and its positively charged guanidinium side chain forms a salt bridge with the negatively charged carboxylate moiety of rCdRP (Figure 6C). In the HisF scaffold this loop is missing. However, β-sheet 1 is elongated and the exchange D11R introduces an arginine with a side chain oriented towards the ligand (Figure 6D). In this case, all three terms E LB , E KP , and E PK are negative, i.e. all constrains demand for this mutation. In summary, the above examples illustrate the interplay of TransCent's modules when constructing active sites.
Computational enzyme design methods improve slowly and are far from being perfect
Recent work [10, 11] demonstrates that the de novo design of enzymatic activities not found in natural biocatalysts has become feasible. However, the designed activities were considerably lower than those evolved by nature , although the computation consumed more than 100 000 CPU hours. Thus, in spite of these pioneering efforts, we are still far away from an adequate understanding of enzyme structure-function relationship and the de novo design of highly effective active sites.
When developing TransCent, we followed a less ambiguous goal by transferring an already existing active site to a different backbone. TransCent is based on RosettaDesign and comprises modules for the optimization of ligand binding (LB-module) and pKa values of essential residues (PK-module). Using knowledge-based potentials as implemented in the KP-module contributed favorably to the performance of TransCent, demonstrating that active site "fingerprints" can be deduced from homology models and structure databases. In order to limit computer time, we concentrated on the most important features of enzyme catalysis, accepting restrictions such as fixed backbones and ligands. However, due to the modular concept of TransCent and the structure of its energy function [Eq. (5)], additional constraints can easily be integrated. Along these lines, methods for assessing side-chain conformational entropy have been proposed recently  and the latest version of Rosetta contains a "Backrub"-model introducing local backbone flexibility .
Approaching specific requirements in enzyme design
The modules of TransCent constitute an approach of a multi-objective optimization for enzyme design. Here, we utilized a classical energy function [Eq. (5)]; alternative non-standard approaches have recently been described . Each of TransCent's modules contributes to the quality of the design and has its specific strengths and weaknesses that will be discussed below. In general, the ultimate proof demonstrating success in model building is the biochemical characterization of enzyme function. Unfortunately, wet-lab experiments are time-consuming and expensive. Therefore, in silico methods must serve as a surrogate especially for the evaluation of algorithms. However, these approaches allow us at best to demonstrate the plausibility of a design, and the assessment rests on the assumption that wild-type sequences are optimal for catalysis . Trusting in this postulate, we have performed active center recapitulation experiments to estimate the significance of individual modules for the design success. It is the aim of the ST-module to guarantee the stability of the modeled protein. When used exclusively, this module recovered on average 30% of the wild-type residues (Figure 2). This value is within expectation for two reasons: 1) Active sites resemble more the surface than the core of a protein. For remodeling, a recovery rate varying between 27% for surface positions and 52% for core positions has been reported . 2) Residues of the active sites often do not contribute to stability, and catalytically relevant residues even tend to destabilize the enzyme . Therefore, programs focusing on protein stability will fail to recover these residues. The same holds for residues that are constrained due to the shape of the binding pocket. The observation that RosettaDesign nevertheless recovered almost one third of the residues in active sites could indicate that the backbone conformation restricts the selection of amino acids at certain positions. Differences in rotamer frequencies for backbone independent and backbone dependent libraries support this notion. In addition, recent studies indicate that side-chain rotamers may lock the backbone into slightly different conformations .
The LB-module aims at optimizing ligand binding. It uses a rotamer-based version of DrugScore allowing TransCent to assess the impact of each individual rotamer. However, due to algorithmic complexity, the rotamer-based version, in contrast to the original version, is not able to consider the desolvation effect, as the orchestration of neighboring residues influences the outcome markedly. Despite this restriction, the performance data shown in Figure 2 illustrates that the LB-module contributes significantly to the recapitulation success of TransCent.
The KP-module determines the characteristics of the template's active site by deducing potentials without expert knowledge. The module aims at optimizing the protein-ligand hydrogen bond network by arranging donors and acceptors in a way resembling the template. As the module considers residue conservation as well, the selection of amino acids can be carefully balanced. In other programs, relevant amino acids have to be fixed before active site optimization can be started and therefore only residues relevant for catalysis could be considered until now [3, 6]. Thus, our approach of using knowledge-based potentials adds flexibility not yet implemented in traditional computational design methods. This option requires that the sequences of at least 80 homologous proteins must be available to deduce the potentials with acceptable quality. However, the sequencing of hundreds of genomes  during the past years ensures that this condition can be fulfilled in most cases.
The PK-module optimizes the electrostatic embedding of residues relevant for catalysis. We introduced this module, because in general it is not sufficient to merely place active site residues in the correct orientation. In addition, often a proper pKa value will be essential for a specific residue to act as general acid or base during catalysis. In accordance with this statement, the markedly improved activity of a computationally designed enzyme by means of a directed evolution experiment has been explained with the pKa shift of a catalytic site . We consider the simultaneous optimization of protein stability and of pKa values in a rotamer based protein design framework as a key feature of TransCent. However, one might argue that ignoring the effect of ligand atoms onto pKa values or an insufficient accuracy of pKa value determination by PROPKA could render the results ambiguous. However, a systematic deviation of absolute values has only a minor effect on the outcome of the design, as both the template and the model are treated in the same way. In accordance with this notion, Figure 2 illustrates the significant positive contribution of the PK-module to the performance of TransCent.
Transfer experiments will promote computational enzyme design
Reflecting the status quo of computational protein and enzyme design, the current potential and limitations of these methods become obvious. For several algorithms, their ability of creating stable proteins by decorating native backbones has been successfully demonstrated [51, 52]. Therefore, generating in silico a stable protein based on a native fold should be feasible in most cases. For enzyme design, taking this constraint is clearly not sufficient and additional features such as substrate binding  and transition state stabilization  have to be considered to generate native-like active sites. Drafting artificial folds might be regarded as a further step towards de novo enzyme design. However, it is doubtful whether this effort is a necessary prerequisite to establish novel functions or to surpass the proficiency of existing enzymes. For example, when searching a suitable scaffold for the Kemp-elimination reaction, more that 100 000 locations for putative active sites were identified in natural folds . In addition, the observed preference for the ancient and frequently encountered (βα)8-barrel , which accommodates enzymatic reactions covering five of the six classes defined by the Enzyme Commission , suggests that computational design can readily use folds evolved by nature. Nevertheless, in spite of first promising success cases, our limited understanding of most enzymatic reactions makes de novo design a very difficult task and leaves room for simpler, nevertheless instructive approaches. Following recapitulation experiments, the transfer of an existing active site to a new scaffold – as supported by TransCent – is the next obvious step to take. Both the in silico analysis of generated 3D models as done above, and particularly the biochemical characterization of the designed proteins will identify properties that were modeled in an acceptable or insufficient way. These findings will help to validate or improve in a feedback-loop  both TransCent as a whole and the individual methods implemented in its modules. Thus, transfer experiments will contribute to our understanding of enzyme function and bring forward computational enzyme design.
TransCent is a computational enzyme design program, which predicts mutations in a scaffold aimed at establishing the activity of a template enzyme. During the design process, protein stability, substrate binding, pKa values of essential residues and knowledge-based hydrogen bonding networks are considered simultaneously by integrating separate optimization modules. Our in silico evaluation demonstrated that TransCent can recapitulate a considerable fraction of active site residues for a given template. We will now experimentally test some of the transfer designs in order to further judge the prediction quality of the program. Depending on the outcome, we will take advantage of TransCent's modular character to incorporate additional features such as backbone and ligand flexibility, which promises a further fine-tuning of the designed active site.
Test data ENZ_TEST
Using the following rules, 128 entries of the pdb database  were selected with a culling server : 1) The resolution had to be at least 1.6 Å and the R-factor at most 0.25. 2) The structure had to be determined via X-ray crystallography and the sequence had to consist of at least 100 residues. 3) One ligand consisting of more than 10 atoms had to be part of the structure. 4) At least ten residues had to be not more than 5 Å apart from the ligand. 5) For the pairwise comparison of all entries, a maximal sequence identity value of 20% was tolerated. We named this set of 3D structures ENZ_TEST; additional file 1 lists the pdb codes. For maximal performance of TransCent, a set of at least 80 homologous sequences has to be available for a protein. This was the case for those 27 proteins printed bold in the data set listed in additional file 1. We named this set ENZ_TEST hom .
Multiple sequence alignments
To create an MSA, we realigned the sequence of the template with sequences originating from the corresponding Pfam entry by using MAFFT . Those sequences were selected that fulfilled two criteria imposed on residues belonging to ACT_CENT (For a definition of the set ACT_CENT, see Results). 1) A pairwise comparison with corresponding residues of the template resulted in a sequence identity value > 40%. 2) The mean T-Coffee core index  for these residues was > 2. This cut-off assures a sufficient quality of the alignment. We deduced these criteria from a recapitulation experiment: By using Modeller version 8.2 , we created homology models for at least 80 sequences and determined the mean RMSD value for all atoms of ACT_CENT residues of the template and the models. When applying the above criteria, the average RMSD was ~2 Å (determined by TM-align , data not shown).
As the sequences A, B of native proteins or models are of equal length, an alignment consists of n residue pairs making up sequences A or B. BLOSUM62 values are from the related scoring matrix .
Cons(res i ): Scoring the conservation of individual residues
n is the number of lines in the MSA, and are the amino acids occurring in lines j or k at position i. BLOSUM62 values are from the related scoring matrix . For strictly conserved residues, cons(res i ) is 1.0.
Creating a superposition of models and structures
In order to increase the number of structures utilized for the determination of knowledge-based potentials, homology models were created. Starting from the respective Pfam  entry to which the template belongs, an MSA was generated (see above). Related sequences were fed into Modeller version 8.2 . Resulting structures were superimposed based on the alignment given by the MSA. We name a template-specific ensemble of superimposed 3D structures TEMPL_ENS. For illustrations, plots were generated by using SWISS-MODEL .
Setting the position and conformation of the ligand
For the transfer experiments described in Results, the input for the LB-module was generated by first superimposing active site residues of the template and the scaffold, and by transferring the ligand's pose from the template to the scaffold. In general, strategies developed for ligand docking  or drug discovery  as well as methods identifying key residues  may be considered for specifying the pose of the ligand.
Determining knowledge-based potentials and scores
In order to establish a hydrogen bonding pattern for an active site, which is in agreement with those ones observed in TEMPL_ENS, knowledge-based potentials  were used. To specify these potentials, each residue position i belonging to ACT_CENT was considered separately. For each residue i, all atoms HB participating in hydrogen bonds with the ligand were determined. Candidates are those nitrogen, oxygen or sulphur atoms belonging to the side chains of Arg, Asn, Asp, Cys, Glu, Gln, His, Lys, Ser, Thr, Tyr, and Trp, which are in close proximity to the ligand (distance < 4.0 Å). Backbone atoms were not considered as the design algorithm does not alter the backbone conformation. All HB atoms related to position i were combined in the set HB_SET i .
For modeling, we use PUT_HB j to name an atom of res j that might contribute a hydrogen bond with the ligand. An atom PUT_HB j has the coordinates coord j and belongs to an amino acid as j . Thus, it can be tested to what extent this PUT_HB j matches the preferences seen in TEMPL_ENS.
fapolar, obs(i) is the ratio of cases where no hydrogen bond was observed in structures of TEMPL_ENS divided by #TEMPL_ENS. Similarly, fapolar, exp(i) was deduced from the number of rotamers that cannot provide a HB atom by normalizing with the total frequency of rotamers for position i in the rotamer library. S apolar (i) is used to score cases where an active site of a model does not provide a PUT_HB atom that corresponds to the potential .
A fixed penalty Penalty_Val = 6.9 (deduced from an error rate of approximately 1/1000) was used to score PUT_HB j atoms of the model that could not be assigned to any KBP.
S NO_HB = Penalty _Val (12)
During optimization, the KP-module computes the energy E KP as a combination of three terms and based on the outcome of the assignment due to the Hungarian Method :
E KP = E HB_KBP + E HB_UNLINKED + E KBP_UNLINKED (13)
The function HM(res j ) selects the specific KBP i assigned to res j by means of the Hungarian Method. Here, all PUT_HB atoms are considered that belong to the current set HB_KBP. See also the description of the KP-module in Results.
The term EHB_UNLINKED originates from the number of unlinked PUT_HB atoms multiplied with the penalty score SNO_HB:
E HB_UNLINKED = #PUT_HP UNLINKED ·S NO_HB
This work was supported by a grant of the Deutsche Forschungsgemeinschaft (STE 891/4-3).
- Toscano MD, Woycechowsky KJ, Hilvert D: Minimalist active-site redesign: teaching old enzymes new tricks. Angew Chem Int Ed Engl. 2007, 46 (18): 3212-3236.View ArticlePubMedGoogle Scholar
- Bloom JD, Meyer MM, Meinhold P, Otey CR, MacMillan D, Arnold FH: Evolving strategies for enzyme engineering. Curr Opin Struct Biol. 2005, 15 (4): 447-452.View ArticlePubMedGoogle Scholar
- Hellinga HW, Richards FM: Construction of new ligand binding sites in proteins of known structure. I. Computer-aided modeling of sites with pre-defined geometry. J Mol Biol. 1991, 222 (3): 763-785.View ArticlePubMedGoogle Scholar
- Dahiyat BI, Mayo SL: Protein design automation. Protein Sci. 1996, 5 (5): 895-903.PubMed CentralView ArticlePubMedGoogle Scholar
- Gordon DB, Hom GK, Mayo SL, Pierce NA: Exact rotamer optimization for protein design. J Comput Chem. 2003, 24 (2): 232-243.View ArticlePubMedGoogle Scholar
- Zanghellini A, Jiang L, Wollacott AM, Cheng G, Meiler J, Althoff EA, Röthlisberger D, Baker D: New algorithms and an in silico benchmark for computational enzyme design. Protein Sci. 2006, 15 (12): 2785-2794.PubMed CentralView ArticlePubMedGoogle Scholar
- Chowdry AB, Reynolds KA, Hanes MS, Voorhies M, Pokala N, Handel TM: An object-oriented library for computational protein design. J Comput Chem. 2007, 28 (14): 2378-2388.View ArticlePubMedGoogle Scholar
- Bolon DN, Mayo SL: Enzyme-like proteins by computational design. Proc Natl Acad Sci USA. 2001, 98 (25): 14274-14279.PubMed CentralView ArticlePubMedGoogle Scholar
- Pinto AL, Hellinga HW, Caradonna JP: Construction of a catalytically active iron superoxide dismutase by rational protein design. Proc Natl Acad Sci USA. 1997, 94 (11): 5562-5567.PubMed CentralView ArticlePubMedGoogle Scholar
- Röthlisberger D, Khersonsky O, Wollacott AM, Jiang L, Dechancie J, Betker J, Gallaher JL, Althoff EA, Zanghellini A, Dym O: Kemp elimination catalysts by computational enzyme design. Nature. 2008, 453 (7192): 190-195.View ArticlePubMedGoogle Scholar
- Jiang L, Althoff EA, Clemente FR, Doyle L, Röthlisberger D, Zanghellini A, Gallaher JL, Betker JL, Tanaka F, Barbas CF: De novo computational design of retro-aldol enzymes. Science. 2008, 319 (5868): 1387-1391.PubMed CentralView ArticlePubMedGoogle Scholar
- Lippow SM, Tidor B: Progress in computational protein design. Curr Opin Biotechnol. 2007, 18 (4): 305-311.PubMed CentralView ArticlePubMedGoogle Scholar
- Chakrabarti R, Klibanov AM, Friesner RA: Computational prediction of native protein ligand-binding and enzyme active site sequences. Proc Natl Acad Sci USA. 2005, 102 (29): 10153-10158.PubMed CentralView ArticlePubMedGoogle Scholar
- Warshel A, Sharma PK, Kato M, Parson WW: Modeling electrostatic effects in proteins. Biochim Biophys Acta. 2006, 1764 (11): 1647-1676.View ArticlePubMedGoogle Scholar
- Russ WP, Lowery DM, Mishra P, Yaffe MB, Ranganathan R: Natural-like function in artificial WW domains. Nature. 2005, 437 (7058): 579-583.View ArticlePubMedGoogle Scholar
- Lilien RH, Stevens BW, Anderson AC, Donald BR: A novel ensemble-based scoring and search algorithm for protein redesign and its application to modify the substrate specificity of the gramicidin synthetase a phenylalanine adenylation enzyme. J Comput Biol. 2005, 12 (6): 740-761.View ArticlePubMedGoogle Scholar
- Poole AM, Ranganathan R: Knowledge-based potentials in protein design. Curr Opin Struct Biol. 2006, 16 (4): 508-513.View ArticlePubMedGoogle Scholar
- Gerlt JA, Raushel FM: Evolution of function in (β/α)8-barrel enzymes. Curr Opin Chem Biol. 2003, 7 (2): 252-264.View ArticlePubMedGoogle Scholar
- Sterner R, Höcker B: Catalytic versatility, stability, and evolution of the (βα)8-barrel enzyme fold. Chem Rev. 2005, 105 (11): 4038-4055.View ArticlePubMedGoogle Scholar
- Dunbrack RL, Cohen FE: Bayesian statistical analysis of protein side-chain rotamer preferences. Protein Sci. 1997, 6 (8): 1661-1681.PubMed CentralView ArticlePubMedGoogle Scholar
- Dahiyat BI, Mayo SL: De novo protein design: fully automated sequence selection. Science. 1997, 278 (5335): 82-87.View ArticlePubMedGoogle Scholar
- Kuhlman B, Baker D: Native protein sequences are close to optimal for their structures. Proc Natl Acad Sci USA. 2000, 97 (19): 10383-10388.PubMed CentralView ArticlePubMedGoogle Scholar
- Kuhlman B, Dantas G, Ireton GC, Varani G, Stoddard BL, Baker D: Design of a novel globular protein fold with atomic-level accuracy. Science. 2003, 302 (5649): 1364-1368.View ArticlePubMedGoogle Scholar
- Gohlke H, Hendlich M, Klebe G: Knowledge-based scoring function to predict protein-ligand interactions. J Mol Biol. 2000, 295 (2): 337-356.View ArticlePubMedGoogle Scholar
- Sippl MJ: Calculation of conformational ensembles from potentials of mean force. An approach to the knowledge based prediction of local structures in proteins. J Mol Biol. 1990, 213: 859-883.View ArticlePubMedGoogle Scholar
- Eswar N, Eramian D, Webb B, Shen MY, Sali A: Protein structure modeling with MODELLER. Methods Mol Biol. 2008, 426: 145-159.View ArticlePubMedGoogle Scholar
- Bateman A, Coin L, Durbin R, Finn RD, Hollich V, Griffiths-Jones S, Khanna A, Marshall M, Moxon S, Sonnhammer EL: The Pfam protein families database. Nucleic Acids Res. 2004, D138-141. 32 DatabaseGoogle Scholar
- Notredame C, Higgins DG, Heringa J: T-Coffee: A novel method for fast and accurate multiple sequence alignment. J Mol Biol. 2000, 302 (1): 205-217.View ArticlePubMedGoogle Scholar
- Kuhn HW: The Hungarian method for the assignment problem. Naval Res Logist Quart. 1955, 2: 83-97.View ArticleGoogle Scholar
- Li H, Robertson AD, Jensen JH: Very fast empirical prediction and rationalization of protein pKa values. Proteins. 2005, 61 (4): 704-721.View ArticlePubMedGoogle Scholar
- Davies MN, Toseland CP, Moss DS, Flower DR: Benchmarking pKa prediction. BMC Biochem. 2006, 7: 18-PubMed CentralView ArticlePubMedGoogle Scholar
- Tynan-Connolly BM, Nielsen JE: pKD: re-designing protein pKa values. Nucleic Acids Res. 2006, W48-51. 34 Web ServerGoogle Scholar
- Meiler J, Baker D: ROSETTALIGAND: protein-small molecule docking with full side-chain flexibility. Proteins. 2006, 65 (3): 538-548.View ArticlePubMedGoogle Scholar
- Kirkpatrick S, Gelatt CD, Vecchi MP: Optimization by simulated annealing. Science. 1983, 220: 671-680.View ArticlePubMedGoogle Scholar
- Scott EE, He YA, Wester MR, White MA, Chin CC, Halpert JR, Johnson EF, Stout CD: An open conformation of mammalian cytochrome P450 2B4 at 1.6-Å resolution. Proc Natl Acad Sci USA. 2003, 100 (23): 13196-13201.PubMed CentralView ArticlePubMedGoogle Scholar
- Andreeva A, Howorth D, Chandonia JM, Brenner SE, Hubbard TJ, Chothia C, Murzin AG: Data growth and its impact on the SCOP database: new developments. Nucleic Acids Res. 2008, D419-425. 36 DatabaseGoogle Scholar
- Henn-Sax M, Höcker B, Wilmanns M, Sterner R: Divergent evolution of (βα)8-barrel enzymes. Biol Chem. 2001, 382 (9): 1315-1320.View ArticlePubMedGoogle Scholar
- Zhang Y, Skolnick J: TM-align: a protein structure alignment algorithm based on the TM-score. Nucleic Acids Res. 2005, 33 (7): 2302-2309.PubMed CentralView ArticlePubMedGoogle Scholar
- Lang D, Thoma R, Henn-Sax M, Sterner R, Wilmanns M: Structural evidence for evolution of the β/α barrel scaffold by gene duplication and fusion. Science. 2000, 289 (5484): 1546-1550.View ArticlePubMedGoogle Scholar
- Beismann-Driemeyer S, Sterner R: Imidazole glycerol phosphate synthase from Thermotoga maritima. Quaternary structure, steady-state kinetics, and reaction mechanism of the bienzyme complex. J Biol Chem. 2001, 276 (23): 20387-20396.View ArticlePubMedGoogle Scholar
- Henn-Sax M, Thoma R, Schmidt S, Hennig M, Kirschner K, Sterner R: Two (βα)8-barrel enzymes of histidine and tryptophan biosynthesis have similar reaction mechanisms and common strategies for protecting their labile substrates. Biochemistry. 2002, 41 (40): 12032-12042.View ArticlePubMedGoogle Scholar
- Chaudhuri BN, Lange SC, Myers RS, Davisson VJ, Smith JL: Toward understanding the mechanism of the complex cyclization reaction catalyzed by imidazole glycerolphosphate synthase: crystal structures of a ternary complex and the free enzyme. Biochemistry. 2003, 42 (23): 7003-7012.View ArticlePubMedGoogle Scholar
- Gerber PR, Müller K: MAB, a generally applicable molecular force field for structure modelling in medicinal chemistry. J Comput Aided Mol Des. 1995, 9 (3): 251-268.View ArticlePubMedGoogle Scholar
- Leopoldseder S, Claren J, Jürgens C, Sterner R: Interconverting the catalytic activities of (βα)8-barrel enzymes from different metabolic pathways: sequence requirements and molecular analysis. J Mol Biol. 2004, 337 (4): 871-879.View ArticlePubMedGoogle Scholar
- Sterner R, Merkl R, Raushel FM: Computational design of enzymes. Chem Biol. 2008, 15 (5): 421-423.View ArticlePubMedGoogle Scholar
- Sciretti D, Bruscolini P, Pelizzola A, Pretti M, Jaramillo A: Computational protein design with side-chain conformational entropy. Proteins. 2009, 74 (1): 176-191.View ArticlePubMedGoogle Scholar
- Smith CA, Kortemme T: Backrub-like backbone simulation recapitulates natural protein conformational variability and improves mutant side-chain prediction. J Mol Biol. 2008, 380 (4): 742-756.PubMed CentralView ArticlePubMedGoogle Scholar
- Suárez M, Tortosa P, Carrera J, Jaramillo A: Pareto optimization in computational protein design with multiple objectives. J Comput Chem. 2008, 29 (16): 2704-2711.View ArticlePubMedGoogle Scholar
- Tokuriki N, Stricher F, Serrano L, Tawfik DS: How protein stability and new functions trade off. PLoS Comput Biol. 2008, 4 (2): e1000002-PubMed CentralView ArticlePubMedGoogle Scholar
- Liolios K, Mavromatis K, Tavernarakis N, Kyrpides NC: The Genomes On Line Database (GOLD) in 2007: status of genomic and metagenomic projects and their associated metadata. Nucleic Acids Res. 2008, D475-479. 36 DatabaseGoogle Scholar
- Dantas G, Kuhlman B, Callender D, Wong M, Baker D: A large scale test of computational protein design: folding and stability of nine completely redesigned globular proteins. J Mol Biol. 2003, 332 (2): 449-460.View ArticlePubMedGoogle Scholar
- Dallüge R, Oschmann J, Birkenmeier O, Lücke C, Lilie H, Rudolph R, Lange C: A tetrapeptide fragment-based design method results in highly stable artificial proteins. Proteins. 2007, 68 (4): 839-849.View ArticlePubMedGoogle Scholar
- Caetano-Anollés G, Kim HS, Mittenthal JE: The origin of modern metabolic networks inferred from phylogenomic analysis of protein architecture. Proc Natl Acad Sci USA. 2007, 104 (22): 9358-9363.PubMed CentralView ArticlePubMedGoogle Scholar
- Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE: The Protein Data Bank. Nucleic Acids Res. 2000, 28 (1): 235-242.PubMed CentralView ArticlePubMedGoogle Scholar
- Wang G, Dunbrack RL: PISCES: recent improvements to a PDB sequence culling server. Nucleic Acids Res. 2005, W94-98. 33 Web ServerGoogle Scholar
- Katoh K, Kuma K, Toh H, Miyata T: MAFFT version 5: improvement in accuracy of multiple sequence alignment. Nucleic Acids Res. 2005, 33 (2): 511-518.PubMed CentralView ArticlePubMedGoogle Scholar
- Henikoff S, Henikoff JG: Automated assembly of protein blocks for database searching. Nucleic Acids Res. 1991, 19 (23): 6565-6572.PubMed CentralView ArticlePubMedGoogle Scholar
- Valdar WS: Scoring residue conservation. Proteins. 2002, 48 (2): 227-241.View ArticlePubMedGoogle Scholar
- Guex N, Peitsch MC: SWISS-MODEL and the Swiss-PdbViewer: an environment for comparative protein modeling. Electrophoresis. 1997, 18 (15): 2714-2723.View ArticlePubMedGoogle Scholar
- Cavasotto CN, Orry AJ: Ligand docking and structure-based virtual screening in drug discovery. Curr Top Med Chem. 2007, 7 (10): 1006-1014.View ArticlePubMedGoogle Scholar
- Mestres J: Computational chemogenomics approaches to systematic knowledge-based drug discovery. Curr Opin Drug Discov Devel. 2004, 7 (3): 304-313.PubMedGoogle Scholar
- Zhu X, Lai L: A novel method for enzyme design. J Comput Chem. 2009, 30 (2): 256-267.View ArticlePubMedGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.