Software
In addition to the algorithms and software developed in our previous works, such as the CPSARST [25], (PS)2 [21], and an integrated machine learning and optimization server [16, 17], several third party software packages were applied in this study, inclusive of the sequence alignment program Stretcher (vEMBOSS:6.6.0.0) [32], the comparative modeling software Modeller (v9.19) [43] and the molecular dynamics simulation package GROMACS (v2016.4) [44]. Many steps of the proposed CirPred method were computationally expensive. The distributed computation technique we developed for the iSARST protein structural similarity search server [29] was extensively used to speed up research progress and the CirPred web server. All protein structures shown in this report were rendered using PyMOL [45].
Datasets
The dataset of CP pairs from CPDB
Pairs of circular permutants sharing sequence identities ≥ 10% were downloaded from the CPDB [22]. Since the alignment data of CPMs provided by CPDB were computed by an old version of CPSARST [25], the structural alignment measures (e.g., alignment ratio and RMSD) and CP sites of these CP pairs were updated by the current CPSARST implemented in the iSARST server [29]. This dataset of 1568 CP pairs (Additional file 3) was used to perform large-scale evaluations of CirPred.
The CPDB linker dataset
CP-based structural alignments of CP pairs from CPDB were performed by CPSARST to establish this dataset. For each CP pair, as illustrated in Fig. 5, two linkers were obtained based on the algorithm stated below,
-
1.
Let Q and S represent the two proteins of a CP pair.
-
2.
Let xq represents the first residue of Q aligned with S, and xs represents the residue of S aligned with xq. Similarly, let yq be the last aligned residue of Q and its equivalent residue on S be ys.
-
3.
The candidate linker for Q was computed as the residues between ys and xs on S (excluding ys and xs). Let l be the number of residues of this candidate linker.
-
4.
Let m be the number of unaligned residues on the N-terminus of Q in front of xq and n be the number of unaligned residues on the C-terminus of Q after yq.
-
(i)
If m + n ≥ l, no linker was required to connect the native termini of Q.
-
(ii)
Otherwise, the linker for Q would be refined as the residues between ys + n and xs − m on S (excluding ys + n and xs − m).
-
5.
Referring to Fig. 5a, let x, y, m, and n be replaced with u, v, i, and j, respectively. Repeat steps 2–4 to find out the linker for protein S.
The CPDB termini linker dataset is available in Additional file 6. If a protein has more than one permutant, it may have multiple linkers in this dataset.
Dataset S, a synthetic termini linker dataset
This dataset consisted of in silico circularly permuted proteins, each with a surface polypeptide fragment removed to serve as the known missing linker. The following steps were taken to establish this dataset.
-
1.
Let the position of a residue be represented by its α-carbon (Cα).
-
2.
Using the 100% sequence identity non-redundant subset of the PDB (snapshot date: Dec. 25th, 2016), calculate the average distance, denoted by d, between two consecutive residues in a protein. The result was d = 3.36 (Å).
-
3.
Compute the distance between the N- and C-terminal residues, denoted by D, for each protein in the above PDB subset (73,073 proteins) and filter out any protein with D > 3d. Only 3027 proteins remained in the dataset, in which every protein possessed a short distance between termini, i.e., within 3 residues on average.
-
4.
Proteins sharing ≥ 25% sequence identity with any protein in the CPDB termini linker dataset (Additional file 6) were further discarded. The resultant dataset (1802 proteins) was thus suitable, as an independent dataset, for assessing the machine learning predictor trained with the CPDB linker dataset.
-
5.
For each protein remaining in this dataset, a CPM was created as stated below.
-
(i)
Since CP tends to occur at residues exposed to the solvent [17], each residue’s relative solvent accessible surface area (RSA) was computed.
-
(ii)
Consecutive residues with RSA > 20% were considered an exposed fragment on the surface of the protein. A protein might possess several exposed fragments.
-
(iii)
Randomly select one exposed fragment, set the CP site at the carboxyl-end of the fragment.
-
(iv)
Since the native N- and C-termini of the protein were close, they were directly connected in silico. New termini were created at the selected CP site. This step was performed by renumbering residues in the PDB file in a way described in the next subsection.
-
(v)
Remove the selected exposed fragment from the PDB file.
-
(vi)
Refine the structure of this CPM by energy minimization using GROMACS [44].
The exposed peptide fragments removed in the above procedure were the known linkers for these in silico synthetic CPMs. A complete list of the PDB entries, CP sites, and linker sequences of this Dataset S is available in Additional file 9.
Generating the pseudo-circularly-permuted template
Because a CPM usually folded into a structure similar to the native structure [1, 9], the native structure would be a good template when modeling a CPM. However, since the termini of the CPM had changed, the native structure should be manipulated first. CP of a protein structure is actually an amino acid sequence rearrangement such that the N- and C-terminal proportions of the sequence were interchanged. As illustrated in Fig. 5b, making a circularly permuted protein in silico was equivalent to cutting the N-terminal proportion and attaching it to the end of the C-terminal proportion. At the sequence level, this manipulation was a simple text string rearrangement. At the structure level with a PDB file, as described below, it was more complicated.
-
1.
To ensure that the PDB file’s residues appeared in the same order in the template and model structures, starting with 1, assign serial numbers to the residues from the top of the file.
-
2.
If the CP site was n, move n − 1 residues from the top of the file to the bottom.
-
3.
Starting with 1, renumber all residues and atoms from the top of the file.
-
4.
In case there were missing or structurally undetermined atoms in the PDB file, the reduce [46] and teLeap [47] programs were utilized to add/restore them.
Comparative structure modeling
The comparative modeling procedure of CirPred consisted of three major steps. First, the target CPM sequence was generated in the way stated above, unless provided by the user. For instance, Mode 2 of the CirPred web server required the user to input the sequence of CPM. Second, the global sequence alignment between the target CPM and template (the pseudo-CP template) was obtained using three methods: the algorithm combining amino acid sequence and secondary structure information that we developed for (PS)2 [21], the Smith-Waterman algorithm [31], and the Stretcher program [32]. Among the alignment results, the one with the largest number of aligned residues was selected. Third, comparative structure modeling was performed using the (PS)2 procedure according to the sequence alignment. In addition to these three pipelined steps, according to the requirements of experiments or web users, the linker design procedure might be integrated into the pipeline (Fig. 1); and, at the end of the pipeline, model refinement and MD simulation procedures might be activated.
Linker design
Overview of the linker design protocol
The linker design algorithm of CirPred utilized several machine learning and structural modeling methods. The basic idea was to coarsely determine the position of residues of a linker at first and then predict for each residue position the amino acid. The whole protocol is outlined as follows,
-
1.
Determination of the coarse residue positions of a linker.
-
(i)
Let l denote the number of residues of the linker.
-
(ii)
Randomly make t temporary linkers according to amino acid propensities of known CPM linkers (Additional file 11).
-
(iii)
Before making sequence alignment between the target CPM and the pseudo-CP template, insert each temporary linker into the target CPM.
-
(iv)
Based on the sequence alignment with each temporary linker inserted, generate a coarse model of the target CPM using Modeller, which also computed its DOPE (discrete optimized protein energy) score [43].
-
(v)
Pick up t' from the t coarse models with the lowest energy scores for the next step. In this study, t and t' were set to be 200 and 20, respectively.
-
2.
Prediction of the amino acid composition of the linker.
-
(i)
Take one coarse model of the target CPM, for each residue position of the linker, compute the feature values required to perform predictions.
-
(ii)
For each residue position, predict the occurrence probabilities of 20 amino acids by machine learning. See later subsections for algorithms of the feature set and machine learning techniques we applied.
-
3.
Amino acid sequence assignments of the linker. According to the occurrence probabilities of amino acids, randomly assign an amino acid to each residue position of the linker. For instance, if the probabilities of valine and leucine at position 1 were 70% and 30%, respectively, the chance that position 1 was assigned with a valine was 70%, while the chance of leucine was 30%. Repeat this step for k times, creating k candidate linker sequences (k = 10 in this study).
-
4.
Generation of a candidate linker sequence pool. Repeat steps 2 and 3 until all coarse models were applied. There would be t' × k candidate linker sequences generated to form the pool.
-
5.
Computation of energy score for linkers in the pool.
-
(i)
Take one candidate linker, insert it into the target CPM sequence, and make a sequence alignment with the pseudo-CP template.
-
(ii)
Based on the alignment, generate m models of the target CPM, each with a DOPE score using Modeller [43] (m = 10 in this study).
-
(iii)
Select the model with the lowest DOPE to represent the quality of this candidate linker.
-
(iv)
Repeat (i) to (iii) until all candidate linkers in the pool were processed.
-
6.
Selection of the final candidate linker(s). Find the candidate linker(s) with the lowest energy score(s) to be the final designed linker(s). For all experiments of this study, except the ones of Additional file 8, only one best-designed linker was taken. In our web implementation, several candidate linkers with low energy scores are reported to the user.
The feature set for prediction
We speculated that if the position of a residue of interest is known in a protein structure, information (or “features”) obtained from its surroundings could be utilized to predict what amino acid the residue is. For a known linker, its residue positions were readily available in the PDB file. For a linker to be designed, residue positions could be coarsely simulated as described above. Features obtained from known linkers were used to train the machine learning kernels for predicting the amino acid composition of a linker to be designed. The feature set used in this study comprised 20 features computed according to this equation,
$$F_{A} (i) = \sum\limits_{a = 1}^{{n_{A} (i,d_{r} )}} {\frac{1}{{d_{ia}^{2} }}}$$
(1)
where i denotes the residue of interest, A represents the type of amino acid, FA(i) means the feature value of amino acid A for i, a stood for a residue of A surrounding i, and dia is the distance between residues i and a. The nA(i, dr) is the number of A residues located within the radius of dr (which was set as 20 Å in this work) from i. The position of a residue is represented by its Cα.
For a given A, this feature describes how many and how close this type of amino acids appeared near the residue of interest. A high feature value means that many such amino acids are nearby, or the distances between them and the residue of interest are short. Because the sequence of a linker to be designed is unknown, there is no prior knowledge of the amino acids adjacent to each other. Therefore, when computing the feature values, no matter for a known linker or a linker to be designed, five adjacent residues before and five after the residue of interest are discarded.
Establishing the predictor by machine learning methods
For any residue in a known linker, in addition to feature values, the “answer” must be provided for machine learning. The answer in the case of linker design should be the residue’s amino acid, and thus there should be 20 candidate answers for learning and prediction. The computation loading of most machine learning algorithms would increase significantly as the number of candidate answers increases. Since 20 answers were beyond our hardware’s computation capacity, we classified 20 amino acids into three types (hydrophilic, hydrophobic, and neutral) and reduced the number of candidate answers to 3.
Previously as we studied viable CP sites, we developed an artificial intelligence system that integrated several machine learning, random sampling, and parameter optimization algorithms [16, 17]. This system was applied in this work, and the recruited algorithms included bootstrap sampling, decision tree, and artificial neural network. After obtaining the answers and feature values from a set of known linkers, 250 and 50 bootstrap samples were made to train minor predictors of decision tree and artificial neural network, respectively. The final predictor was then formed by collecting the minor ones, which made predictions by vote. With this procedure, the probabilities of candidate answers for a given case could be estimated as the proportions of votes the answers received.
For preliminarily assessing the performance of the feature set we designed, tenfold cross-validation was performed using the CPDB linker dataset. The average accuracy (rate of correct predictions) was 67.5%. In this preliminary test, for each testing case, only the best-voted candidate answer was considered. However, in the actual application, the probabilities of candidate answers played a prominent role. This procedure constituted just a part of the CirPred linker design protocol (see the previous subsection). A thorough assessment of the complete linker design protocol had been carried out, and the performance is stated in “Results” section.
Tuning the probability estimate of amino acids in a designed linker
Since the 20 amino acids were reduced to 3 classes for machine learning, the predicted amino acid probability estimates were sketchy. Before these estimates could be used in the actual linker design protocol, they should be restored to 20 amino acids.
Linkers of the training dataset had been analyzed, such that for each class, proportions of amino acids were known. After the “class-level” prediction, the occurrence probabilities of 20 amino acids at a given residue were finely estimated using the equation,
$$pe\left(A\right)=pe(C)\times p(A|C)$$
(2)
where pe() and p() stand for the probability estimate and proportion, respectively, and A and C denote amino acid and class. For example, if the probability estimate of the hydrophilic class for a residue is 0.80 and the proportion of aspartic acid in this class is 0.25, the probability estimate of aspartic acid at this position is 0.80 × 0.25 = 0.20.
Length estimate of the linker
In our experiments, the length of a linker to be designed could be obtained from the alignment between CPMs. However, before using the CirPred server for linker design, the user might not know how many residues there should be in the linker. Hence, we proposed an algorithm to estimate the length of the linker to be designed. For a protein with a distance of N- and C-termini (represented by Cα atoms) < 20 Å, the length of the linker would be estimated using this equation,
$$l=\mathrm{Round}\left(21.8\times \mathrm{ln}\left(b\right)-52.5\right);\,\,\, l\ge 0$$
(3)
where l and b denote the number of residues of the linker and the distance of termini, respectively. This equation was established according to CPDB. The 20 Å cutoff was determined based on the fact that the length of known linkers (l) and the distance of the termini they bridged (b) fit the equation within it and that for termini more distant than it, there seemed no rule between l and b (Additional file 12). For proteins with a long distance between the termini, we proposed the following algorithm to estimate the length of the linker,
-
1.
Starting with l = 20 (residues), make t temporary linkers of length l based on the amino acid propensities of the CPDB linker dataset (Additional file 11).
-
2.
Add each temporary linker to the target CPM; according to the sequence alignment between the target CPM and the pseudo-CP template, generate a model of the CPM and compute its energy score.
-
3.
Among the t temporary linkers, find the one with the lowest energy score to represent linkers of length l.
-
4.
Increase l by five residues and repeat steps 1–3 until l reaches a given maximum. We empirically suggest the maximum be 1/5 of the size of the target protein.
-
5.
Find the value of l producing the lowest energy score. Scan the length range from l − 4 to l + 4 and compute these lengths’ energy score by repeating steps 1–3.
-
6.
The length of the linker to be designed is estimated as the length of temporary linkers achieving the lowest energy score.
For evaluating this algorithm, proteins of the CPDB linker dataset with linker length ≥ 20 residues were tested with the DOPE energy score. The average difference in length between the known and designed linkers was 20.3%. The performance of this algorithm was acceptable; it was very time-consuming, however. In the CirPred web server, Eq. (3) is applied by default unless the user changes it.
Evaluation of the linker design protocol by multiple rounds of independent test
Using the CPDB linker dataset (Additional file 6), we performed an independent test of 500 rounds to evaluate the proposed linker design protocol. This procedure was an improved tenfold cross-validation test. It ensured the independence between the training and testing data for each round and reduced the imprecision of evaluation. The training datasets, independent test datasets, and results of each round are available in Additional file 7. The procedure of this test is provided below,
-
1.
Since one protein might have multiple CPMs in CPDB, all CPDB linkers were grouped according to their PDB entries.
-
2.
Repeat the following steps 500 times,
-
(i)
Randomly divide the grouped linkers into a training Dataset T and an independent test Dataset I possessing 90% and 10% of the proteins.
-
(ii)
For ensuring that Dataset I was highly different from Dataset T, any protein in Dataset I sharing ≥ 15% sequence identity with any protein from Dataset T was discarded.
-
(iii)
A machine-learning linker predictor was established using Dataset T as the training data based on the CirPred linker design protocol.
-
(iv)
For each case in Dataset I, remove the known linker from the protein, make a pseudo CPM, and then input this pseudo CPM to the CirPred system to redesign the linker using the machine-learning linker predictor established in the previous step. Sequence similarities and the difference of potential energy scores between the designed and known linkers were computed using the BLOSUM45 matrix [48] and GROMACS [44].
-
3.
Statistically analyze the sequence similarity and energy data obtained from the 500 rounds.
Refinement of the generated model
As shown in Fig. 1, a protein is virtually divided by the CP site (equivalent to the middle point of the linker of the native protein), i.e., the hinge, into two proportions. If the proportions have the same size, take the C-terminal proportion to be the small one. An axis r is formed between the hinge and the center of mass of the small proportion. Besides, a plane P is defined by these two points and the center of mass of the large proportion. Then, two kinds of movements are made to the small proportion. First, with the hinge fixed, rotate the small proportion on plane P with a pause per 20 degrees. Second, at each pause, rotate it around axis r with a snapshot per 20 degrees. There are 162 snapshots in total (180°/20° × 360°/20°). Finally, compute the energy score of these snapshots and select the one with the lowest energy to be the refined model of the CPM.
Molecular dynamics simulations
Molecular dynamics simulations were applied as a final optimization of the model. The model was submerged in a box filled with water molecules. When necessary, a suitable amount of Na+ or Cl– ions were added into the box to neutralize the system’s charge. The neutralized system was first energy minimized before the full MD simulations, in which two rounds of annealing, each with temperature points 298 K, 320 K, and 298 K, were performed. Without applying the CP-site-hinge model refinement described above, the number of steps made in MD was set as 5 million, and the step size was two femtoseconds (total simulation time = 10 ns). If the model refinement was applied, we found that 0.5 million steps were enough to produce results with equivalent quality in terms of the alignment ratio and RMSD between the actual and modeled structures. In all modeling experiments performed in this study, both 5 and 0.5 million time steps were tested, and the reported data were based on the 0.5-million-step results. To reduce server machines’ loading, we set the default number of time steps as 0.1 million in the implemented CirPred server, and the user could change the setting.