Evaluating the accuracy of protein design using native secondary sub-structures

Background: Because the function of a protein depends on its structure, two challenging problems, Protein Structure Prediction (PSP) and Inverse Protein Folding (IPF), are investigated. Despite its essential applications, IPF has not been investigated as much as PSP. The ultimate goal of IPF, or protein design, is to create proteins with enhanced properties or even novel functions. One of the major computational challenges in protein design is its large sequence space: searching through all plausible sequences is impossible. Since protein secondary structure represents an appropriate primary scaffold of the protein conformation, studying the Protein Secondary Structure Inverse Folding (PSSIF) problem is a substantial step forward in protein design, as it can reduce the search space. In this paper, a novel genetic algorithm that uses native secondary sub-structures is proposed to solve the PSSIF problem. In essence, evolutionary information can lead the algorithm to design amino acid sequences appropriate to the target secondary structures. Furthermore, these sequences can be folded to tertiary structures that are similar to their reference 3D structures.
Results: The proposed algorithm, called GAPSSIF, benefits from evolutionary information obtained from solved proteins in the PDB. We therefore construct a repository of protein secondary sub-structures to accelerate convergence of the algorithm. The secondary structure of sequences designed by GAPSSIF is comparable with that of sequences obtained by Evolver and EvoDesign. Although tertiary structure features are not explicitly considered in the algorithm, the structural similarity between native and designed sequences reaches acceptable values.
Conclusions: Using the evolutionary information of native structures can significantly improve the quality of designed sequences. In fact, combining this information with effective features such as solvent accessibility and torsion angles leads the IPF problem to an efficient solution. GAPSSIF can be downloaded at http://bioinformatics.aut.ac.ir/GAPSSIF/.
Electronic supplementary material: The online version of this article (doi:10.1186/s12859-016-1199-y) contains supplementary material, which is available to authorized users.


Background
Proteins are building blocks of life and serve major roles in the body. Since the function of a protein depends on its structure, experimental methods are applied to determine tertiary structure. These methods are not only time-consuming and expensive, but they also cannot build a proper atomic model for some proteins. Thus, computational methods have become favorable approaches for protein structure prediction (PSP) over the last two decades.
In the PSP problem, an amino acid sequence is given as input and the goal is to predict the structure best suited to its function. A related essential problem, called protein design or inverse protein folding (IPF) [1][2][3], is to identify a sequence of amino acids whose tertiary structure corresponds to a given target structure. Indispensable applications of IPF in drug design, medicine and advanced disease treatment have motivated scientists to develop methods for designing appropriate sequences. Unfortunately, because IPF is NP-hard [4], no efficient exact algorithm is known for this problem.
The first attempts to solve this problem date back to the late 1980s and mainly focused on the amino acid compositions of designed sequences [1]. In 1988, Regan and DeGrado reported a somewhat successful design for a 4-helix bundle structure [5]. Later, Yue and Dill [2] developed a highly simplified model, called Hydrophobic-polar, embedded in a cubic lattice. This model was developed according to the structural pattern of globular proteins, in which hydrophobic and polar residues form the internal core and the surface of the protein, respectively. Many attempts have been made to extend lattice-based methods, such as approximation algorithms [6,7].
In 1994, Jones developed a multi-objective genetic algorithm to solve the IPF problem, in which the input of the algorithm is a protein secondary structure [8]. Subsequently, improvements in proteomics, including protein force fields [9][10][11] and rotamer libraries [12], enabled scientists to address this problem at the atomic level. In this era, several algorithms were developed to find the best sequences in the solution space using energy functions [13,14]. In addition, they take into account the effects of amino acid conformations, commonly in the form of "rotamer libraries". The essence of IPF solutions, up to 2012, was to find an amino acid sequence that folds to a low-energy structure by assigning more hydrophobic residues or by minimizing a protein energy function.
Because protein design models simplify the driving forces of folding, for example through a discrete rotamer space and approximate energy functions, these approaches could not fully achieve the goal of IPF. Until 2012, the evolutionary information available in protein databases had been ignored. Recently, EvoDesign [3,15] has been developed to take evolutionary information into account in the form of profile collections obtained from native structures in the PDB. As noted in [3], several methods in the literature were developed to design specific proteins, but modern methods should be able to design sequences for any protein scaffold. Unlike the abovementioned de novo protein design algorithms, Evolver [16,17] takes a different approach: it evolves three different types of protein sequences for each input target structure using simulated annealing. The first is the native sequence of the input structure extracted from the PDB, the second is obtained by shuffling the native sequence, and the last is a random protein-like sequence.
Since IPF is the reverse procedure of protein folding, any suitable method for this problem should employ the driving forces of folding. As folding initially involves the formation of regular structures, in particular alpha helices and beta sheets, secondary sub-structures are useful in solving the PSSIF problem. These regular structures form an appropriate scaffold of the protein tertiary structure; furthermore, they can affect the amino acid composition of the primary structure through the evolutionary process. In general, the importance of the PSSIF problem arises from the fact that secondary structure is one of the most effective features in the tertiary structure and function of proteins.
This paper uses native secondary sub-structures as evolutionary information to improve the design process. A novel genetic algorithm, named GAPSSIF, that uses these sub-structures is proposed to solve the PSSIF problem. To this end, a protein repository is constructed by extracting all possible protein secondary sub-structures from the PDB. In this algorithm, each individual takes advantage of a knowledge-based procedure using the sub-structure repository. In essence, evolutionary information can lead the algorithm to design amino acid sequences appropriate to the target secondary structures. Furthermore, these sequences can be folded to tertiary structures that are similar to their reference 3D structures. GAPSSIF is compared with two well-known algorithms, EvoDesign and Evolver. The assessment of the proposed algorithm on 89 non-redundant proteins confirms its strong performance in solving PSSIF. In addition, the predicted tertiary structures of the designed sequences show acceptable results.

Method
In this section, a genetic algorithm [18] called GAPSSIF is presented to solve the PSSIF problem. This algorithm makes use of evolutionary information through PDB secondary structure elements. Thus, before explaining the proposed algorithm, the construction of a repository of protein secondary sub-structure elements is described.

Building up sub-structure repository
A collection of 102101 proteins was derived from the PDB secondary structures file generated using DSSP [19]. It contains all existing amino acid sequences as well as their secondary structures. Since the PDB is highly redundant, proteins with more than 90 % sequence identity were omitted to avoid biasing the designed sequences toward a specific group of proteins. Afterwards, the corresponding amino acid fragments were extracted for all helices (H), beta sheets (E) and all other kinds of secondary structure elements (C).
Finally, by fetching the amino acid fragments for each distinct sub-structure, an Amino Acid Fragment Repository (AFR) is formed, which greatly increases the precision of the proposed algorithm. This repository comprises three main clusters: helix, beta sheet and coil. Each cluster is divided into sub-clusters, and each sub-cluster contains non-identical amino acid fragments of a specified length. For example, the sub-cluster named "H11SC" includes fragments of 11 amino acids whose secondary structure is helix. In general, a sub-cluster is denoted by ekSC, where e ∈ {E, H, C} and k denotes the length of sub-structure e. There are 306 sub-clusters in total in the repository: 38 for beta strands, 141 for helices and 127 for coils. Clearly, some lengths do not exist among PDB peptides.
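The repository construction described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `reduce_state` and `build_afr` are hypothetical, redundancy filtering is assumed to have been done beforehand, and every secondary structure character other than H and E is assumed to map to C.

```python
from collections import defaultdict
from itertools import groupby


def reduce_state(ss_char):
    """Map a secondary-structure character to H, E or C (assumption: all others become C)."""
    return ss_char if ss_char in ("H", "E") else "C"


def build_afr(proteins):
    """Build {sub-cluster name: set of fragments} from (sequence, structure) pairs.

    Sub-clusters are named ekSC, e.g. "H11SC" for helical fragments of length 11.
    """
    afr = defaultdict(set)
    for sequence, structure in proteins:
        if len(sequence) != len(structure):
            raise ValueError("sequence and structure must have equal length")
        start = 0
        # Each run of identical states is one secondary sub-structure element.
        for state, run in groupby(reduce_state(c) for c in structure):
            k = len(list(run))
            afr[f"{state}{k}SC"].add(sequence[start:start + k])
            start += k
    return dict(afr)


afr = build_afr([("MKVLAAGIVE", "CCHHHHHEEC")])
```

Using sets keeps the fragments of each sub-cluster non-identical, as the text requires.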

The proposed algorithm to solve PSSIF problem
In this subsection, we describe the steps of GAPSSIF for solving the PSSIF problem, that is, designing amino acid sequences that fold to a target secondary structure. The following mathematical definition outlines the PSSIF problem:

$$\text{Input: } SS = ss_1 \ldots ss_l, \quad ss_i \in \{H, E, C\}$$
$$\text{Output: } S = s_1 \ldots s_l, \quad s_i \in X = \{\text{20 standard amino acids}\}$$

Algorithm 1 depicts an overview of the proposed method. In the first step of GAPSSIF, the input secondary structure is split into a set of sub-structures (elements), as described below:

$$sub_{ss} = \{\langle \sigma, k, e \rangle_j \mid j = 1, \ldots, |sub_{ss}|\}$$

where σ, k and e indicate, respectively, the start position, length and type of the jth element in the l-length target structure.
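As an illustration of this first step, the following minimal sketch splits a target structure string into elements. The function name `split_structure` and the choice of 1-based start positions are assumptions made for the example.

```python
from itertools import groupby


def split_structure(ss):
    """Return [(sigma, k, e), ...]: 1-based start, length and type of each element."""
    elements = []
    sigma = 1
    for e, run in groupby(ss):
        k = len(list(run))
        elements.append((sigma, k, e))
        sigma += k
    return elements


print(split_structure("CCHHHHEEC"))
```

The number of tuples returned corresponds to |sub ss|, which later determines the population size.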

Creating initial population
This subsection describes the second step of GAPSSIF (algorithm 1), which creates an appropriate initial population. Each individual of the population is a 2-tuple < S i , Fit i > whose S i is made up of the 20 standard amino acids as follows:

$$S_i = F_1 F_2 \ldots F_{|sub_{ss}|}$$

where F j is the amino acid fragment corresponding to the jth element, <σ, k, e > j , of sub ss , and it is built as follows:

$$F_j = \begin{cases} \text{ExactFragment}(ekSC) & \text{if } ekSC \in AFR \text{ and } r \in [0, 0.7) \\ \text{Neighboring}(e, k) & \text{if } (ekSC \in AFR \text{ and } r \in [0.7, 0.95)) \text{ or } (ekSC \notin AFR \text{ and } r \in [0, 0.9)) \\ \text{ChoFa-Generator}(e, k) & \text{otherwise} \end{cases} \qquad (2)$$

where ekSC represents a sub-cluster in AFR (see "Building up sub-structure repository") and r ∈ [0,1) is a random value. In equation (2), the ExactFragment procedure is applied if the intended sub-cluster exists and the random value r ∈ [0,0.7). This procedure randomly fetches an amino acid fragment from ekSC. The second case occurs if the intended ekSC exists and r ∈ [0.7,0.95), or if the intended ekSC does not exist in AFR and r ∈ [0,0.9). In this case, the Neighboring procedure is employed to edit a shorter or longer element of the repository (see algorithm 2). Steps 1, 2 and 3 of this algorithm find the sub-cluster ek′SC whose length k′ differs least from k. Afterwards, in step 4, the ExactFragment procedure is used to fetch a fragment from ek′SC, and step 5 modifies the fetched fragment to length k.
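The three cases of equation (2) can be sketched as follows. This is an illustration, not the authors' Perl implementation: the function names, the dictionary layout of AFR (sub-cluster name mapped to a set of fragments) and the `generator` callback standing in for the third case are hypothetical. In particular, how algorithm 2 modifies a fetched fragment to length k is not given in the text, so truncation and random padding are assumed here.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def exact_fragment(afr, e, k, rng):
    """Randomly fetch one fragment from sub-cluster ekSC."""
    return rng.choice(sorted(afr[f"{e}{k}SC"]))


def neighboring(afr, e, k, rng):
    """Fetch from the sub-cluster of type e with the closest length k' != k, then resize to k."""
    lengths = [int(name[1:-2]) for name in afr
               if name[0] == e and name.endswith("SC") and int(name[1:-2]) != k]
    if not lengths:
        return None  # no neighbor of this type exists
    nearest = min(lengths, key=lambda n: (abs(n - k), n))
    fragment = exact_fragment(afr, e, nearest, rng)
    if len(fragment) >= k:
        return fragment[:k]  # assumption: a longer fragment is truncated
    # Assumption: a shorter fragment is padded with random residues.
    padding = "".join(rng.choice(AMINO_ACIDS) for _ in range(k - len(fragment)))
    return fragment + padding


def build_fragment(afr, e, k, rng, generator):
    """Choose among the three cases of equation (2) using a random value r in [0, 1)."""
    r = rng.random()
    exists = f"{e}{k}SC" in afr
    if exists and r < 0.7:
        return exact_fragment(afr, e, k, rng)
    if (exists and r < 0.95) or (not exists and r < 0.9):
        fragment = neighboring(afr, e, k, rng)
        if fragment is not None:
            return fragment
    return generator(e, k, rng)  # third case: propensity-based generation
```

Passing an explicit `random.Random` instance as `rng` makes runs reproducible, which is convenient when comparing independent executions.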
In case 3 of equation (2), that is, if neither of the first two cases is satisfied, the required fragment is generated by the ChoFa-Generator procedure using the ChoFaWeight (CFW) function. According to the Chou-Fasman analysis [20] of secondary structure dependent propensities, amino acids have different tendencies to participate in each secondary sub-structure or element. Therefore, the CFW function applies roulette wheel selection over the 20 standard amino acids in order to select an appropriate residue.
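Roulette wheel selection picks each residue with probability proportional to its weight. The sketch below shows the mechanism only; the function names are hypothetical and the weights in the usage example are toy values for illustration, not the published Chou-Fasman propensities.

```python
import random


def roulette_pick(weights, rng):
    """Pick one key of `weights` with probability proportional to its non-negative weight."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must contain at least one positive value")
    threshold = rng.random() * total
    cumulative = 0.0
    for residue, weight in weights.items():
        cumulative += weight
        if threshold < cumulative:
            return residue
    return residue  # guards against floating-point round-off


def chofa_generator(e, k, rng, propensities):
    """Generate a k-length fragment for element type e from per-type residue weights."""
    return "".join(roulette_pick(propensities[e], rng) for _ in range(k))


# Toy weights for illustration only (not real Chou-Fasman values).
toy = {"H": {"A": 3.0, "E": 2.0, "G": 1.0}}
print(chofa_generator("H", 8, random.Random(1), toy))
```

A residue with weight zero is never selected, and doubling a weight doubles the chance of that residue being drawn.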
Accordingly, using evolutionary information in creating amino acid sequences gives a better starting point for searching the sequence space and substantially accelerates convergence of the algorithm. To calculate the fitness value of a generated amino acid sequence (S i ), two main steps are taken. First, the secondary structure of S i , called PSS i = pss i1 …pss il , is predicted by Reprof [21]. Second, the fitness Fit i is computed as the similarity between the predicted secondary structure, PSS i , and the target structure, SS. Finally, |sub ss | individuals are generated using the aforementioned processes to construct the initial population.
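The second step can be illustrated with a simple per-position comparison. The exact similarity formula is not reproduced here, so this sketch assumes a Q3-style score, the percentage of positions at which the predicted and target states agree; the function name `fitness` is hypothetical, and the predicted string is assumed to come from an external predictor such as Reprof.

```python
def fitness(predicted, target):
    """Percentage of positions where predicted and target secondary structure agree."""
    if len(predicted) != len(target):
        raise ValueError("predicted and target structures must have equal length")
    if not target:
        raise ValueError("target structure must not be empty")
    matches = sum(p == t for p, t in zip(predicted, target))
    return 100.0 * matches / len(target)


print(fitness("HHHC", "HHCC"))
```

Under this assumption a value of 100 means the designed sequence is predicted to adopt exactly the target secondary structure.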

Enriching amino acid individuals
In the third step of the proposed algorithm (algorithm 1), each individual is enriched using the Gibbs Sampling algorithm. This method applies the AFR Mutation (AFRM) operation iteratively to strengthen the individuals using the evolutionary information in AFR.
In the first step of the Gibbs Sampling method, the AFRM operation takes individual S i and its predicted secondary structure, PSS i , and generates a pattern P'SS i that marks the incorrectly predicted positions of PSS i , that is, the positions j at which pss ij differs from ss j . Then, pattern P'SS i is split into a set of secondary sub-structures, sub P'SS i , in the same way as the target structure is split into sub ss . Next, for each element in the set sub P'SS i , a fragment is built according to equation (2). Finally, these fragments replace the corresponding fragments in sequence S i to generate a new sequence called newS i . The fitness value of newS i is then computed and named newFit i .
In the second step, the Gibbs Sampling method replaces sequence S i with the designed sequence newS i when newFit i is greater than Fit i . The first and second steps of Gibbs Sampling are repeated |sub ss | times.
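The enrichment loop can be sketched as below. All names are hypothetical, and several details are assumptions: each mismatched run is rebuilt using the target state at those positions, `predict` stands in for an external secondary structure predictor, `build(e, k)` stands in for the fragment construction of equation (2), and `score` is any similarity function where larger is better.

```python
from itertools import groupby


def mismatch_elements(predicted, target):
    """Return (start, length, target_type) for each run of incorrectly predicted positions (0-based)."""
    keyed = [t if p != t else None for p, t in zip(predicted, target)]
    elements, pos = [], 0
    for key, run in groupby(keyed):
        k = len(list(run))
        if key is not None:
            elements.append((pos, k, key))
        pos += k
    return elements


def afrm(sequence, predicted, target, build):
    """Replace every mismatched run of `sequence` with a freshly built fragment."""
    new_seq = list(sequence)
    for start, k, e in mismatch_elements(predicted, target):
        new_seq[start:start + k] = build(e, k)
    return "".join(new_seq)


def enrich(sequence, target, predict, build, score, rounds):
    """Repeat the mutation `rounds` times, keeping a candidate only if its fitness improves."""
    best, best_fit = sequence, score(predict(sequence), target)
    for _ in range(rounds):
        candidate = afrm(best, predict(best), target, build)
        candidate_fit = score(predict(candidate), target)
        if candidate_fit > best_fit:
            best, best_fit = candidate, candidate_fit
    return best, best_fit
```

In the setting described above, `rounds` would be |sub ss |, the number of elements of the target structure.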

Constructing Hit Map Repository
In step 4 of GAPSSIF (algorithm 1), the Hit Map Repository (HMR) is constructed to contain all correctly designed subsequences whose structures are identical to the corresponding elements of the target structure. Each identical element is represented as follows:

$$\langle key, s \rangle$$

where "key" shows the structure of subsequence s in structure PSS. For instance, <(B4, H3, C2), s > indicates that there is a subsequence s in the designed sequence whose structure consecutively contains a beta sheet, an alpha helix and a coil with lengths four, three and two, respectively.
In fact, the hit map repository is the result of a complementary collaboration between AFR and the secondary structure prediction algorithm. That is, HMR comprises those fragments that are accepted by both the evolutionary process and the secondary structure predictor. In other words, HMR consists of multi-structural fragments that are simulated during the algorithm using both the prediction algorithm and AFR.
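One way to picture the hit map repository is as a dictionary keyed by a tuple of (type, length) pairs. The sketch below is an assumption about the data layout, not the authors' implementation: the names `elements_of` and `update_hmr` are hypothetical, and it is assumed that every run of consecutive, correctly predicted target elements is stored, including single elements.

```python
from collections import defaultdict
from itertools import groupby


def elements_of(ss):
    """Split a structure string into (start, length, type) elements (0-based start)."""
    out, start = [], 0
    for e, run in groupby(ss):
        k = len(list(run))
        out.append((start, k, e))
        start += k
    return out


def update_hmr(hmr, sequence, predicted, target):
    """Add every run of consecutive correctly predicted target elements to `hmr`."""
    elems = elements_of(target)
    correct = [predicted[s:s + k] == target[s:s + k] for s, k, _ in elems]
    for i in range(len(elems)):
        for j in range(i, len(elems)):
            if not correct[j]:
                break  # a wrongly predicted element ends the run
            key = tuple((e, k) for _, k, e in elems[i:j + 1])
            begin = elems[i][0]
            end = elems[j][0] + elems[j][1]
            hmr[key].add(sequence[begin:end])
    return hmr


hmr = update_hmr(defaultdict(set), "MKVLAGG", "CCHHHHC", "CCHHHCC")
```

In this example the leading coil and the helix are predicted correctly, so the coil, the helix and the combined coil-helix fragment are stored, while the trailing coil is not.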

Mutation operations
GAPSSIF employs two mutation operations to mutate individuals in step 5. Each individual is mutated using one of two operations chosen at random: AFRM or HMR Mutation (HMRM). The first operation, AFRM, was described as part of the Gibbs Sampling method in "Enriching amino acid individuals". The second one, HMRM, employs the hit map repository to mutate a designed sequence, S i , and generate an offspring named newS i , as described in algorithm 3. The HMRM operation tries to find a proper multi-structural fragment in HMR to place in S i . Finally, the fitness values of the mutated individuals are computed and the mutated individuals are added to the population P as new individuals.
Then, in step 6 of algorithm 1, the extended population is sorted in descending order of the fitness values of the individuals. Afterwards, in step 7, surplus individuals are removed from the population until the new generation reaches the size of the initial population. GAPSSIF is repeated until a solution with a secondary structure identical to the target is found or until 50 iterations have been performed. The maximum number of iterations is set to 50 according to the length of the largest sub-structure in the benchmark.
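Steps 5 to 7 and the termination condition can be summarized in the following sketch. The names are hypothetical: individuals are assumed to be (sequence, fitness) tuples, `mutate` stands in for the random choice between AFRM and HMRM and returns a new scored individual, and a fitness of 100 is assumed to mean that the predicted structure is identical to the target.

```python
def run_ga(population, mutate, max_iterations=50, perfect=100.0):
    """Mutate, merge, sort by fitness and truncate until a perfect design or the iteration limit."""
    if not population:
        raise ValueError("population must not be empty")
    size = len(population)
    for iteration in range(1, max_iterations + 1):
        offspring = [mutate(individual) for individual in population]
        # Sort the extended population by fitness (descending) and keep the best `size`.
        merged = sorted(population + offspring, key=lambda ind: ind[1], reverse=True)
        population = merged[:size]
        if population[0][1] >= perfect:
            return population, iteration
    return population, max_iterations


def toy_mutate(individual):
    """Toy mutation for illustration: turn the first G into A and score by percentage of A."""
    sequence = individual[0].replace("G", "A", 1)
    return sequence, 100.0 * sequence.count("A") / len(sequence)


final, iterations = run_ga([("GGGG", 0.0), ("AGGG", 25.0)], toy_mutate)
print(final[0], iterations)
```

Because parents and offspring compete in the same sorted pool, the best individual found so far is never lost between generations.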

Results and discussion
GAPSSIF was implemented using Perl and all calculations were done on an Intel core i7-3770 processor (8M Cache, 3.40GHz) with 16GB RAM in 64bit Ubuntu Linux.

GAPSSIF evaluation on a non-redundant dataset
This subsection presents the evaluation of GAPSSIF on 89 proteins. Because of the heuristic nature of GAPSSIF, it was executed ten times for each PDB ID in Additional file 1: Table S1. The best designed sequence among these ten executions is the one with the highest accuracy (Q3) and the fewest iterations.
An investigation of Additional file 1: Table S1 shows the success of GAPSSIF in designing amino acid sequences for the target secondary structures. Column (a), PSD-Q3, gives the percentage of similarity between the target and the predicted secondary structure of the designed sequence [23]. Column (b), SOV, gives the segment overlap score, which is based on the average overlap between the reference and designed segments [23]. As shown by Q3 and SOV, the proposed algorithm designed appropriate sequences for 89 proteins with different lengths and folding classes. In 88 cases, the resulting sequences have secondary structures identical to the target structure. Even for 1NXM, only one residue has a non-identical secondary structure. Furthermore, column (c), which gives the iteration number of the algorithm, demonstrates the fast convergence of GAPSSIF. Even for 1NXM, the best sequence was designed within 7 iterations and did not change until the termination condition was met. Column (d) shows the execution time for making and enriching the initial population using AFR, and column (e) the execution time for searching the solution space using the genetic algorithm operations. Column (f) gives the total time of the proposed algorithm, that is, the sum of columns (d) and (e) plus one second for loading AFR. Generally, making the initial population takes O(l^2) time and each iteration also takes O(l^2) time, where l is the length of the target secondary structure. The space complexity of the algorithm is also O(l^2), which is needed to store the hit map repository and the individuals of each generation.
The values of column (g), NDC, give the normalized difference of amino acid compositions between the designed sequence $S = s_1 \ldots s_l$ and the reference sequence $R = r_1 \ldots r_l$. An NDC value of zero shows that the amino acid distribution of a designed sequence matches that of its reference. The rationale for NDC values greater than zero is the same as that behind the PAM or BLOSUM substitution matrices, namely that some amino acids are mutable to one another. In this regard, the low sequence and fragment identities in columns (h) and (i) not only counter the conjecture that the reference sequence is taken from AFR when designing a sequence for the corresponding structure, but also show the high diversity of the designed sequences. As marked by "#" in Additional file 1: Table S1, only five proteins have non-zero fragment identity. Sequence identity alone cannot validate the quality of designed sequences, since the PDB is considered complete from a structural perspective, not in terms of amino acid sequences, and many amino acid sequences can fold to one protein conformation. In addition, this high diversity could be useful for practical applications such as biological or chemical purposes. Meanwhile, the amino acid composition variance in column (j) demonstrates that the amino acid compositions of the designed proteins are typical of the folding class of the input scaffold [8]. The number of successful hits in column (k) shows that there is an appropriate designed sequence in almost all ten independent runs of the proposed algorithm. In column (l), the average value of 99.59 over all 890 designed sequences (ten sequences are designed by GAPSSIF for each of the 89 proteins) confirms the achievement of GAPSSIF in solving the PSSIF problem. The success of the proposed algorithm is largely due to the evolutionary information and the simulation of multi-structural fragments.
Column (m) indicates that although in some executions the predicted secondary structure of the designed sequence is not identical to the target structure, the algorithm still designs sequences with few incorrect residues. A zero value in this column indicates a successful design in all ten executions. To better show the combined effect of AFR and HMR, the predicted secondary structure accuracy of the reference sequences is given in the last column of Additional file 1: Table S1. In fact, the limitations imposed by prediction algorithms are intentionally used to enhance the performance of GAPSSIF. More specifically, for each secondary structure segment there are two possible repositories: the first contains nature-evolved sequences, and the second contains common fragments that are acceptable to both nature and the predictor. GAPSSIF thus uses a prediction algorithm not only as a fitness assessor to evaluate individuals but also as an active component in constructing amino acid sequences. Although the prediction accuracy for a reference sequence is limited (see column n) even with the best secondary structure predictors, this weakness can be turned into an advantage through the complementary collaboration of evolutionary and simulated data.
To cross-validate the evaluation procedure, the PDB IDs marked with "*" in Additional file 1: Table S1 were omitted while creating AFR. Eliminating 1Y25, 1V5I, 2WLV, 2ERB and 3FIL does not affect the good performance of GAPSSIF. Moreover, although 1NXM is present in AFR, it does not have any exact hit.

Secondary structure assessment of designed sequences
To assess the quality of the designed sequences, GAPSSIF is compared with the most recent protein design algorithms, Evolver [16,17] and EvoDesign [3,15]. In this analysis, five protein structures are taken from [15] to evaluate these algorithms. For each input structure, EvoDesign reports ten amino acid sequences in ten independent runs. Each run comprises a population of 29000 sequences and 30000 iterations. Evolver is executed on three different types of sequences for each protein of this benchmark, as mentioned in Background. In addition, GAPSSIF is run ten independent times on the benchmark. For each protein, the population size is defined by the number of sub-structures in the target structure, and the algorithm is repeated for up to 50 iterations.
EvoDesign uses a secondary structure predictor in its fitness function with results comparable to PSS-Pred [24], while GAPSSIF uses a development version of PHD [21] called Reprof. To have a fair comparison between GAPSSIF and EvoDesign, PSI-Pred [25] is used as an impartial secondary structure predictor. For this purpose, the prediction results of PSI-Pred, PSS-Pred and Reprof are compared on five proteins in Table 1. Since GAPSSIF uses Reprof in its fitness function, a better performance of GAPSSIF measured only on Reprof predictions would be questionable. Therefore, the secondary structures of the sequences designed by GAPSSIF, Evolver and EvoDesign are predicted by the PSS-Pred, PSI-Pred and Reprof predictors. Since Evolver does not use any prediction algorithm, the results of PSS-Pred and Reprof are sufficient to compare the accuracy of its designed sequences. Table 1 presents the secondary structure assessment of the three designers, GAPSSIF, EvoDesign and Evolver, such that each designed sequence of each protein is evaluated using three different secondary structure predictors. Columns (B) and (Ave) in Table 1 indicate, respectively, the maximum and the average Q3 among all independent runs. For each protein in Table 1, ten independent runs of EvoDesign and GAPSSIF as well as three different executions of Evolver were used. The comparison in this table supports the strong performance of the proposed method on the PSSIF problem.
Studying the PSSIF problem is a substantial step forward in solving protein design, since protein secondary structure represents a primary scaffold of the protein conformation. The successful solution of the PSSIF problem by GAPSSIF demonstrates that evolutionary information from naturally occurring proteins can lead the IPF problem to an efficient solution. Recent studies have demonstrated that the PDB has reached completeness [26][27][28], which means that there are few structures outside the PDB.

Tertiary structure assessment of designed sequences
In this subsection, the accuracy of the predicted tertiary structures of the designed sequences is evaluated using I-TASSER [29]. The ultimate goal of the IPF problem is to create proteins with enhanced properties or even novel functions. Since the structure of a protein determines its function, understanding the functional architecture enables us to study this macromolecule more practically. Thus, the sequences designed by GAPSSIF for the five structures taken from [15] are folded by I-TASSER [29], and the resulting tertiary structures are evaluated using TM-Score [30,31], Assigned SS and RMSD [32]. TM-Score is the structural alignment score obtained from TMalign [30], Assigned SS is the similarity between the target secondary structure and the secondary structure assigned by DSSP, and the Root Mean Square Deviation, RMSD, measures the root-mean-square distance between corresponding atoms of the superimposed proteins. In Table 2, a TM-Score greater than 0.3 indicates that the structural similarity is not random. Moreover, a TM-Score greater than 0.5 means that 1ZZKA, 1R26A, 1XTE, 3I4O and 2VOU are in the same folding class as the input scaffold, which indicates relative success in solving the IPF problem. The mean ± standard deviation of the TM-Score (0.77 ± 0.13) indicates that all of the predicted tertiary structures are in the same fold as their respective native structures. In addition, the mean ± standard deviation of the RMSD is 2.15 ± 0.79. Moreover, the average Q3 value of 79 % is acceptable, because finding appropriate templates strongly affects the precision of template-based algorithms such as I-TASSER, while the sequence identity of the designed sequences is low.
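For reference, the RMSD used in this assessment is the square root of the mean squared distance between corresponding atoms. The sketch below assumes that the two coordinate sets are already superimposed and listed in the same atom order; the function name `rmsd` is hypothetical, and the superposition itself (performed by tools such as TMalign) is not shown.

```python
import math


def rmsd(coords_a, coords_b):
    """Root mean square deviation between two equally long lists of (x, y, z) coordinates."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))


print(rmsd([(0, 0, 0), (1, 0, 0)], [(0, 0, 0), (1, 0, 2)]))
```

Lower values indicate closer agreement between the predicted and native structures.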
Despite the simplicity of its fitness function in comparison with EvoDesign and Evolver, the proposed method performs well in designing amino acid sequences. Evolutionary information in both GAPSSIF and EvoDesign can significantly affect the design of appropriate sequences for a target scaffold. While EvoDesign creates a position specific scoring matrix of divergent sequences taken from structures homologous to the target structure, GAPSSIF employs fragments of secondary sub-structures, which explicitly participate in building up the amino acid sequences. The procedure of assembling amino acid fragments according to the secondary sub-structures of the target generates protein-like sequences with high diversity. Although it does not explicitly consider structural features and uses a simple fitness function, GAPSSIF shows strong performance in solving the PSSIF problem and acceptable results for the IPF problem. Furthermore, evaluating GAPSSIF with homology-based folding algorithms is unfavorable to the designed sequences, because their low sequence identity affects the search for appropriate templates. In other words, evolutionary information improves the design process in this approach by implicitly imposing tertiary structure constraints that are implied by natural data.

Statistical assessment of designed sequences
In this subsection, two statistical tests are applied to confirm that designed sequences share common characteristics with reference sequences. For this, Pot statistic and Pearson's chi square tests are employed respectively to measure bunching and inconsistency of the observed amino acid distribution in a designed sequence.

Bunching assessment
One possible issue in designing artificial sequences is bunching, or grouping, of a particular amino acid according to the secondary structure state, e.g. β structures populated by isoleucine and valine. To exclude this possibility, a Pot statistic test [16,33] is employed to penalize the short-range bunching of a particular amino acid in sequence $S = s_1 \ldots s_l$. The resulting score, E pot , is computed for each amino acid j from the statistic pot j together with its mean and the corresponding standard deviation σ j , both calculated for a set of non-redundant PDB native sequences (Brylinski, personal communication). The statistic pot j is computed from O j , the frequency of amino acid j in sequence S, and $r = e^{-O_j / l}$.
For each protein in Additional file 1: Table S1, the E pot values of the designed, reference, bunched and random protein-like sequences are given in Table 3. The E pot scores of the reference sequences are assessed to demonstrate that the bunching of the designed sequences is typical of native protein sequences. Moreover, to put the results in context, the maximum and minimum bunching are assessed by calculating the E pot score for bunched and random protein-like sequences, respectively. To obtain the maximum bunching, the reference sequences are bunched, e.g. the bunched sequence of DCADCDA is AACCDDD. To obtain the minimum bunching, the reference sequences are shuffled to generate random protein-like sequences.
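The two control sequences are straightforward to produce. In this sketch the function names are hypothetical; `bunched` groups identical residues, using alphabetical order as in the DCADCDA example, and `shuffled` permutes the residues at random so that the composition is preserved.

```python
import random


def bunched(sequence):
    """Group identical residues together (maximum bunching)."""
    return "".join(sorted(sequence))


def shuffled(sequence, rng):
    """Randomly permute the residues, preserving the amino acid composition."""
    residues = list(sequence)
    rng.shuffle(residues)
    return "".join(residues)


print(bunched("DCADCDA"))
print(shuffled("DCADCDA", random.Random(7)))
```

Both controls have exactly the same amino acid composition as the reference sequence, so differences in E pot reflect residue order only.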
Finally, the mean, standard deviation, median, quartile 1 and quartile 3 in Table 3 indicate that the amino acid bunching of the designed sequences is typical of the reference and random protein-like sequences and much lower than that of the bunched sequences.

Pearson's chi-square assessment
Pearson's chi-square test [34] is applied to sets of categorical data to determine whether there is any significant difference between the background (Uniprot [35]) and observed distributions of amino acids in a protein sequence. For each protein i with length l, we define a random l-length sequence according to the amino acid distribution in Uniprot. The chi-square statistic is then calculated for the designed, reference and Uniprot-distributed sequences versus the background:

$$\chi^2 = \sum_{j=1}^{20} \frac{(O_j - E_j)^2}{E_j}$$

where O j and E j are the frequency of amino acid j in a protein sequence and in the Uniprot database, respectively. Table 3 gives the chi-square values obtained for the designed, reference and Uniprot-distributed sequences of each protein. The mean, standard deviation, median, quartile 1 and quartile 3 indicate that the distribution of the designed sequences versus the background is as significant as that of the reference sequences.
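A minimal sketch of this statistic follows. The function name `chi_square` is hypothetical, and one assumption is made explicit: the background is supplied as relative frequencies and scaled by the sequence length l to obtain expected counts E j . The background in the usage example is a toy two-letter distribution, not the Uniprot composition.

```python
from collections import Counter


def chi_square(sequence, background):
    """Pearson chi-square of observed residue counts against a background distribution.

    `background` maps each residue to its relative frequency; expected counts are
    assumed to be frequency * len(sequence).
    """
    length = len(sequence)
    if length == 0:
        raise ValueError("sequence must not be empty")
    observed = Counter(sequence)
    statistic = 0.0
    for residue, frequency in background.items():
        if frequency <= 0:
            continue  # residues with zero expected count are skipped
        expected = frequency * length
        statistic += (observed.get(residue, 0) - expected) ** 2 / expected
    return statistic


# Toy background for illustration only.
print(chi_square("AAAG", {"A": 0.5, "G": 0.5}))
```

A value of zero means the observed composition matches the background exactly, and larger values indicate stronger deviation.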

Conclusion
The GAPSSIF algorithm successfully designs sequences for its input secondary structure scaffold. Interestingly, the acceptable results for 3D structure, obtained in the absence of crucial tertiary structure features, arise from the effect of evolutionary information. On the other hand, taking into account additional important features such as solvent accessibility and torsion angles can significantly enhance the tertiary structure results.
Using the evolutionary information from proteins with known structures significantly improves the quality of designed sequences. In fact, the IPF problem could be addressed by applying this information to both 2D and 3D structures. To obtain better results in 3D, effective features such as solvent accessibility and torsion angles should be considered. Therefore, the simple fitness function of GAPSSIF could be improved by a multi-featured one that searches the sequence space more precisely.

Acknowledgement
Thanks to Javad Rezaei and Somaye Khaleghi for their contribution in testing the dataset. Special thanks to Michal Brylinski for his comments in statistical assessments.

Funding
No funding was obtained for this study.

Table 3 notes: (a) The Pot statistic test penalizes short-range bunching of amino acids. The E pot values of reference and protein-like sequences give the minimal bunching. On the other hand, the maximal bunching is obtained from bunched sequences. The E pot values of designed sequences confirm that their bunching is typical of the native sequences. (b) The chi-square test is applied to determine whether there is any significant difference between two sets of categorical data. The χ² values indicate that the distribution of designed sequences versus the Uniprot database is as significant as that of the reference sequences.