Volume 10 Supplement 6
Massive non-natural proteins structure prediction using grid technologies
© Minervini et al; licensee BioMed Central Ltd. 2009
Published: 16 June 2009
The number of natural proteins represents a small fraction of all the possible protein sequences and there is an enormous number of proteins never sampled by nature, the so called "never born proteins" (NBPs). A fundamental question in this regard is if the ensemble of natural proteins possesses peculiar chemical and physical properties or if it is just the product of contingency coupled to functional selection. A key feature of natural proteins is their ability to form a well defined three-dimensional structure. Thus, the structural study of NBPs can help to understand if natural protein sequences were selected for their peculiar properties or if they are just one of the possible stable and functional ensembles.
The structural characterization of a huge number of random proteins cannot be approached experimentally, thus the problem has been tackled using a computational approach. A large library of random protein sequences (2 × 10⁴ sequences) was generated, discarding amino acid sequences with significant similarity to natural proteins, and the corresponding structures were predicted using Rosetta. Given the high computational demand of the problem, Rosetta was ported to the grid and a user-friendly job submission environment was developed within the GENIUS Grid Portal. The protein structures generated were analysed in terms of net charge, secondary structure content, surface/volume ratio, hydrophobic core composition, etc.
The vast majority of NBPs, according to the Rosetta model, are characterized by a compact three-dimensional structure with a high secondary structure content. Structure compactness and surface polarity are comparable to those of natural proteins, suggesting similar stability and solubility. Deviations are observed in α helix-β strands relative content and in hydrophobic core composition, as NBPs appear to be richer in helical structure and aromatic amino acids with respect to natural proteins.
The results obtained suggest that the ability to form a compact, ordered and water-soluble structure is an intrinsic property of polypeptides. The tendency of random sequences to adopt α helical folds indicates that all-α proteins may have emerged early in pre-biotic evolution. Further, the lower percentage of aromatic residues observed in natural proteins has important evolutionary implications as far as tolerance to mutations is concerned.
A fundamental question in protein science is whether the known natural proteins are just one of the many possible ensembles of stable and functional polypeptides or the only possible solution found by molecular evolution. In other words, is it possible to imagine many biochemical "parallel dimensions", or is the one we know the only possible one? This question has many implications in terms of our knowledge of the principles underlying protein sequence/structure/function relationships, and of our ability to modify existing proteins, or design novel proteins, for biotechnological and biomedical purposes. In fact, the number of known natural protein sequences, though quite large, is vanishingly small compared to the number of proteins theoretically possible with the twenty natural amino acids. Thus, there exists a huge number of protein sequences which have never been exploited by living organisms, named "never born proteins" (NBPs) by Luisi and coworkers. Just to give an example, the latest release of UniProtKB/Swiss-Prot (56.2 of 23 September 2008) contains approx. 400 thousand sequence entries, many of which are evolutionarily related. On the other hand, 20¹⁰⁰ (roughly 10¹³⁰) chemically different proteins can in principle be obtained with the 20 natural amino acids considering random polypeptides of only 100 amino acids in length (the average length of natural proteins being 360 amino acids). In this regard, a key issue is whether natural protein sequences were selected during molecular evolution because they have unique physico-chemical properties or whether they just represent a contingent subset of all the possible proteins with a stable and well defined fold. If the latter hypothesis were true, this would mean that the protein realm could be exploited to search for novel folds and functions of potential biotechnological and/or biomedical interest.
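To make the scale of this comparison concrete, the short Python sketch below counts the sequences of length 100 that can be written with a 20-letter alphabet (20 raised to the power 100) and compares that number, on a log scale, with the approximate Swiss-Prot entry count quoted above. The constant and function names are illustrative choices, not part of the original work.

```python
import math

ALPHABET_SIZE = 20            # natural amino acids
SEQUENCE_LENGTH = 100         # polypeptide length considered in the text
SWISSPROT_ENTRIES = 400_000   # approximate figure quoted in the text


def sequence_space_size(alphabet_size: int, length: int) -> int:
    """Number of distinct sequences of the given length over the alphabet."""
    return alphabet_size ** length


total = sequence_space_size(ALPHABET_SIZE, SEQUENCE_LENGTH)

# Work in log10 units: the raw numbers are too large to compare by eye.
log10_total = SEQUENCE_LENGTH * math.log10(ALPHABET_SIZE)
log10_fraction = math.log10(SWISSPROT_ENTRIES) - log10_total

print(f"possible sequences: about 10^{log10_total:.1f}")
print(f"fraction covered by known entries: about 10^{log10_fraction:.1f}")
```

Even if every database entry were a distinct 100-residue sequence, the sampled fraction of the space would remain negligible, which is the point made in the paragraph above.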
Such a problem cannot be easily tackled experimentally, as this would require the production and structural characterization of a huge number of random polypeptides. Attempts have been made in this direction; however, an alternative is to adopt a computational approach which, though it yields only predictive results, allows a much larger sequence space to be sampled. In addition, a computational approach allows the protein sequence space to be sampled evenly in different regions sufficiently far from the ensemble of natural proteins.
In this work, we describe the study of the structural properties of a large library of random protein sequences with no significant similarity to natural proteins by means of the well known ab initio protein structure prediction software Rosetta abinitio. Rosetta abinitio has consistently been shown to yield accurate, and in some cases near-atomic resolution, predictions of protein structures even in the absence of evolutionary information, which makes it a suitable tool to address the problem of NBPs structure prediction.
Results obtained indicate that most of the NBPs are characterized by three-dimensional structures comparable to those of natural proteins in terms of compactness and surface polarity. However, α helix content and aromatic/aliphatic residues ratio are significantly higher in NBPs as compared to natural proteins of comparable length. The evolutionary implications of these results are discussed.
Amino acid sequences library generation
Random amino acid sequences (70 amino acids long) were generated using the utility RandomBLAST, whose implementation has been described in detail elsewhere. Briefly, RandomBLAST consists of two main modules: a pseudo-random sequence generation module and a BLAST software interface module. The first module uses the Mersenne Twister 19937 pseudo-random number generation algorithm to generate pseudo-random numbers between 0 and 19, which are translated into single-character amino acid codes and then concatenated to reach the desired sequence length. Each generated sequence is then given as input to the second RandomBLAST module, an interface to the blastall program which invokes the following command:
blastall -m 8 -p blastp -d database -b 1;
where database in our case stands for the NR database (the National Center for Biotechnology Information non-redundant protein sequence database), and the parameters -m 8 and -b 1 indicate the alignment format (tabular form) and the number of sequences to be returned (just the first hit), respectively. The blastall output is then retrieved by RandomBLAST and the E-value is extracted from it. If the E-value is greater than or equal to the threshold chosen by the user, the sequence is added to the output file. Note that in this case only the sequences that do not display significant similarity to any protein sequence present in the database are considered valid, so that, contrary to normal BLAST usage, valid sequences are those displaying an E-value higher than the threshold, here set to a value of 1. The total number of NBPs sequences generated was 20496.
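The generate-and-filter logic described above can be sketched in Python as follows. This is not the RandomBLAST code: it is a minimal illustration that assumes the standard 12-column BLAST tabular layout (E-value in the eleventh column) and treats a search with no hits as "no similarity". The ordering of the amino acid alphabet, the function names and the sample hit line are hypothetical. Python's `random` module is itself based on the Mersenne Twister generator.

```python
import random

# Twenty one-letter amino acid codes; the index-to-letter order is an assumption.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def random_sequence(length: int, rng: random.Random) -> str:
    """Draw integers in 0..19 and translate each into a one-letter code."""
    return "".join(AMINO_ACIDS[rng.randrange(20)] for _ in range(length))


def best_hit_evalue(tabular_output: str):
    """Return the E-value of the first hit in tabular output, or None if no hit."""
    for line in tabular_output.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank and comment lines
        fields = line.split("\t")
        return float(fields[10])  # eleventh column holds the E-value
    return None


def is_never_born(evalue, threshold: float = 1.0) -> bool:
    """Keep a sequence when its best hit is NOT significant (E-value >= threshold)."""
    return evalue is None or evalue >= threshold


# Made-up tabular line for illustration only (identifiers are not real).
sample_hit = "query_1\tsubject_1\t35.00\t40\t26\t0\t5\t44\t101\t140\t2.3\t28.5"

rng = random.Random(2009)  # fixed seed so the sketch is reproducible
sequence = random_sequence(70, rng)
keep = is_never_born(best_hit_evalue(sample_hit))
```

In a real pipeline the tabular text would come from running the search on each generated sequence; here it is passed in as a string so that the filter can be read and tested in isolation.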
NBPs three-dimensional structure prediction
NBPs three-dimensional structures have been predicted using Rosetta abinitio, an ab initio protein structure prediction software based on the assumption that in a polypeptide chain local interactions bias the conformation of sequence fragments, while global interactions determine the three-dimensional structure with minimal energy which is also compatible with the local biases [4, 10]. To derive the local sequence-structure relationships for a given amino acid sequence (the query sequence), Rosetta abinitio uses the Protein Data Bank to extract the distribution of conformations adopted by short segments in known structures. The latter is taken as an approximation of the distribution adopted by the query sequence segments during the folding process.
Given the size of the amino acid sequence dataset under study, a large amount of computational resources was needed to accomplish the task of structure prediction. Thus, Rosetta abinitio has been deployed on the EUChinaGRID grid infrastructure and a user-friendly job submission environment was developed within the GENIUS Grid Portal [12–14]. A detailed description of the porting of Rosetta abinitio to the grid can be found elsewhere. Briefly, the execution of the application on the grid was first tested using a shell script which registers the program executable (pFold.lnx) and the required input files on the grid file catalogue (LFC catalogue), calls the Rosetta abinitio executable and proceeds with workflow execution. A JDL (Job Description Language) file was also created to run the application on the grid worker nodes, which use the gLite middleware [12, 15].
Once the correct execution of the program on the grid was verified, a user-friendly interface was developed within GENIUS to allow users with limited knowledge of the grid middleware to submit a large number of Rosetta abinitio predictions to the grid, monitor their execution and download their output.
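As an illustration of what a job description for one prediction might look like, the Python sketch below renders a minimal JDL text using common gLite JDL attributes (Executable, Arguments, StdOutput, StdError, InputSandbox, OutputSandbox). It is a simplified assumption, not the JDL file used in this work: the wrapper script name, the file names and the use of sandboxes instead of the LFC catalogue are all hypothetical.

```python
def render_jdl(sequence_id: str, wrapper: str = "run_rosetta.sh") -> str:
    """Render a minimal JDL description for one structure prediction job.

    The wrapper script and file naming scheme are hypothetical placeholders.
    """
    output_files = ["std.out", "std.err", f"{sequence_id}_models.tar.gz"]
    quoted_outputs = ", ".join(f'"{name}"' for name in output_files)
    lines = [
        f'Executable = "{wrapper}";',
        f'Arguments = "{sequence_id}";',
        'StdOutput = "std.out";',
        'StdError = "std.err";',
        # Files shipped with the job to the worker node.
        f'InputSandbox = {{"{wrapper}", "{sequence_id}.fasta"}};',
        # Files returned to the user when the job completes.
        f"OutputSandbox = {{{quoted_outputs}}};",
    ]
    return "\n".join(lines) + "\n"


jdl_text = render_jdl("nbp_00001")
print(jdl_text)
```

Generating one such description per sequence is one simple way to submit thousands of independent predictions; a portal such as GENIUS hides this step behind a web interface.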
Three-dimensional structures analysis
The analysis of the physico-chemical properties of the predicted protein structures was carried out using a collection of different tools. For each tool the analysis procedure was automated using ad hoc Perl scripts. In detail, the programs used were MSMS, for molecular volume calculation; SURFace Algorithms, for surface properties analysis (overall molecular surface, per residue solvent accessibility); Freqaa, for amino acid composition analysis; and DSSP, for secondary structure content analysis. Surface hydrophobicity was calculated as the ratio between the solvent exposed surface of hydrophobic amino acids and the total solvent exposed surface, both calculated using SURFace Algorithms.
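The surface hydrophobicity ratio defined above can be written as a short Python function operating on per-residue solvent exposed areas. The set of residues counted as hydrophobic is an assumption made for this sketch (the text does not list it), and the input values in the example are made up.

```python
# Residues counted as hydrophobic: an assumption for illustration only.
HYDROPHOBIC = {"ALA", "VAL", "LEU", "ILE", "MET", "PHE", "TRP", "PRO"}


def surface_hydrophobicity(residue_areas) -> float:
    """Ratio of hydrophobic solvent exposed surface to total exposed surface.

    residue_areas: iterable of (three_letter_residue_name, exposed_area) pairs.
    """
    total_area = 0.0
    hydrophobic_area = 0.0
    for name, area in residue_areas:
        total_area += area
        if name.upper() in HYDROPHOBIC:
            hydrophobic_area += area
    if total_area == 0.0:
        raise ValueError("total solvent exposed surface is zero")
    return hydrophobic_area / total_area


# Toy input: 50 of 200 area units belong to hydrophobic residues.
example = [("LEU", 30.0), ("LYS", 90.0), ("PHE", 20.0), ("ASP", 60.0)]
ratio = surface_hydrophobicity(example)  # 0.25
```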
To compare the properties of NBPs to those of natural protein structures, a dataset of natural proteins of length comparable to that of NBPs (sequences 55 to 95 amino acids long, as compared to the 70 amino acid long NBPs) was derived from the Protein Data Bank. The dataset was cleaned up by removing protein fragments and proteins whose fold is determined by the formation of macromolecular complexes. The final natural proteins dataset comprised 866 proteins.
Statistical analysis of the data
A first exploratory data analysis was carried out to see whether there were any significant differences in the structures observed in the two datasets. Initially, a few outliers that could affect these analyses were removed, generating a dataset of 18465 NBPs and a dataset of 839 natural proteins. Outliers were defined, based on the probability distributions estimated from the data, as the values with a probability of occurrence smaller than 0.005. They were removed at this stage so that patterns in the data could be detected clearly and promptly, and their presence was not relevant to the exploratory analysis; they were, however, included in the subsequent study with datasets of comparable size. For these sets, measures of location, indices of dispersion and correlation matrices were derived, and box-plots and scatterplots were built to compare the two datasets. This study was performed on different structure-related variables, which include: volume, surface, surface/volume ratio, net charge, secondary structure content, and surface hydrophobicity. Tests of the Gaussian distribution of the variables led to rejection of the hypothesis of Gaussianity for the majority of the variables investigated. At a significance level of 0.05, almost all the variables have statistically different mean and variance in the two datasets. The analysis was also conducted on smaller datasets of comparable size: a random sample of 1000 observations was drawn from the NBPs dataset and the comparisons were repeated. The two analyses led to similar conclusions, presented in the following section. The statistical software used to analyse the data was R.
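The analysis itself was performed in R. Purely as an illustration of the subsampling and summary steps described above, a Python sketch using only the standard library could look as follows. The function names and the fixed seed are assumptions, and the sketch covers neither the outlier removal nor the hypothesis tests, which are not specified in enough detail to reproduce.

```python
import random
import statistics


def comparable_subsample(values, size: int = 1000, seed: int = 0):
    """Draw a random sample without replacement, e.g. 1000 NBPs observations."""
    rng = random.Random(seed)
    return rng.sample(list(values), size)


def summarize(values):
    """Location and dispersion measures of the kind shown in a box-plot."""
    q1, median, q3 = statistics.quantiles(values, n=4)
    return {
        "mean": statistics.fmean(values),
        "stdev": statistics.stdev(values),
        "q1": q1,
        "median": median,
        "q3": q3,
    }
```

Applying `summarize` to one structure-related variable (for example the surface/volume ratio) in the natural proteins dataset and in a `comparable_subsample` of the NBPs dataset gives the two sets of summary values that a box-plot comparison would display side by side.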
Amino acid composition analysis
Mean and standard deviation (%) values of amino acid relative content in the NBPs dataset.
Hydrophobic amino acids relative content of natural proteins and NBPs.
Massive proteins structure prediction environment
Comparative structural analysis
Average values of the structure-related parameters calculated for natural proteins and NBPs. Parameters: surface (Å²), volume (Å³), % α helix, % β strand, % β turn.
As a general consideration, the average value of the analysed structural parameters and the corresponding standard deviation values are statistically different between NBPs and natural proteins with a significance level of 0.05. In particular natural proteins are characterised by a higher standard deviation whereas NBPs seem to be narrowly distributed around the experimental average.
Meaningful interpretation of the results described in the present work relies heavily on the validity of the structure predictions obtained using Rosetta. However, the Rosetta model has been shown to perform fairly well and even yield near-atomic resolution structures in a number of cases. Results shown in Figure 4 for a sample of natural proteins confirm that Rosetta predictions are in most cases fairly accurate in terms of overall fold, secondary structure content and topology. In some cases the agreement between the experimental and predicted structures is even surprising, as is the case of the predicted structure of the protein NusA (indicated in Figure 4 with the PDB code 1UL9), which displays an overall backbone r.m.s.d. of only 1.74 Å with respect to the experimentally determined structure.
Analysis of the structural properties of the predicted NBPs structures yielded several interesting and in some cases counterintuitive results. In fact one would expect that in a large population of random amino acid sequences, a large proportion would be "unfoldable" and thus unstructured. Given the assumption of the Rosetta model, our results indicate that this is not the case. Indeed most of the NBPs structures are compact and well ordered, as indicated by the average surface/volume ratio and secondary structure content (Figure 5 and Table 3). Surface polarity is similar to that of natural proteins (Figure 6) suggesting that water solubility is an intrinsic property of random polypeptides. The main differences observed between NBPs and natural proteins are the lower compactness and higher α helix content of NBPs.
The lower compactness observed for NBPs is probably related to their significantly higher aromatic/aliphatic residues ratio with respect to natural proteins (Table 2). In fact, a higher proportion of aromatic residues in NBPs results in a hydrophobic core composition more prone to packing "defects", given the rigid character of aromatic side chains with respect to branched aliphatic residues such as Leu. Indeed, Leu is largely overrepresented in natural proteins while the opposite is observed for aromatic residues (Figure 1).
The latter finding has important evolutionary implications. In fact a hydrophobic core made up of branched aliphatic amino acids is probably more tolerant to mutations in that residue substitutions are more easily accommodated by conformational changes of the flexible aliphatic side chains.
Regarding secondary structure content, NBPs display a higher α helix content with respect to natural proteins and a very low β strand content (Figure 5 and Table 3). This could be related to the local nature of the interactions within the α helix. In fact, a helical fold can accommodate random sequences by packing together α helical elements interrupted by loops in which poor helix-forming residues are located. This is much more difficult in β sheets, in which precise pairing of β strands, far away from each other along the amino acid sequence, is required to form a stable structure. From this point of view it can be hypothesized that helical folds are more tolerant to random amino acid sequences. This is a fascinating hypothesis that would be very interesting to test experimentally. In fact, in a pre-biotic scenario, in which the first polypeptides were probably characterized by random amino acid sequences, the α helix could have emerged early as an intrinsic structural property of polypeptides.
Results reported in this work highlight how the computational study of "never born proteins", though predictive in nature, can give a useful insight on the basic structural properties of polypeptides and on the specific properties of natural proteins. NBPs appear to be structurally very similar to natural proteins, suggesting that the enormous sequence space of NBPs could indeed be exploited for biotechnological purposes. An important difference between NBPs and natural proteins resides in the different aromatic/aliphatic amino acids content, and in particular in the lower content of aromatic amino acids observed in natural proteins. This information can be very useful in the design of directed evolution and protein engineering studies.
Finally, this study demonstrates that exploitation of grid infrastructures for massive structure prediction projects is feasible; possible applications include genome-wide protein structure prediction of bacterial pathogens for target selection and drug design studies.
This work has been supported by a European Commission grant to the project "EUChinaGRID: Interconnection and Interoperability of grids between Europe and China" (contract number: 026634).
This article has been published as part of BMC Bioinformatics Volume 10 Supplement 6, 2009: European Molecular Biology Network (EMBnet) Conference 2008: 20th Anniversary Celebration. Leading applications and technologies in bioinformatics. The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/10?issue=S6.
- Chiarabelli C, Vrijbloed JW, De Lucrezia D, Thomas RM, Stano P, Polticelli F, Ottone T, Papa E, Luisi PL: Investigation of de novo totally random biosequences, Part II: On the folding frequency in a totally random library of de novo proteins obtained by phage display. Chem Biodivers 2006, 3: 840–859.
- Bairoch A, Boeckmann B, Ferro S, Gasteiger E: Swiss-Prot: Juggling between evolution and stability. Brief Bioinform 2004, 5: 39–55.
- Rohl CA, Strauss CE, Misura KM, Baker D: Protein structure prediction using Rosetta. Methods Enzymol 2004, 383: 66–93.
- Bradley P, Malmström L, Qian B, Schonbrun J, Chivian D, Kim DE, Meiler J, Misura KM, Baker D: Free modeling with Rosetta in CASP6. Proteins 2005, 61(Suppl 7): 128–134.
- Evangelista G, Minervini G, Luisi PL, Polticelli F: RandomBlast a tool to generate random "never born protein" sequences. Bio-Algorithms and Med-Systems 2007, 3: 27–31.
- Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. J Mol Biol 1990, 215: 403–410.
- Matsumoto M, Nishimura T: Mersenne Twister: A 623-dimensionally equidistributed uniform pseudo-random number generator. ACM Transactions on Modeling and Computer Simulation 1998, 8: 3–30.
- Wheeler DL, Barrett T, Benson DA, Bryant SH, Canese K, Church DM, Dicuccio M, Edgar R, Federhen S, Feolo M, Geer LY, Helmberg W, Kapustin Y, Khovayko O, Landsman D, Lipman DJ, Madden TL, Maglott DR, Miller V, Ostell J, Pruitt KD, Schuler GD, Shumway M, Sequeira E, Sherry ST, Sirotkin K, Souvorov A, Starchenko G, Tatusov RL, Tatusova TA, Wagner L, Yaschenko E: Database resources of the National Center for Biotechnology Information. Nucleic Acids Res 2005, 33: D39–D45.
- Karlin S, Altschul SF: Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes. Proc Natl Acad Sci USA 1990, 87: 2264–2268.
- Rohl CA, Strauss CE, Misura KM, Baker D: Protein structure prediction using Rosetta. Methods Enzymol 2004, 383: 66–93.
- Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE: The Protein Data Bank. Nucleic Acids Res 2000, 28: 235–242.
- Minervini G, La Rocca G, Luisi PL, Polticelli F: High throughput protein structure prediction in a grid environment. Bio-Algorithms and Med-Systems 2007, 3: 39–43.
- Grid Enabled web eNvironment for site Independent User job Submission (GENIUS) [https://genius.ct.infn.it]
- The EUChinaGRID Project [http://www.euchinagrid.eu]
- gLite middleware [http://glite.web.cern.ch/glite/]
- Sanner MF, Olson AJ, Spehner JC: Reduced Surface: an efficient way to compute molecule surfaces. Biopolymers 1996, 38: 305–320.
- Sridharan S, Nicholls A, Honig B: A new vertex algorithm to calculate solvent accessible surface areas. Biophys J 1992, 61: A174.
- Tekaia F, Yeramian E, Dujon B: Amino acid composition of genomes, lifestyles of organisms, and evolutionary trends: a global picture with correspondence analysis. Gene 2002, 297: 51–60.
- Kabsch W, Sander C: Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers 1983, 22: 2577–2637.
- R Development Core Team: R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria; [http://www.R-project.org] ISBN 3-900051-07-0.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.