Species-specific protein sequence and fold optimizations

Background: An organism's ability to adapt to its particular environmental niche is of fundamental importance to its survival and proliferation. In the largest study of its kind, we sought to identify and exploit the amino-acid signatures that make species-specific protein adaptation possible across 100 complete genomes. Results: Correspondence analysis of the amino acid composition of over 360,000 predicted open reading frames (ORFs) from 17 archaea, 76 bacteria and 7 eukaryote complete genomes showed environmental niche to be a significant factor in compositional variability. Additionally, amino acid composition clustering revealed clusters of phylogenetically unrelated archaea and bacteria that share similar environments. Composition analyses of conservative, domain-based homology models suggested an enrichment of small hydrophobic residues Ala, Gly, Val and charged residues Asp, Glu, His and Arg across all genomes. In contrast, larger aromatic residues Phe, Trp and Tyr are reduced in folds, and these results were not affected by low complexity biases. We derived two simple log-odds scoring functions from ORFs (CG) and folds (CF) for each of the complete genomes. CF achieved an average cross-validation success rate of 85 ± 8% whereas the CG detected 73 ± 9% species-specific sequences when competing against all other non-redundant CG. Continuously updated results are available at . Conclusion: Our analysis of amino acid compositions from the complete genomes provides stronger evidence for species-specific and environmental residue preferences in genomic sequences as well as in folds. Scoring functions derived from this work will be useful in future protein engineering experiments and possibly in identifying horizontal transfer events.


Background
An organism may increase its fitness in some range of environmental conditions through evolution. Fundamental to the survival of cells is the ability to modulate fluctuations in external osmotic and atmospheric pressure, temperature and pH via the acquisition or development of advantageous molecular mechanisms [1][2][3][4]. These mechanisms include the uptake of small molecules, osmolytes or metals via transporters, as found for increased iron uptake allowing enhanced growth of Pasteurella multocida [5] and in the accumulation of high concentrations of the stabilizing K+ among halophiles [6]. Other mechanisms include modification of the atomic [7] and residue [8] composition of proteins, or the acquisition of environmental adaptive genes via lateral gene transfer, as was likely the case for the thermophilic bacterium Thermotoga maritima [9] and the archaeon Sulfolobus solfataricus P2 [10]. In other cases, gene duplication events augment the ability of an organism to adapt to extreme environments by expanding specific protein families, including additional stress response and damage control genes that provide increased protection for the radiation-resistant bacterium Deinococcus radiodurans [11,12]. Interestingly, in symbionts such as Buchnera sp. APS [13], Agrobacterium tumefaciens [14] and Sinorhizobium meliloti [15], shared genetic material may increase overall fitness, but this effectively results in the loss of redundant genes and imposes host-symbiont dependencies. In other organisms completely new and innovative mechanisms are required for adapting to the most extreme of environments.
In adaptation to the most extreme environments, it is expected that the protein complement also possesses the organism's adaptive property [6]. For instance, hyperthermophilic proteins must not only be functional, but optimized towards the host's extremely hot (>80°C) physical environment. Although in vivo protection factors have been identified that can stabilize proteins in vitro at high temperatures [1] and chaperone proteins can help refold misfolded proteins and prevent aggregation [16][17][18], the majority of foreign proteins cloned and expressed in E. coli retain all of the native enzyme's biochemical properties, including proper folding, thermostability and optimal activity consistent with the organism's optimal growth temperatures [19][20][21]. Thus, it is likely that sequence optimizations are required to ensure protein activity and folding in organisms whose growth conditions might otherwise adversely affect proteins.
Researchers have studied complete or partial genomes using bioinformatics in addition to the traditional comparative sequence-structure and structure-function mutation studies to identify stability factors. Recent studies of complete or partial genomes have identified sequence-based correlations between organisms using amino acid compositions. Lobry demonstrated the correlation between G+C content and codon usage across bacterial sequences [22], and correlations between G+C content and amino acid composition have been extended to 25 complete genomes [8]. Moreover, codon usage and amino acid preferences for thermophiles are well established and have been extended to complete genomes [23][24][25][26]. However, these generalizations do not necessarily agree with comparative sequence-structure studies. Comparative studies often exploit sequence- or structure-based alignments to determine similarities and differences. An investigation of thermostability factors in triosephosphate isomerase across 10 organisms, ranging from psychrophiles (cold-tolerant) and mesophiles to hyperthermophiles, failed to identify significant correlations of composition with thermostability [27]. Further uncertainty arises from indications that different protein families adapt to temperature conditions by different sets of structural mechanisms [28]. How then to unify amino acid composition preferences with species-specific structural adaptations?
Algorithms have been designed to predict certain protein features primarily from sequence composition including low complexity regions [29], transmembrane segments [30], signal peptides [31], coiled-coils [32], secondary structure elements [33], structural classes [34], hydrophobicity [35], sub-cellular location [36] and have been used to increase remote sequence similarity searching [37,38]. Moreover, genomic base content has been used to predict open-reading frames and in-site splicing [39][40][41]. However, no algorithms have been designed to explore adaptation of proteins to their host environment, especially in a species-specific manner.
Species-specific adaptive optimizations might be expected to be subtle and hard to find in any individual sequence, yet sufficiently common across the bulk of genomic proteins that they may be detected using statistical methods. We demonstrate here that such subtle adaptive optimizations do exist in many individual organisms and that these can be extracted. We derive species-specific protein sequence and fold scoring functions from residue preferences found in predicted open reading frames and conservative structural models. The resulting scoring functions are effective at species-specific detection of protein sequences and folds from amino acid composition alone.

Principal Components Analysis
Principal Components Analysis (PCA) was performed with the amino acid compositions of the entire set of protein coding regions from each of the complete genomes (Figure 1). PCA transforms a number of (possibly) correlated variables into uncorrelated variables called principal components that account for the variance in the dataset (see [http://www.statsoftinc.com/textbook/stfacan.html] for a brief overview). The analysis relates the original variables to the principal components through factor loadings, which can be interpreted as correlation coefficients (Figure 1B, 1D). Factor loadings of ≥ 0.6 are considered to be strong correlations. Simultaneously, the correspondence of the mean genome amino acid compositions to the principal components may be examined to identify genomic usage or preferences that appear to correlate with the factor loadings (Figure 1A, 1C).
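As an illustration of how factor loadings relate variables to a principal component, the following minimal sketch computes the first principal component of a correlation matrix by power iteration and scales the eigenvector by the square root of its eigenvalue. It is not the S-PLUS procedure used in this work; the function names and the toy three-residue compositions are hypothetical, and every column is assumed to have non-zero variance.

```python
import math

def correlation_matrix(rows):
    """Pearson correlation matrix of the columns of `rows` (one row per genome)."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    sds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / (n - 1))
           for j in range(p)]
    z = [[(r[j] - means[j]) / sds[j] for j in range(p)] for r in rows]
    return [[sum(zr[a] * zr[b] for zr in z) / (n - 1) for b in range(p)]
            for a in range(p)]

def first_component(matrix, iterations=500):
    """Leading eigenvalue and eigenvector of a symmetric matrix (power iteration)."""
    p = len(matrix)
    vec = [float(j + 1) for j in range(p)]  # asymmetric start vector
    for _ in range(iterations):
        nxt = [sum(matrix[a][b] * vec[b] for b in range(p)) for a in range(p)]
        norm = math.sqrt(sum(x * x for x in nxt))
        vec = [x / norm for x in nxt]
    eigenvalue = sum(vec[a] * sum(matrix[a][b] * vec[b] for b in range(p))
                     for a in range(p))
    return eigenvalue, vec

def factor_loadings(rows):
    """Loadings of each variable on component 1 and the fraction of variance explained."""
    corr = correlation_matrix(rows)
    eigenvalue, vec = first_component(corr)
    loadings = [v * math.sqrt(eigenvalue) for v in vec]
    return loadings, eigenvalue / len(corr)

# Hypothetical mean % compositions (Ala, Ile, Lys) for four toy genomes.
toy = [[12.0, 4.0, 3.0], [10.0, 5.0, 4.0], [6.0, 8.0, 8.0], [4.0, 9.0, 10.0]]
loadings, explained = factor_loadings(toy)
strong = [abs(x) >= 0.6 for x in loadings]  # the 0.6 threshold from the text
```

In this toy case Ala loads on the first component with the opposite sign to Ile and Lys, mirroring the kind of GC-driven trade-off discussed for the real genomes. The sign of a component is arbitrary, so only relative signs are meaningful.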

Figure 1
Principal Components Analysis. Plots of principal components 1, 2 (A, B) and 3, 4 (C, D) obtained from the amino acid composition of all their predicted open-reading frames as they correspond to the mean composition of the complete genomes (A, C) and their amino acid factor loadings (B, D). GC poor genomes (yellow), GC rich genomes (green), hyperthermophiles (red), thermophiles (orange), thermo-acidophiles (red-brown), solventogens (brown), alkalophiles (blue), extreme halophile (navy), and eukaryotes (purple). Note that there is only one genome representative for any cluster of strains or variants (i.e. Ecoli, EcoliE and EcoliH are all represented by Ecoli). In C, all remaining organisms are clustered around the number 1.

Figure 1A). Strong correlations also exist between the first component and several of the factor loadings (Figure 1B). This is in agreement with a previous report [8]. Consequently, genomic GC content will to a large extent determine amino acid usage as well as the choice between small hydrophobic residues Ala/Gly or Ile, positively charged residues Arg or Lys, and large hydrophobic residues Trp or Tyr/Phe. , and most human pathogens do not correspond to this component and have an average composition with regards to these amino acids. The distribution of organisms across this component does not appear to correspond to discrete groupings of organisms that share similar environmental niches, but rather to a 'continuum of lifestyles' [26]. However, unlike previous studies that report correlations of this second principal component with growth temperatures [8,26], our results seem to indicate that this component is likely to correlate with a more complex phenomenon that incorporates growth temperature as well as other physical factors, possibly pH and solvent.
Components 3 and 4 are also significant factors in this multivariate analysis and these account for 10.3% and 7.4% of the variance. We have not determined a measurable factor that can be directly correlated to these components, but they also appear to correspond to environmental niche. However, we see species-specific preferences for Leu, Cys, Asp, Thr, Ser, and to a lesser extent Glu, Gln, His and Met residues ( Figure 1D). Component 3 strongly corresponds to several hyperthermophiles, but inversely corresponds to the extreme halophile Halobacterium, human pathogen Saur, gastro-intestinal tract colonizer Blong, and moderate halophiles and alkalophiles. Halobacterium's increased Asp usage is clearly consistent with its adaptation to intracellular and environmental conditions [42], although it differs to the hyperthermophile preference for the larger, negatively charged Glu. Component 4 has strong correspondence to the eukaryotes (Ecun, Hsap, Mmus, Atha, Cele) that correlates to Cys and Ser.
Taken together, the results from the principal component analysis suggest that the significant variation in amino acid usage within and between species is due in large part to environmental conditions.

Amino acid composition dendrogram
To compare organism amino acid composition, we performed hierarchical clustering using the complete linkage method with distances computed using the Euclidean metric on a dataset that consisted of the mean percent amino acid composition from all predicted open-reading frames for each of the 100 organisms (Figure 2). This method generates clusters of organisms with a similar mean composition across all 20 amino acids that are maximally separated by using the farthest neighbours. The resulting dendrogram presents three large branches within 10 Euclidean difference units. The upper branch clusters genomes with low GC content (yellow), the mid branch clusters mid GC genomes and the lower branch clusters high GC content genomes (green). A feature of clustering by amino acid composition is that phylogenetically related organisms are not necessarily proximate neighbours. For instance, Hsap and Mmus are clustered together, but are separated by a significant distance from Spom, Scer, Atha and Cele as well as the eukaryote Ecun. Oddly, Ecun clusters closely with the hyperthermophilic archaeon Aful and the thermophilic Mthe, and more distantly with a cluster composed of the hyperthermophilic bacteria Aaeo and Tmar and the archaea Paby, Phor and Pfur. However, this organism is not reported to have thermophilic qualities [43]. In another case, hyperthermophiles Aper, Paero and Mkan are clustered together, indicating that organisms that are less phylogenetically related may form tight clusters of organisms that live in similar environments. These results significantly extend previous composition-based dendrograms [8], but differ significantly from other attempts to generate genome-based dendrograms [44,45].
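The clustering step described above can be sketched as follows, assuming each organism is represented by a vector of mean percent amino acid compositions. This is a naive illustration of complete linkage with Euclidean distances, not the S-PLUS routine used for Figure 2; the function names, genome labels and three-residue profiles are hypothetical.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two composition vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def complete_linkage(profiles):
    """Agglomerative clustering; returns the merge history as
    (members_of_cluster_a, members_of_cluster_b, merge_height) tuples."""
    clusters = [(name,) for name in profiles]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Complete linkage: cluster distance is that of the farthest pair.
                d = max(euclidean(profiles[a], profiles[b])
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges

# Hypothetical mean % compositions (Ala, Ile, Lys) for four toy genomes.
toy_profiles = {
    "lowGC_a": [5.0, 9.0, 9.0],
    "lowGC_b": [5.5, 8.5, 9.5],
    "highGC_a": [12.0, 4.0, 3.0],
    "highGC_b": [11.0, 4.5, 3.5],
}
history = complete_linkage(toy_profiles)
```

The merge heights in `history` correspond to the branch heights of a dendrogram; with real data all 20 amino acids would be used rather than three.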

Fold residue preferences
In order to address the question of whether the amino acid composition of ORFs differed from that of folds, as well as whether fold composition was species-specific, we generated over 57,000 conservative domain-based structure models for 95 genomes (see materials and methods). Amino acid compositions were computed across all protein coding regions for each complete genome using either genomic sequences for a given organism (CG) or folds (CF) for the purpose of identifying species-specific as well as pan-specific fold composition bias. Furthermore, indel residues excluded from the modeling exercise comprised <2% of all residues, and these exhibited normal insertion or loop compositions, richer in Pro and Gly but poorer in the small hydrophobic residues. Figure 3 illustrates one case in which the mean composition of Asp is unvarying across all genomes, with the single exception of the extreme halophile Halobacterium. Moreover, we observe a significant increase in Asp residues in the fold regions as compared to the predicted ORFs (t-test: p < 10^-38). Figure 3 also illustrates a case in which the mean composition of Gln varies significantly across the genomes. Virtually all genomes show a decrease of Gln (p < 10^-11) in the fold regions, with the startling exception of all thermophiles as well as Cper, Ecun, Halob, Scoel, Buchn, Fnucl and Mmaze. Although the mean composition of Gln is significantly lower (p < 10^-24) in these thermophiles than in the other genomes, the increase of Gln in the fold is a surprising finding given that amidated residues are susceptible to deamidation at high temperatures [46,47]. However, others have reported that polar residues such as Gln are significantly reduced on the surface of thermophilic intracellular proteins as compared to their mesophilic counterparts, likely reducing the possibility of damaging deamidation reactions [48].
We found that small hydrophobic residues Ala, Gly and Val as well as charged residues Asp, Glu, His and Arg are consistently increased in the fold regions across all organisms ( Figure 4). Furthermore, we observed a significant decrease of amidated residues Asn and Gln as well as larger aromatic residues Phe, Trp, and Tyr, as well as Leu and Ser in the fold regions. It is possible that smaller residues in fold regions allow better packing of the core whereas charged residues are utilized for stabilizing electrostatic interactions including salt bridges. In order to exclude the possibility that our results may be biased due to low compositional complexity of ORF or fold regions, we applied  transmembrane, coiled-coil, compositional bias and low complexity region filtering using the pfilt application from David T. Jones (1997) and found few deviations from these trends (Figure 4). Since a large number of our templates are obtained via crystallography experiments, we cannot rule out the possibility that the fold composition bias may reflect a composition that is more amenable to crystallographic structure determination.

Composition-based scoring functions
Since there exists significant amino acid variability between protein sequences from different organisms, we sought to generate a scoring function that would allow species-specific identification of protein sequences. Two scoring functions indicating the log probability of amino acid occurrence were generated for each organism. The first scoring function, CG, is based on genomic composition and was derived by taking the log of the amino acid frequency across all genomic ORFs for the given organism over the average amino acid frequency of all the genomes. The second scoring function, CF, was generated from fold composition of the aligned sequences and was derived by taking the log of the amino acid frequencies from the aligned residues of the genomic sequence divided by the template residues. In this fashion the reference state for these scoring functions is what we have termed the 'random organism' since it represents a collection of amino acid compositions from a variety of organisms. This then provides the noise of the scoring function from which we are trying to extract a meaningful, species-specific signal. Log-odds potentials of protein substructures are considered additive [49], and in the evaluation of a sequence, the overall score for a sequence is calculated from the sum of the species-specific log-odds scores for each of its residues.
The nature of these scoring functions is such that if the composition of the organism is not particularly different from that of the 'random organism', then the magnitude of the scoring function values will approach 0. For instance, the magnitudes of the Ecoli CG and CF scoring function values are typically smaller than those of either Mjan or Halob (Figure 5). The CG and CF scoring functions are fairly similar and correlate well (86 ± 13%) with each other across all genomes. Mjan has a strong preference for Ile and Lys, but not Gln or Ala, largely because of the effect of genomic GC content on amino acid coding (see PCA section). In contrast, Halob prefers the small hydrophobic Ala residue and the charged Asp residue, but not the amidated Asn nor the positively charged Lys. Thus, these scoring functions reflect the probability of observing any residue in a protein sequence or fold for some genome and are heavily influenced by the GC content of the genome and its residue-based environmental adaptations.
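A minimal sketch of the genomic scoring function, under stated assumptions, may make the construction concrete: per-residue log-odds are the log of an organism's amino acid frequency over the mean frequency across all genomes, and a sequence score is the sum over its residues. The natural logarithm and the pseudocount (added to avoid taking the log of zero) are assumptions not specified in the text, and the genome names and sequences are hypothetical toy data.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def composition(sequences, pseudocount=1.0):
    """Amino acid frequencies over a set of sequences (pseudocount is an assumption)."""
    counts = Counter()
    for seq in sequences:
        counts.update(ch for ch in seq if ch in AMINO_ACIDS)
    total = sum(counts[a] for a in AMINO_ACIDS) + pseudocount * len(AMINO_ACIDS)
    return {a: (counts[a] + pseudocount) / total for a in AMINO_ACIDS}

def genomic_log_odds(genomes):
    """genomes: {name: [protein sequences]} -> {name: {residue: log-odds}}.
    The reference state is the mean frequency over all genomes ('random organism')."""
    freqs = {name: composition(seqs) for name, seqs in genomes.items()}
    reference = {a: sum(f[a] for f in freqs.values()) / len(freqs)
                 for a in AMINO_ACIDS}
    return {name: {a: math.log(f[a] / reference[a]) for a in AMINO_ACIDS}
            for name, f in freqs.items()}

def score(sequence, table):
    """Additive score: sum of per-residue log-odds; unknown symbols are skipped."""
    return sum(table[ch] for ch in sequence if ch in table)

# Hypothetical toy genomes with opposite residue preferences.
toy_genomes = {
    "ile_lys_rich": ["MKIKIIKKLI", "KIKKIIEKIL"],
    "ala_asp_rich": ["MADAADDAVA", "AADDGADARA"],
}
tables = genomic_log_odds(toy_genomes)
best = max(tables, key=lambda name: score("KIKIKI", tables[name]))
```

A residue whose frequency matches the reference contributes a log-odds value near 0, which is why organisms with an unremarkable composition yield scoring functions of small magnitude.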

Cross-validation
As a preliminary test, we evaluated the performance of the CF scoring functions for their ability to detect folds in a species-specific manner. That is, the successful scoring function should identify fold sequences of the parent taxonomy from which the scoring function was derived. The performance of the scoring function was evaluated via a jackknife method in which 10% of the model-template pairs were excluded in generating the scoring function. These excluded pairs were then scored with the scoring function derived from the remaining pairs, and success was achieved when the score obtained from the model fold was greater than that of the template fold. The binary species detection ability of the CF scoring functions to select the model over the template ranged from 65% to 99%, with an average of 85 ± 8% of model sequences being detected from the species-specific fold database (random = 50%).
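The jackknife test can be sketched as below. The text states only that 10% of model-template pairs were excluded; running ten deterministic rounds so that each pair is held out exactly once is an assumption of this sketch, as are the pseudocount, the natural logarithm, the function names and the toy aligned sequences.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"

def fold_log_odds(pairs, pseudocount=1.0):
    """Log of model residue frequency over template residue frequency,
    from (model, template) aligned sequence pairs."""
    model_counts, template_counts = Counter(), Counter()
    for model, template in pairs:
        model_counts.update(model)
        template_counts.update(template)
    extra = pseudocount * len(ALPHABET)
    m_total = sum(model_counts[a] for a in ALPHABET) + extra
    t_total = sum(template_counts[a] for a in ALPHABET) + extra
    return {a: math.log(((model_counts[a] + pseudocount) / m_total)
                        / ((template_counts[a] + pseudocount) / t_total))
            for a in ALPHABET}

def score(sequence, table):
    """Additive log-odds score; symbols outside the table (e.g. gaps) score 0."""
    return sum(table.get(ch, 0.0) for ch in sequence)

def jackknife_success(pairs, folds=10):
    """Fraction of held-out pairs in which the model outscores its template."""
    successes = 0
    for k in range(folds):
        held_out = pairs[k::folds]
        training = [p for i, p in enumerate(pairs) if i % folds != k]
        table = fold_log_odds(training)
        successes += sum(score(m, table) > score(t, table) for m, t in held_out)
    return successes / len(pairs)

# Hypothetical pairs: models favour Lys where templates carry Ala.
toy_pairs = [("KKIKE", "AKIAE"), ("IKKEK", "IAKEA")] * 10
rate = jackknife_success(toy_pairs)
```

Because each comparison is between one model and one template, a scoring function with no discriminating power would be expected to succeed about half of the time.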

Prediction set
We used all 100 CG and 95 CF scoring functions to score every predicted protein sequence from all complete genomes in order to evaluate species-specificity (see Figure 9; Table 1). The purpose of this experiment is to evaluate the effectiveness of the scoring functions in identifying proteins from the parent organism. Log-odds scores were obtained for each protein from each of the complete genomes as evaluated by each of the scoring functions. We also recorded the overall average score obtained by each scoring function across all the ORFs in the genome. In doing so, we discovered that the self-scoring function invariably obtained the lowest overall score (data not shown). The random probability that a scoring function will obtain the best score is determined by the number of best scores included over the total number of scoring functions (i.e. for CG, 10/100 for the top 10 scoring functions using a total of 100 scoring functions), and we can find the maximum value as the difference between the observed success rate and the random probability (Figure 6). We find that the maximum success rate occurs when >20 CG scoring results are considered. However, as a more conservative estimate, one may choose to consider at least the top 5-10 scoring results to overcome the fact that similar scoring functions obtained from effectively redundant genomes will split the number of successful detections. For instance, scoring functions derived from E. coli strains and compositionally similar species (Sent and Styp and their variants) obtained comparable scores, which prevented effective detection of E. coli sequences by any of the E. coli scoring functions when only the top score was considered a detection success. The effect of increasing the number of best scores included from 1 to 5, 10 and 15 can be seen for all scoring functions in Figure 7.
The ability of the CG scoring functions to identify proteins from the parent organism when considering the top 15 scoring results ranged from 51% (EcoliE) to 87% (Paby) with an average of 73 ± 9% success. The most effective scoring functions were derived from the low GC organisms (Wbre, Buch, Bbur, Baphi, Mpul), hyperthermophiles, Halob and several high GC organisms (Ccres, Mtub, Mtub, Scoel, Smel). When including the top 5 scoring results, the success rate decreased to 49 ± 17%. Note that both success rates are significantly higher than random (15/100 or 15% and 5/100 or 5%, respectively). In contrast, the effectiveness of the CF scoring functions varied more across this dataset, ranging from a low of 2% (Cele) to a high of 92% (Mpul) with an average success rate of 55 ± 24% when using the top 15 scores, which decreases to 40 ± 25% when only including the top 5. The most successful scoring functions were derived primarily from GC or AT rich organisms. Taken together, the most successful composition-based scoring functions were those derived from organisms with significant composition bias, either as a result of %GC skew or from a more extreme environmental niche, as is the case for hyperthermophiles, thermophiles and halophiles. Finally, these results indicate that amino acid composition-based scoring functions may be able to identify the taxonomic origin of protein sequences.
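The top-k detection criterion and its random expectation can be illustrated with a short sketch. It assumes that a higher log-odds score is a better score, consistent with the cross-validation criterion; the function names and the score values are hypothetical and are not results from this study.

```python
def top_k_success(scores_by_protein, parent, k):
    """Fraction of proteins for which the parent organism's scoring function
    ranks among the k best-scoring functions (higher score assumed better).
    scores_by_protein: list of {scoring_function_name: score} dicts."""
    hits = 0
    for scores in scores_by_protein:
        ranked = sorted(scores, key=scores.get, reverse=True)
        if parent in ranked[:k]:
            hits += 1
    return hits / len(scores_by_protein)

def random_expectation(k, n_functions):
    """Chance that a given scoring function lands in the top k of n by luck."""
    return k / n_functions

# Hypothetical scores for three Ecoli proteins under four scoring functions.
toy_scores = [
    {"Ecoli": 2.1, "Sent": 2.3, "Halob": -4.0, "Mjan": -1.0},
    {"Ecoli": 1.8, "Sent": 1.2, "Halob": -3.5, "Mjan": -0.5},
    {"Ecoli": 0.4, "Sent": 0.9, "Halob": 1.5, "Mjan": -2.0},
]
for k in (1, 2, 3):
    observed = top_k_success(toy_scores, "Ecoli", k)
    gain = observed - random_expectation(k, len(toy_scores[0]))
```

In the toy data the compositionally similar Sent function outscores Ecoli for the first protein, so Ecoli is missed at k = 1 but detected at k = 2, which is the splitting effect among near-redundant genomes described above.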

Conclusions
In the largest study of its kind, we have identified species-specific amino acid composition differences across the predicted open-reading frames of 100 complete genomes. Continuously updated results are available at [http://genome.mshri.on.ca]. Our principal components analysis supports the idea that environmental niche is a major factor for the amino acid composition differences found between species. However, our results raise the possibility that this principal component corresponds more to a complex mixture of environmental influences such as pH, pressure, salt and solute concentrations and, to some lesser extent, growth temperature [8,26].

Figure 6
CG: increased detection when including up to 20 top scoring results. The average success rate determined for scoring functions detecting sequences from their parent organism. The average success rate increases as a logarithmic function of the number of top scoring results included (blue). The random probability that a scoring function will detect the sequence is a linear function (red). The maximum difference between the observed success and the random probability occurs when 15 or 16 top scores are included for successful detection. Error bars are shown for the average success rate.

We observed an increased preference for small hydrophobic and charged residues over larger aromatic residues across all species after conservatively modeling 57,840 folds. Moreover, these fold composition biases also illustrate species-specific residue preferences. These biases provided an opportunity for the first time to derive and test simple yet effective species-specific scoring functions. We found that the fold scoring functions are 85 ± 8% effective in detecting a species-specific fold sequence. Moreover, we found that the genomic composition scoring function successfully identified sequences from its parent organism with a surprising 73 ± 9% overall accuracy.
The species-specific composition bias suggests that the variable amino acids are available for structural and/or environmental optimization aspects of proteins. We are currently investigating the usefulness of the species-specific composition-based scoring functions in identifying variable composition regions of protein structures and whether they correspond to structural/functional regions. We are also investigating the possibility of using these scoring functions to find proteins that are non-native to an organism, possibly indicating horizontal transfer. Scoring functions derived from this work can be used in future species-specific protein and fold identification and sequence optimization experiments.

Methods
Non-redundant protein sequences determined from each of the complete genomes ( were computed using all protein coding regions for each complete genome by software developed in our laboratory. Principal components analysis and the amino acid composition dendrogram were generated using the S-PLUS statistics package. Two-tailed paired t-tests were performed to test the null hypothesis that the ORF vs fold mean compositions for each amino acid were the same. All applications were written in ANSI C using the cross-

Conservative fold modeling
Domains are the fundamental units of a polypeptide chain, or parts of a polypeptide chain, that are thought to fold independently into a stable tertiary structure. Since domains are often units of function and different domains of a protein are often associated with different functions, we evaluated sequence alignments on a structural domain-by-domain basis rather than by global alignment. This provides a conservative framework to evaluate structural alignments.

Sequence and structure
For each protein sequence from a completely sequenced and annotated genome, herein referred to as a genomic sequence, we identified neighbour sequences, that is, sequences in the non-redundant protein sequence database sharing significant levels of similarity (expect value < 0.01) using NBLAST, a cluster-computer variant of BLAST [52] (Table 1). No efforts were made to minimize a possible bias contributed by paralogous genomic sequences. Neighbour sequences with 3D structures, herein referred to as templates, were identified using SeqHound [51], in a similar fashion to the NCBI's genome annotation service [53]. These genomic sequences and their corresponding templates are then used to generate hi-fidelity sequence to structure alignments.

Figure 8
Sequence to structural domain alignment. Sequence to structural domain alignments (A, B). A genomic sequence (SEQ) is aligned to a homologous sequence with a 3D structure (STR) using a secondary structure profile using ClustalW. Note the insertion of gaps (denoted by -, red) in non-structured regions of the 3D structure. In the MERGE step, gaps in the structure are masked out, and eliminated in the compression step (COMP). At this point, the number of identical residues and the number of residues in the genomic sequence occupying a domain position in the structure are counted. Since the domain 1 alignment passes the minimal 25% identity and 75% occupancy, it is used for further analysis. However, the %identity in the domain 2 alignment (B) is lower than the threshold of 25%, and the entire domain alignment is masked out and not used in any further analyses.

Hi-fidelity sequence to structure alignment
We modified the ClustalW software package [54] to initiate a global alignment of two neighbour sequences using the PAM series substitution matrices and apply position specific gap penalties by virtue of a secondary structure profile. The profile is derived from the structure's annotation information provided by the authors of the published structure as well as from NCBI's vector alignment search tool, VAST [55]. A greater weight is placed when the two sources agree, and this effectively forces gaps into unstructured regions lacking alpha helices and beta strands.
To create conservative, fold-based alignments, gaps that were added to the genomic sequence are masked out since there is no correspondence to the structure and gaps inserted into the structure template to accommodate query insertions are eliminated (Figure 8). This gap-handling procedure had no visible effect on composition analyses that are later described.
To reject poor alignments and enhance the fidelity of the global alignment, both the sequence identity and structural position occupancy are determined over each VAST-identified structural domain. Various threshold levels were tested, and an alignment sequence identity of 25% and domain occupancy of 75% were found to provide the optimal compromise between sensitivity and specificity (data not shown). If less than 25% of the aligned residues are identical or less than 75% of the aligned residues occupy residue positions in the domain, the domain is masked out completely and not used in any further computations (Figure 8). These selection criteria generate relevant domain homologues and provide the ability to discriminate subtle sequence changes that are independent of fold in a statistically observable manner. When an alignment across a domain is found to satisfy the minimum constraints specified above, a structural model is generated for the genomic sequence by virtue of a sequence-to-structure alignment, herein referred to as the model.
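The domain acceptance test can be sketched as follows, assuming the alignment over one domain has already been merged and compressed so that every column is a template position and a '-' in the genomic sequence marks an unoccupied position. Identity is taken over occupied positions and occupancy over all domain positions; these exact definitions and the function names are assumptions of this sketch, and the aligned strings are hypothetical.

```python
def domain_stats(genomic, template):
    """Return (identity, occupancy) for one compressed domain alignment.
    genomic: aligned genomic residues, '-' where no residue occupies the position.
    template: template residues for the same domain positions."""
    if len(genomic) != len(template):
        raise ValueError("aligned sequences must have equal length")
    occupied = [(g, t) for g, t in zip(genomic, template) if g != "-"]
    identical = sum(g == t for g, t in occupied)
    identity = identical / len(occupied) if occupied else 0.0
    occupancy = len(occupied) / len(template) if template else 0.0
    return identity, occupancy

def domain_passes(genomic, template, min_identity=0.25, min_occupancy=0.75):
    """A domain is kept only if BOTH thresholds are met; otherwise it is masked."""
    identity, occupancy = domain_stats(genomic, template)
    return identity >= min_identity and occupancy >= min_occupancy

# Hypothetical aligned domain segments.
kept = domain_passes("MKV-LAGT", "MKIALSGT")        # high identity and occupancy
sparse = domain_passes("M------T", "MKIALSGT")      # too few occupied positions
divergent = domain_passes("QWERTYIP", "MKIALSGT")   # fully occupied, no identity
```

Requiring both thresholds matches the Figure 8 example, in which a domain is masked out for failing the identity threshold alone.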
Since a genomic sequence may make many models using different templates, only the single best model is selected to minimize sampling bias. The selection criterion emphasizes the use of multi-domain structural models by using a scoring function derived from residue length of the non-masked out aligned domain region(s), the fraction of residues that are identical and the fraction of residues occupying domain positions. The model with the best score is then selected to represent that genomic sequence.
For each representative model, the sequence alignments between the genomic sequence and the template, along with the corresponding secondary structure are written to a database, herein known as the species-specific fold database. This fold database is the source of model and template sequences for determining fold composition and deriving species-specific fold scoring functions.

Model quality
Our method exploits species-specific optimizations at the sequence level by making accurate structure-based alignments for genomic sequences. We generated models for 95 of the 100 genomes, with 5 genomes having been very recent additions. Initially, there are as many 3D neighbours as genomic sequences (Table 1). However, 24 ± 10% of genomic sequences make structural models and only 19 ± 6% settle with a single representative model structure that passes our structural domain alignment criteria. The representative models are 168 ± 10 residues in length and possess 41 ± 5% sequence identity and 96.7 ± 0.3% domain occupancy with the template structure, which is clearly higher than our set minimum requirements. Furthermore, template structures are used 1.4 ± 0.4 times for model building, thereby minimizing structure over-sampling and providing more unique templates. Interestingly, at least 36 to as many as 287 different organisms contribute 5.8 ± 3.5 template structures to each genome modeling exercise. 30 ± 12% of the templates are obtained from E. coli and 23 ± 9% of structures are obtained from thermophilic species. Our models hold properties of 'good' models since they are based on at least 30% sequence identity, are shorter than 200 residues and are aligned along template domains, in agreement with other published criteria [56].
In general, the number of final models generated for complete genomes reported elsewhere is greater than the number generated with our method. For example, the NCBI provides a substantially larger set of 3D structure neighbours for complete genomes, in which as many as 39% of sequences are reported to have structure neighbours [http://www.ncbi.nlm.nih.gov/PMGifs/Genomes/PDB_bact.html]. ModBase has on average between 2 and 4 models per sequence, of which roughly 44% are claimed to be reliable [http://pipe.rockefeller.edu/modbase]. Since our comparative modeling method is more conservative in that it does not attempt to model side-chains, loops, or regions with no template, and our alignments are evaluated over smaller, domain-focused regions, we expect fewer errors [57].

Authors' contributions
KM provided the framework for complete genome analysis with her development of the SeqHound sequence and structure database management system. MD carried out the statistical analysis, derived and tested the scoring functions and drafted the manuscript. MD and CWVH jointly conceived of the study, and participated in its design and coordination.

Table 1 - Summary statistics for complete genomes modeling and scoring
in sequence to structure alignments, and the average number of residues per model (Res). The taxonomic contribution is listed by the number of organisms that contributed template structures (OC), the average number of structures contributed by each (OF), the percentage of templates that were from E. coli (%E) and thermophiles (%T). Finally, the percentage of correctly identified sequences in jackknifing for the CF scoring function (CF(JK)) and the percentage of correctly identified sequences using the top 15 scores for the CG scoring function (CG(P)) and for the CF scoring function (CF(P)). NA - not available.