- Research article
- Open Access
Species-specific protein sequence and fold optimizations
BMC Bioinformatics volume 3, Article number: 39 (2002)
An organism's ability to adapt to its particular environmental niche is of fundamental importance to its survival and proliferation. In the largest study of its kind, we sought to identify and exploit the amino-acid signatures that make species-specific protein adaptation possible across 100 complete genomes.
Environmental niche was determined to be a significant factor in variability from correspondence analysis using the amino acid composition of over 360,000 predicted open reading frames (ORFs) from 17 archaea, 76 bacteria and 7 eukaryote complete genomes. Additionally, we found clusters of phylogenetically unrelated archaea and bacteria that share similar environments by amino acid composition clustering. Composition analyses of conservative, domain-based homology modeling suggested an enrichment of small hydrophobic residues Ala, Gly, Val and charged residues Asp, Glu, His and Arg across all genomes. However, larger aromatic residues Phe, Trp and Tyr are reduced in folds, and these results were not affected by low complexity biases. We derived two simple log-odds scoring functions from ORFs (CG) and folds (CF) for each of the complete genomes. CF achieved an average cross-validation success rate of 85 ± 8% whereas the CG detected 73 ± 9% species-specific sequences when competing against all other non-redundant CG. Continuously updated results are available at http://genome.mshri.on.ca.
Our analysis of amino acid compositions from the complete genomes provides stronger evidence for species-specific and environmental residue preferences in genomic sequences as well as in folds. Scoring functions derived from this work will be useful in future protein engineering experiments and possibly in identifying horizontal transfer events.
An organism may increase its fitness in some range of environmental conditions through evolution. Fundamental to the survival of cells is the ability to modulate fluctuations in external osmotic and atmospheric pressure, temperature and pH via the acquisition or development of advantageous molecular mechanisms [1–4]. These mechanisms include the uptake of small molecules, osmolytes or metals via transporters, as found for increased iron uptake allowing enhanced growth of Pasteurella multocida and in the accumulation of high concentrations of the stabilizing K+ among halophiles. Other mechanisms include modification of the atomic and residue composition of proteins, or the acquisition of environmental adaptive genes via lateral gene transfer, as was likely the case for the thermophilic bacterium Thermotoga maritima and the archaeon Sulfolobus solfataricus P2. In other cases, gene duplication events augment the ability of an organism to adapt to extreme environments by expanding specific protein families, including additional stress response and damage control genes that provide increased protection for the radiation-resistant bacterium Deinococcus radiodurans [11, 12]. Interestingly, in symbionts such as Buchnera sp. APS, Agrobacterium tumefaciens and Sinorhizobium meliloti, shared genetic material may increase overall fitness, but this effectively results in the loss of redundant genes and imposes host-symbiont dependencies. In other organisms completely new and innovative mechanisms are required for adapting to the most extreme of environments.
In adaptation to the most extreme environments, it is expected that the protein complement also possesses the organism's adaptive property . For instance, hyperthermophilic proteins must not only be functional, but optimized towards the host's extremely hot (>80°C) physical environment. Although in vivo protection factors have been identified that can stabilize proteins in vitro at high temperatures  and chaperone proteins can help refold misfolded proteins and prevent aggregation [16–18], the majority of foreign proteins cloned and expressed in E. coli retain all of the native enzyme's biochemical properties, including proper folding, thermostability and optimal activity consistent with the organism's optimal growth temperatures [19–21]. Thus, it is likely that sequence optimizations are required to ensure protein activity and folding in organisms whose growth conditions might otherwise adversely affect proteins.
Researchers have studied complete or partial genomes using bioinformatics in addition to the traditional comparative sequence-structure and structure-function mutation studies to identify stability factors. Recent studies of complete or partial genomes have identified sequence-based correlations between organisms using amino acid compositions. Lobry demonstrated the correlation between G+C content and codon usage across bacterial sequences  and G+C content and amino acid composition correlations have been extended to 25 complete genomes . Moreover, codon usage and amino acid preferences for thermophiles are well established and have been extended to complete genomes [23–26]. However, these generalizations do not necessarily agree with comparative sequence-structure studies. Comparative studies often exploit sequence or structure based alignments to determine similarities and differences. Investigation of thermostability factors across 10 organisms including psychrophiles (cold-tolerant), mesophiles to hyperthermophiles with triosephosphate isomerase failed to identify significant correlations of composition with thermostability . Further uncertainty arises from indications that different protein families adapt to temperature conditions by different sets of structural mechanisms . How then to unify amino acid composition preferences with species-specific structural adaptations?
Algorithms have been designed to predict certain protein features primarily from sequence composition including low complexity regions , transmembrane segments , signal peptides , coiled-coils , secondary structure elements , structural classes , hydrophobicity , sub-cellular location  and have been used to increase remote sequence similarity searching [37, 38]. Moreover, genomic base content has been used to predict open-reading frames and in-site splicing [39–41]. However, no algorithms have been designed to explore adaptation of proteins to their host environment, especially in a species-specific manner.
Species-specific adaptive optimizations might be expected to be subtle and hard to find in any individual sequence, yet sufficiently common across the bulk of genomic proteins that they may be detected using statistical methods. We demonstrate here that such subtle adaptive optimizations do exist in many individual organisms and that these can be extracted. We derive species-specific protein sequence and fold scoring functions from residue preferences found in predicted open reading frames and conservative structural models. The resulting scoring functions are effective at detecting species-specific protein sequences and folds from amino acid composition alone.
Results and Discussion
Principal Components Analysis
Principal Components Analysis (PCA) was performed with the amino acid compositions of the entire set of protein coding regions from each of the complete genomes (Figure 1). PCA transforms a number of (possibly) correlated variables into uncorrelated variables called principal components that account for the variance in the dataset (see http://www.statsoftinc.com/textbook/stfacan.html for a brief overview). The analysis involves projecting the original variables onto the principal components; the resulting factor loadings can be interpreted as correlation coefficients (Figure 1B,1D). Factor loadings of ≥ 0.6 are considered to be strong correlations. Simultaneously, the correspondence of the mean genome amino acid compositions to the principal components may be examined to identify genomic usage or preferences that appear to correlate with the factor loadings (Figure 1A,1C).
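The relationship between component variance, genome scores and factor loadings can be sketched as follows. This is only an illustration under stated assumptions: the analysis in this work was run in S-PLUS, whereas the sketch uses NumPy, assumes a correlation-matrix PCA (standardized variables), and uses an invented toy matrix of six hypothetical genomes and four amino acids.

```python
import numpy as np

# Hypothetical toy data (invented for illustration): rows are genomes,
# columns are mean percent composition of four amino acids.
amino_acids = ["Ala", "Gly", "Lys", "Glu"]
gc = np.array([0.25, 0.30, 0.40, 0.50, 0.65, 0.70])  # genomic GC fraction
niche = np.array([1.0, 0.0, 0.0, 1.0, 0.0, 1.0])     # e.g. a thermophile flag
X = np.column_stack([
    5.0 + 8.0 * gc,                  # Ala rises with GC
    4.0 + 6.0 * gc,                  # Gly rises with GC
    10.0 - 8.0 * gc + 0.3 * niche,   # Lys falls with GC
    6.0 + 1.0 * niche,               # Glu follows the niche flag
])

def pca(X):
    """PCA on standardized columns (assumption: correlation-matrix PCA).

    Returns the fraction of variance per component, the genome scores,
    and the factor loadings (variable-to-component correlations).
    """
    Xs = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    corr = np.cov(Xs, rowvar=False)
    evals, evecs = np.linalg.eigh(corr)
    order = np.argsort(evals)[::-1]          # largest variance first
    evals, evecs = evals[order], evecs[:, order]
    ratio = evals / evals.sum()
    scores = Xs @ evecs
    loadings = evecs * np.sqrt(np.clip(evals, 0.0, None))
    return ratio, scores, loadings

ratio, scores, loadings = pca(X)
strong = np.abs(loadings) >= 0.6             # "strong" loadings per the text
for name, row in zip(amino_acids, loadings):
    print(name, np.round(row[:2], 2))
```

The sign of each component is arbitrary, so only the relative signs of loadings on the same component (for example Ala versus Lys on component 1) carry meaning.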
The most significant principal component accounted for 47.5% of the variance and showed a strong correlation to DNA base pair content (94%). The left of this component corresponds to low GC organisms such as Buchnera sp. (~27%), Mpul (~27%), Bbur (29%), Uure (26%), Wbre (23%) whereas the right of the component corresponds to high GC organisms including Mtub (66%), various plant pathogens (Xanthomonas sp., Mloti), soil bacterium Scoel (72%) and radiation-resistant Drad (66.6%) (Figure 1A). Strong correlations also exist between the first component and several of the factor loadings (Figure 1B). The correlated factor loadings have either [G|C] or [A|T] in the first two codon positions for some codon. The effect for the standard codon table is that GC rich codons [C|G] [C|G] [X] encode amino acids Pro, Arg, Gly, Ala, Trp and GC poor codons [A|T] [A|T] [X] encode Phe, Leu, Ile, Asn, Lys, and Tyr (as well as Met and 2 stop codons). This is in agreement with a previous report. Consequently, genomic GC content will to a large extent determine amino acid usage as well as the choice between small hydrophobic residues Ala/Gly or Ile, positively charged residues Arg or Lys, and large hydrophobic residues Trp or Tyr/Phe.
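The codon patterns above can be checked directly against the standard genetic code. The sketch below is an illustration rather than part of the original analysis; it builds the standard table in the conventional TCAG ordering and collects the amino acids whose codons match each pattern. Under a strict reading of [C|G] [C|G] [X], Trp (TGG) is GC rich overall but does not match the two-position pattern, so it does not appear in the computed set.

```python
# Standard genetic code in the conventional TCAG ordering ('*' = stop).
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]
TABLE = dict(zip(CODONS, AMINO))

def encoded_by(first_two):
    """Amino acids whose codons have both leading bases in `first_two`."""
    return {aa for codon, aa in TABLE.items()
            if codon[0] in first_two and codon[1] in first_two}

gc_rich = encoded_by("GC")   # codons of the form [C|G][C|G][X]
at_rich = encoded_by("AT")   # codons of the form [A|T][A|T][X]
at_stops = [c for c, aa in TABLE.items()
            if aa == "*" and c[0] in "AT" and c[1] in "AT"]

print(sorted(gc_rich))   # ['A', 'G', 'P', 'R']
print(sorted(at_rich))   # ['*', 'F', 'I', 'K', 'L', 'M', 'N', 'Y']
print(at_stops)          # ['TAA', 'TAG']
```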
The second largest principal component accounts for 15.5% of the variance and appears to correspond to the environmental niche (Figure 1A). Hyperthermophiles (Mkan, Paby, Pfur, Phor, Aful, Aaeo, Tmar, Tten, and Mjan), thermophile Mthe, extreme halophile Halo, thermo-acidophiles (Taci, Tvo, Ssol, Stok), and solventogenic bacteria (Cace, Cper and Fnucl) correspond strongly to weakly, respectively, to component 2. Strong correlations to this component exist for Glu and Val, although opposite correlations exist for Gln, His, Thr, Ser and Cys, thereby suggesting the preferential usage of these amino acids by those organisms. A discussion regarding amino acid preferences for hyperthermophiles can be found elsewhere . Eukaryotes (Hsap, Mmus, Scer, Cele and Atha), with the exception of the obligate intracellular eukaryote parasite Ecun, have a strong, but opposite correspondence to component 2. These cluster with chlamydias/chlamydophilas (Cmur, Ctra, Cpne) and the inverse correspondence also indicates a significant increase in the genomic amino acid usage of Gln, His, Thr, Ser, and Cys, and decrease of Glu or Val. Interestingly, plant pathogens (Xaxon, Xcamp, Mloti, Rsol, Xfas, AtumC and AtumU), moderate halophiles and alkalophiles (Bsub, Bhal, Linn, Lact, Lmono, Oihey), and most human pathogens do not correspond to this component and have an average composition with regards to these amino acids. The distribution of organisms across this component does not appear to correspond to discrete groupings of organisms that share similar environmental niches, but rather to a 'continuum of lifestyles' . However, unlike previous studies that report correlations of this second principal component with growth temperatures [8, 26], our results seem to indicate that this component is likely to correlate to a more complex phenomenon that incorporates growth temperature as well as other physical factors, possibly pH and solvent.
Components 3 and 4 are also significant factors in this multivariate analysis and these account for 10.3% and 7.4% of the variance, respectively. We have not determined a measurable factor that can be directly correlated to these components, but they also appear to correspond to environmental niche. However, we see species-specific preferences for Leu, Cys, Asp, Thr, Ser, and to a lesser extent Glu, Gln, His and Met residues (Figure 1D). Component 3 strongly corresponds to several hyperthermophiles, but inversely corresponds to the extreme halophile Halobacterium, human pathogen Saur, gastro-intestinal tract colonizer Blong, and moderate halophiles and alkalophiles. Halobacterium's increased Asp usage is clearly consistent with its adaptation to intracellular and environmental conditions, although it differs from the hyperthermophile preference for the larger, negatively charged Glu. Component 4 has strong correspondence to the eukaryotes (Ecun, Hsap, Mmus, Atha, Cele) that correlates to Cys and Ser.
Taken together, the results from the principal component analysis suggest that the significant variation in amino acid usage among and between species is due to a large extent to environmental conditions.
Amino acid composition dendrogram
To compare organism amino acid composition, we performed hierarchical clustering using the complete linkage method with distances computed using the Euclidean metric on a dataset that consisted of the mean percent amino acid composition from all predicted open-reading frames for each of the 100 organisms (Figure 2). This method generates clusters of organisms with a similar mean composition across all 20 amino acids that are maximally separated by using the farthest neighbours. The resulting dendrogram presents three large branches within 10 Euclidean difference units. The upper branch clusters genomes with low GC content (yellow), the mid branch clusters mid GC genomes and the lower branch clusters high GC content genomes (green). A feature of clustering by amino acid composition is that phylogenetically related organisms are not necessarily proximate neighbours. For instance, Hsap and Mmus are clustered together, but are separated by a significant distance from Spom, Scer, Atha and Cele as well as the eukaryote Ecun. Oddly, Ecun clusters closely to hyperthermophilic archaeon Aful and thermophilic Mthe and more distantly to a cluster comprised of hyperthermophilic bacteria Aaeo and Tmar and archaea Paby, Phor and Pfur. However, this organism is not reported to have thermophilic qualities. In another case, hyperthermophiles Aper, Paero and Mkan are clustered together, indicating that organisms that are less phylogenetically related may form tight clusters of organisms that live in similar environments. These results significantly extend previous composition-based dendrograms, but differ significantly from other attempts to generate genome-based dendrograms [44, 45].
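Complete-linkage clustering with Euclidean distances can be sketched in a few lines. The version below is a naive, standard-library illustration, not the S-PLUS routine used for Figure 2; the genome names and three-value composition profiles are hypothetical.

```python
from itertools import combinations
from math import dist

def complete_linkage(profiles):
    """Naive agglomerative clustering with farthest-neighbour distances.

    `profiles` maps a genome name to its composition vector. Returns the
    merges in order as (cluster_a, cluster_b, distance) tuples.
    """
    clusters = [(name,) for name in profiles]
    merges = []

    def gap(pair):
        a, b = pair
        # Complete linkage: distance between the two farthest members.
        return max(dist(profiles[x], profiles[y]) for x in a for y in b)

    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=gap)
        merges.append((a, b, gap((a, b))))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges

# Hypothetical mean percent compositions for three amino acids.
profiles = {
    "lowGC_1": (9.0, 4.0, 10.0),
    "lowGC_2": (8.5, 4.5, 9.5),
    "highGC_1": (4.0, 12.0, 3.0),
    "highGC_2": (4.5, 11.0, 3.5),
}
merges = complete_linkage(profiles)
for a, b, d in merges:
    print(a, b, round(d, 2))
```

In this toy input the two low GC profiles merge first and the two high GC profiles second, mirroring the GC-driven branches described above.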
Fold residue preferences
In order to address the question of whether the amino acid composition of ORFs differed from that of folds, as well as whether fold composition was species-specific, we generated over 57,000 conservative domain-based structure models for 95 genomes (see materials and methods). Amino acid compositions were computed across all protein coding regions for each complete genome using either genomic sequences for a given organism (CG) or fold (CF) for the purpose of identifying species-specific as well as pan-specific fold composition bias. Furthermore, indel residues excluded from the modeling exercise comprised <2% of all residues and these exhibited normal insertion or loop compositions richer in Pro and Gly, but poorer in the small hydrophobic residues. Figure 3 illustrates one case in which the mean composition of Asp is unvarying across all genomes, with the single exception of the extreme halophile Halobacterium. Moreover, we observe a significant increase in Asp residues in the fold regions as compared to the predicted ORF (t-test: p < 10^-38). Figure 3 also illustrates a case in which the mean composition of Gln varies significantly across the genomes. Virtually all genomes show a decrease of Gln (p < 10^-11) in the fold regions, with the startling exception of all thermophiles as well as Cper, Ecun, Halob, Scoel, Buchn, Fnucl and Mmaze. Although the mean composition of Gln is significantly lower (p < 10^-24) in these thermophiles than the other genomes, the increase of Gln in the fold is a surprising finding given that amidated residues are susceptible to deamidation at high temperatures [46, 47]. However, others have reported that polar residues such as Gln are significantly reduced on the surface of thermophilic intracellular proteins as compared to their mesophilic counterparts, likely reducing the possibility of damaging deamidation reactions.
We found that small hydrophobic residues Ala, Gly and Val as well as charged residues Asp, Glu, His and Arg are consistently increased in the fold regions across all organisms (Figure 4). Furthermore, we observed a significant decrease of amidated residues Asn and Gln as well as larger aromatic residues Phe, Trp, and Tyr, as well as Leu and Ser in the fold regions. It is possible that smaller residues in fold regions allow better packing of the core whereas charged residues are utilized for stabilizing electrostatic interactions including salt bridges. In order to exclude the possibility that our results may be biased due to low compositional complexity of ORF or fold regions, we applied transmembrane, coiled-coil, compositional bias and low complexity region filtering using the pfilt application from David T. Jones (1997) and found few deviations from these trends (Figure 4). Since a large number of our templates are obtained via crystallography experiments, we cannot rule out the possibility that the fold composition bias may reflect a composition that is more amenable to crystallographic structure determination.
Composition-based scoring functions
Since there exists significant amino acid variability between protein sequences from different organisms, we sought to generate a scoring function that would allow species-specific identification of protein sequences. Two scoring functions indicating the log probability of amino acid occurrence were generated for each organism. The first scoring function, CG, is based on genomic composition and was derived by taking the log of the amino acid frequency across all genomic ORFs for the given organism over the average amino acid frequency of all the genomes. The second scoring function, CF, was generated from fold composition of the aligned sequences and was derived by taking the log of the amino acid frequencies from the aligned residues of the genomic sequence divided by the template residues. In this fashion the reference state for these scoring functions is what we have termed the 'random organism' since it represents a collection of amino acid compositions from a variety of organisms. This then provides the noise of the scoring function from which we are trying to extract a meaningful, species-specific signal. Log-odds potentials of protein substructures are considered additive , and in the evaluation of a sequence, the overall score for a sequence is calculated from the sum of the species-specific log-odds scores for each of its residues.
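A minimal sketch of a CG-style log-odds table and the additive sequence score follows. Several details are assumptions made for illustration and are not specified in the text: the natural logarithm, the add-one pseudocount that avoids log(0), the reduced four-letter alphabet, and the two invented genomes. The background is the mean of the per-genome frequencies, standing in for the 'random organism'.

```python
from collections import Counter
from math import log

ALPHABET = "ADKN"  # reduced alphabet, for illustration only

def frequencies(seqs, alphabet=ALPHABET):
    """Residue frequencies over a set of sequences (add-one pseudocount)."""
    counts = Counter(aa for s in seqs for aa in s if aa in alphabet)
    total = sum(counts.values())
    return {a: (counts[a] + 1) / (total + len(alphabet)) for a in alphabet}

def log_odds(foreground, background):
    """Per-residue log-odds of a genome relative to the reference state."""
    return {a: log(foreground[a] / background[a]) for a in foreground}

def score(seq, table):
    """Additive score: sum of per-residue log-odds (unknown residues = 0)."""
    return sum(table.get(aa, 0.0) for aa in seq)

# Hypothetical genomes; real tables would use every predicted ORF.
genomes = {
    "halo_like": ["ADDAADDA", "DADAKDAN"],
    "meso_like": ["AKDNKNAD", "NKADKNDA"],
}
freqs = {name: frequencies(seqs) for name, seqs in genomes.items()}
background = {a: sum(f[a] for f in freqs.values()) / len(freqs)
              for a in ALPHABET}
tables = {name: log_odds(f, background) for name, f in freqs.items()}

print(round(score("ADDA", tables["halo_like"]), 3))  # positive
print(round(score("KNKK", tables["halo_like"]), 3))  # negative
```

A CF-style table would use the same `log_odds` function with the aligned model residues as the foreground and the template residues as the background.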
The nature of these scoring functions is such that if the composition of the organism is not particularly different from that of the 'random organism', then the magnitude of the scoring function values will approach 0. For instance, the magnitudes of the Ecoli CG and CF scoring function values are typically smaller than those of either Mjan or Halob (Figure 5). The CG and CF scoring functions are fairly similar and correlate well (86 ± 13%) with each other across all genomes. Mjan has a strong preference for Ile and Lys, but not Gln or Ala, largely because of the codon bias imposed by the GC content of its genome (see PCA section). In contrast, Halob prefers the small hydrophobic Ala residue and the charged Asp residue, but not the amidated Asn nor the positively charged Lys. Thus, these scoring functions reflect the probability of observing any residue in a protein sequence or fold for some genome and are heavily influenced by the GC content of the genome and its residue-based environmental adaptations.
As a preliminary test, we evaluated the performance of the CF scoring functions for their ability to detect folds in a species-specific manner. That is, the successful scoring function should identify fold sequences of the parent taxonomy from which the scoring function was derived. The performance of the scoring function was evaluated via a jackknife method in which 10% of the model-template pairs were excluded in generating the scoring function. These excluded pairs were then scored with the scoring function built without them, and success was achieved when the score obtained from the model fold was greater than that of the template fold. The binary species detection ability of the CF scoring functions to select the model over the template ranged from 65% to 99% with an average of 85 ± 8% of model sequences being detected from the species-specific fold database (random = 50%). The best CF detections (>95%) were made with scoring functions derived from those organisms found to vary the most in composition including Mpul (99.4%), Buch, Bbur, Halob, Hpylo, Mjan, Mgen, Uure (96.2%). In contrast, the poorest CF detections were made by common bacteria and pathogen scoring functions from Ecoli variants, Cele, Hsap, Nmeni and Sent. The poor results from these scoring functions reflect the similar model-template composition. In fact, the Ecoli variants obtained ~50% of their template structures from E. coli, Cele obtained ~40% of template structures from human, Hsap obtained ~25% of its structures from mouse and 15% from rat and Mmus obtained 46% of template structures from human. Excluding, or at least limiting, these structure templates would increase the difference in model-template composition and likely generate a more useful scoring function. Thus, these results indicate the admirable species-specific detection ability of the CF scoring functions on short species-specific domain sequences. Cross-validation was not performed for the CG scoring functions.
We used all 100 CG and 95 CF scoring functions to score every predicted protein sequence from all complete genomes in order to evaluate species-specificity (see Figure 9 (Table 1)). The purpose of this experiment is to evaluate the scoring function effectiveness in identifying proteins from the parent organism. Log-odds scores were obtained for each protein from each of the complete genomes as evaluated by each of the scoring functions. We also recorded the overall average score obtained by each scoring function across all the ORFs in the genome. In doing so, we discovered that the self-scoring function invariably obtained the lowest overall score (data not shown). The random probability that a scoring function will obtain the best score is determined by the number of best scores included over the total number of scoring functions (i.e. for CG 10/100 for the top 10 scoring functions using a total of 100 scoring functions), and the most informative number of best scores to include is found where the difference between the observed success rate and the random probability is greatest (Figure 6). We find that the maximum success rate occurs when >20 CG scoring results are considered. However, as a more conservative estimate, one may choose to consider at least the top 5–10 scoring results to overcome the fact that similar scoring functions obtained by effectively redundant genomes will split the number of successful detections. For instance, scoring functions derived from E. coli strains and compositionally similar species (Sent and Styp and their variants) obtained comparable scores, which prevented effective detection of E. coli sequences by any of the E. coli scoring functions when only the top score was considered a detection success. The effect of increasing the number of best scores included from 1 to 5, 10 and 15 can be seen for all scoring functions in Figure 7.
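The top-k detection criterion and its random baseline can be sketched as below. The sketch assumes that a higher score is better, consistent with the jackknife success criterion, and uses invented scores for three proteins evaluated by three scoring functions; the function names are hypothetical.

```python
def top_k_success(score_rows, parents, k):
    """Fraction of proteins whose parent scoring function ranks in the top k.

    `score_rows` holds one dict per protein mapping scoring-function name
    to score; `parents` gives the true organism of each protein.
    Assumption: higher scores are better.
    """
    hits = 0
    for row, parent in zip(score_rows, parents):
        ranked = sorted(row, key=row.get, reverse=True)
        hits += parent in ranked[:k]
    return hits / len(parents)

def random_baseline(k, n_functions):
    """Chance that a given function lands in the top k of n by luck."""
    return k / n_functions

# Invented scores: the third protein is only detected once k reaches 2.
score_rows = [
    {"Mjan": 5.0, "Halob": -3.0, "Ecoli": 0.5},
    {"Mjan": -4.0, "Halob": 6.0, "Ecoli": 0.2},
    {"Mjan": 0.4, "Halob": -0.1, "Ecoli": 0.3},
]
parents = ["Mjan", "Halob", "Ecoli"]

for k in (1, 2):
    observed = top_k_success(score_rows, parents, k)
    gain = observed - random_baseline(k, n_functions=3)
    print(k, round(observed, 2), round(gain, 2))
```

The `gain` column is the difference between observed success and the random probability, which is the quantity examined when choosing how many best scores to include.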
The ability of the CG scoring functions to identify proteins from the parent organism when considering the top 15 scoring results ranged from 51% (EcoliE) to 87% (Paby) with an average 73 ± 9% success. The most effective scoring functions were derived from the low GC organisms (Wbre, Buch, Bbur, Baphi, Mpul), hyperthermophiles, Halob and several high GC organisms (Ccres, Mtub, Mtub, Scoel, Smel). When including the top 5 scoring results, the success rate decreased to 49 ± 17%. Note that both success rates are significantly higher than random (15/100 or 15%, and 5/100 or 5%, respectively). In contrast, the effectiveness of the CF scoring functions varied more across this dataset, ranging from a low of 2% (Cele) to a high of 92% (Mpul) with an average success rate of 55 ± 24% when using the top 15 scores, which decreases to 40 ± 25% when only including the top 5. The most successful scoring functions were derived primarily from GC or AT rich organisms. Taken together, the most successful composition-based scoring functions were those derived from organisms with significant composition bias either as a result of %GC skew or from a more extreme environmental niche such as is the case for hyperthermophiles, thermophiles and halophiles. Finally, these results indicate that amino acid composition-based scoring functions may be able to identify the taxonomic origin of protein sequences.
In the largest study of its kind, we have identified species-specific amino acid composition differences across the predicted open-reading frames of 100 complete genomes. Continuously updated results are available at http://genome.mshri.on.ca. Our principal components analysis supports the idea that environmental niche is a major factor for the amino acid composition differences found between species. However, our results raise the possibility that this principal component corresponds more to a complex mixture of environmental influences such as pH, pressure, salt and solute concentrations and to some lesser extent, growth temperature [8, 26].
We observed an increased preference for small hydrophobic and charged residues over larger aromatic residues across all species after conservatively modeling 57,840 folds. Moreover, these fold composition biases also illustrate species-specific residue preferences. These biases provided an opportunity for the first time to derive and test simple yet effective species-specific scoring functions. We found that the fold scoring functions are 85 ± 8% effective in detecting a species-specific fold sequence. Moreover, we found that the genomic composition scoring function successfully identified sequences from its parent organism with a surprising 73 ± 9% overall accuracy.
The species-specific composition bias suggests that the variable amino acids are available for structural and/or environmental optimization aspects of proteins. We are currently investigating the usefulness of the species-specific composition-based scoring functions in identifying variable composition regions of protein structures and whether they correspond to structural/functional regions. We are also investigating the possibility of using these scoring functions to find proteins that are non-native to an organism, possibly indicating horizontal transfer. Scoring functions derived from this work can be used in future species-specific protein and fold identification and sequence optimization experiments.
Non-redundant protein sequences determined from each of the complete genomes (Table 1) were obtained from the National Center for Biotechnology Information (NCBI – http://www.ncbi.nlm.nih.gov) via SeqHound, our integrated sequence and structure database manager (http://seqhound.mshri.on.ca). Amino acid compositions were computed using all protein coding regions for each complete genome by software developed in our laboratory. Principal components analysis and the amino acid composition dendrogram were generated using the S-PLUS statistics package. Two-tailed paired t-tests were performed to test the null hypothesis that the ORF vs fold mean compositions for each amino acid were the same. All applications were written in ANSI C using the cross-platform NCBI Toolkit (http://www.ncbi.nlm.nih.gov/IEB) and have been compiled and tested on Windows 98/ME/NT/2000/XP, Mac OS X, Linux, HP-UX, PA-RISC Linux, Compaq Tru64, IRIX, Solaris, QNX, FreeBSD and PowerPC-Linux operating systems. Protein sequence and fold scoring functions are available as additional files.
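The paired t-test on ORF versus fold composition can be illustrated with the standard library alone. The sketch below computes only the t statistic and degrees of freedom (the two-tailed p-value would then come from a t distribution, as in any statistics package); the percentages are invented, with one ORF/fold pair per hypothetical genome.

```python
from math import sqrt
from statistics import mean, stdev

def paired_t(orf_values, fold_values):
    """Paired t statistic for per-genome (ORF, fold) mean compositions.

    Returns (t, degrees_of_freedom). The null hypothesis is that the
    mean of the paired differences is zero.
    """
    diffs = [f - o for o, f in zip(orf_values, fold_values)]
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / sqrt(n))
    return t, n - 1

# Invented mean Asp percentages for four hypothetical genomes.
orf_asp = [5.0, 5.2, 4.8, 5.1]
fold_asp = [5.6, 5.9, 5.3, 5.8]

t_stat, dof = paired_t(orf_asp, fold_asp)
print(round(t_stat, 2), dof)
```

A positive statistic here indicates that the residue is enriched in fold regions relative to whole ORFs across the paired genomes.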
Conservative fold modeling
Domains are the fundamental unit of a polypeptide chain or part of a polypeptide chain that are thought to independently fold into a stable tertiary structure. Since domains are often units of function and different domains of a protein are often associated with different functions, we evaluated sequence alignments on a structural domain-by-domain basis rather than by the global alignment. This provides a conservative framework to evaluate structural alignments.
Sequence and structure
For each protein sequence from a completely sequenced and annotated genome, herein referred to as a genomic sequence, we identified neighbour sequences, that is, sequences in the non-redundant protein sequence database sharing significant levels of similarity (expect value < 0.01) using NBLAST, a cluster-computer variant of BLAST  (Table 1). No efforts were made to minimize a possible bias contributed by paralogous genomic sequences. Neighbour sequences with 3D structures, herein referred to as templates, were identified using SeqHound , in a similar fashion to the NCBI's genome annotation service . These genomic sequences and their corresponding templates are then used to generate hi-fidelity sequence to structure alignments.
Hi-fidelity sequence to structure alignment
We modified the ClustalW software package  to initiate a global alignment of two neighbour sequences using the PAM series substitution matrices and apply position specific gap penalties by virtue of a secondary structure profile. The profile is derived from the structure's annotation information provided by the authors of the published structure as well as from NCBI's vector alignment search tool, VAST . A greater weight is placed when the two sources agree, and this effectively forces gaps into unstructured regions lacking alpha helices and beta strands. To create conservative, fold-based alignments, gaps that were added to the genomic sequence are masked out since there is no correspondence to the structure and gaps inserted into the structure template to accommodate query insertions are eliminated (Figure 8). This gap-handling procedure had no visible effect on composition analyses that are later described.
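The gap-handling step can be sketched as a simple column filter over a pairwise alignment: any column with a gap in either the genomic sequence or the template is dropped, leaving only residue pairs that correspond to structure. The function name and the example alignment are hypothetical.

```python
def conservative_columns(query, template, gap="-"):
    """Keep only alignment columns where both sequences have a residue.

    Columns with a gap in the genomic (query) sequence have no residue to
    model, and columns with a gap in the template are query insertions
    with no structural counterpart; both are removed.
    """
    if len(query) != len(template):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(q, t) for q, t in zip(query, template)
             if q != gap and t != gap]
    return ("".join(q for q, _ in pairs),
            "".join(t for _, t in pairs))

# Hypothetical alignment: one query gap and a two-residue query insertion.
model, templ = conservative_columns("MR-LVAG", "MKTL--G")
print(model, templ)   # MRLG MKLG
```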
To reject poor alignments and enhance the fidelity of the global alignment, both the sequence identity and structural position occupancy are determined over each VAST-identified structural domain. Various threshold levels were tested, although an alignment sequence identity of 25% and domain occupancy of 75% was found to provide an optimal compromise between sensitivity and specificity (data not shown). If less than 25% of the aligned residues are identical or less than 75% of the aligned residues occupy residue positions in the domain, the domain is masked out completely and not used in any further computations (Figure 8). These selection criteria generate relevant domain homologues and provide the ability to discriminate subtle sequence changes that are independent of fold in a statistically observable manner. When an alignment across a domain is found to satisfy the minimum constraints specified above, a structural model is generated for the genomic sequence by virtue of a sequence-to-structure alignment, herein referred to as the model.
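The domain acceptance test can be sketched as follows. The exact definitions are assumptions made for illustration: identity is taken over aligned (non-gap) residue pairs within the domain, and occupancy is the fraction of template domain positions covered by a query residue. A domain is kept only if both minimums are met.

```python
def domain_passes(query, template, min_identity=0.25, min_occupancy=0.75,
                  gap="-"):
    """Apply identity and occupancy thresholds to one aligned domain.

    `query` and `template` are the aligned strings restricted to a single
    structural domain. The definitions of identity and occupancy used here
    are assumptions, not taken from the original software.
    """
    domain_positions = [(q, t) for q, t in zip(query, template) if t != gap]
    aligned = [(q, t) for q, t in domain_positions if q != gap]
    if not aligned:
        return False
    identity = sum(q == t for q, t in aligned) / len(aligned)
    occupancy = len(aligned) / len(domain_positions)
    return identity >= min_identity and occupancy >= min_occupancy

# Hypothetical domains: the first is well covered, the second is not.
print(domain_passes("ACDE-GHIK", "ACDFYGH-K"))   # True
print(domain_passes("AC-------", "ACDEFGHIK"))   # False
```

In the second example the two aligned residues are identical, yet the domain is rejected because only 2 of 9 template positions are occupied, which is why the two criteria are checked jointly.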
Since a genomic sequence may make many models using different templates, only the single best model is selected to minimize sampling bias. The selection criterion emphasizes the use of multi-domain structural models by using a scoring function derived from residue length of the non-masked out aligned domain region(s), the fraction of residues that are identical and the fraction of residues occupying domain positions. The model with the best score is then selected to represent that genomic sequence.
For each representative model, the sequence alignments between the genomic sequence and the template, along with the corresponding secondary structure are written to a database, herein known as the species-specific fold database. This fold database is the source of model and template sequences for determining fold composition and deriving species-specific fold scoring functions.
Our method exploits species-specific optimizations at the sequence level by making accurate structure-based alignments for genomic sequences. We generated models for 95 of the 100 genomes; the remaining 5 genomes were very recent additions. Initially, there are as many 3D neighbours as genomic sequences (Table 1). However, 24 ± 10% of genomic sequences make structural models and only 19 ± 6% settle with a single representative model structure that passes our structural domain alignment criteria. The representative models are 168 ± 10 residues in length and possess 41 ± 5% sequence identity and 96.7 ± 0.3% domain occupancy with the template structure, which is clearly higher than our set minimum requirements. Furthermore, template structures are used 1.4 ± 0.4 times for model building, thereby minimizing structure over-sampling and providing more unique templates. Interestingly, between 36 and 287 different organisms contribute 5.8 ± 3.5 template structures each to a given genome modeling exercise. Of the templates, 30 ± 12% are obtained from E. coli and 23 ± 9% are obtained from thermophilic species. Our models hold properties of 'good' models since they are based on at least 30% sequence identity, are shorter than 200 residues and are aligned along template domains, in agreement with other published criteria.
In general, the number of final models generated for complete genomes reported elsewhere is greater than the number generated with our method. For example, the NCBI provides a substantially larger set of 3D structure neighbours for complete genomes, in which as many as 39% of sequences are reported to have structure neighbours (http://www.ncbi.nlm.nih.gov/PMGifs/Genomes/PDB_bact.html). ModBase has on average between 2 and 4 models per sequence, of which roughly 44% are claimed to be reliable (http://pipe.rockefeller.edu/modbase). Our comparative modeling method is more conservative in that it does not attempt to model side-chains, loops, or regions with no template, and our alignments are evaluated over smaller, domain-focused regions. We therefore expect fewer errors.
Table 1 – Summary statistics for complete genomes modeling and scoring function results. Each species is represented by a short abbreviation (Abbr.), a unique GenBank taxonomy identifier (TaxID), Class (A – Archaea, B – Bacteria, C – Eukaryote) and environmental optimization: hyperthermophile (HT), thermophile (T), halophile (H), acidophile (D), ureaphile (U), radiation resistant (R), intracellular pathogen (I), pathogen (P), solventogenic (S), symbiont (Y) and plant pathogen (PP). The modeling statistics include the number of predicted open reading frames (ORFs), the GC content of the predicted open reading frames (%GC), the number of sequence neighbours with 3D structures (N3D), the number of sequences with a potential to make a model (PM), the number of representative models selected (SM), the percentage of ORFs modeled (OM), the number of unique structure templates (UT), the number of times a template was used (TU), the average identity (%ID) and domain occupancy (%OCC) in sequence-to-structure alignments, and the average number of residues per model (Res). The taxonomic contribution is listed by the number of organisms that contributed template structures (OC), the average number of structures contributed by each (OF), and the percentage of templates that were from E. coli (%E) and thermophiles (%T). Finally, the table lists the percentage of correctly identified sequences in jackknifing for the CF scoring function (CF(JK)) and the percentage of correctly identified sequences using the top 15 scores for the CG scoring function (CG(P)) and for the CF scoring function (CF(P)). NA – not available.
Martin DD, Ciulla RA, Roberts MF: Osmoadaptation in archaea. Appl Environ Microbiol 1999, 65: 1815–25.
Gross M, Jaenicke R: Proteins under pressure. The influence of high hydrostatic pressure on structure, function and assembly of proteins and protein complexes. Eur J Biochem 1994, 221: 617–30.
Vieille C, Zeikus GJ: Hyperthermophilic enzymes: sources, uses, and molecular mechanisms for thermostability. Microbiol Mol Biol Rev 2001, 65: 1–43. 10.1128/MMBR.65.1.1-43.2001
Audia JP, Webb CC, Foster JW: Breaking through the acid barrier: an orchestrated response to proton stress by enteric bacteria. Int J Med Microbiol 2001, 291: 97–106.
May BJ, Zhang Q, Li LL, Paustian ML, Whittam TS, Kapur V: Complete genomic sequence of Pasteurella multocida, Pm70. Proc Natl Acad Sci U S A 2001, 98: 3460–5. 10.1073/pnas.051634598
Oren A: Bioenergetic aspects of halophilism. Microbiol Mol Biol Rev 1999, 63: 334–48.
Baudouin-Cornu P, Surdin-Kerjan Y, Marliere P, Thomas D: Molecular evolution of protein atomic composition. Science 2001, 293: 297–300. 10.1126/science.1061052
Kreil DP, Ouzounis CA: Identification of thermophilic species by the amino acid compositions deduced from their genomes. Nucleic Acids Res 2001, 29: 1608–15. 10.1093/nar/29.7.1608
Nelson KE, Clayton RA, Gill SR, Gwinn ML, Dodson RJ, Haft DH, Hickey EK, Peterson JD, Nelson WC, Ketchum KA, McDonald L, Utterback TR, Malek JA, Linher KD, Garrett MM, Stewart AM, Cotton MD, Pratt MS, Phillips CA, Richardson D, Heidelberg J, Sutton GG, Fleischmann RD, Eisen JA, Fraser CM, et al.: Evidence for lateral gene transfer between Archaea and bacteria from genome sequence of Thermotoga maritima. Nature 1999, 399: 323–9. 10.1038/20601
She Q, Singh RK, Confalonieri F, Zivanovic Y, Allard G, Awayez MJ, Chan-Weiher CC, Clausen IG, Curtis BA, De Moors A, Erauso G, Fletcher C, Gordon PM, Heikamp-de Jong I, Jeffries AC, Kozera CJ, Medina N, Peng X, Thi-Ngoc HP, Redder P, Schenk ME, Theriault C, Tolstrup N, Charlebois RL, Doolittle WF, Duguet M, Gaasterland T, Garrett RA, Ragan MA, Sensen CW, Van der Oost J: The complete genome of the crenarchaeon Sulfolobus solfataricus P2. Proc Natl Acad Sci U S A 2001, 98: 7835–40. 10.1073/pnas.141222098
White O, Eisen JA, Heidelberg JF, Hickey EK, Peterson JD, Dodson RJ, Haft DH, Gwinn ML, Nelson WC, Richardson DL, Moffat KS, Qin H, Jiang L, Pamphile W, Crosby M, Shen M, Vamathevan JJ, Lam P, McDonald L, Utterback T, Zalewski C, Makarova KS, Aravind L, Daly MJ, Fraser CM, et al.: Genome sequence of the radioresistant bacterium Deinococcus radiodurans R1. Science 1999, 286: 1571–7. 10.1126/science.286.5444.1571
Makarova KS, Aravind L, Wolf YI, Tatusov RL, Minton KW, Koonin EV, Daly MJ: Genome of the extremely radiation-resistant bacterium Deinococcus radiodurans viewed from the perspective of comparative genomics. Microbiol Mol Biol Rev 2001, 65: 44–79. 10.1128/MMBR.65.1.44-79.2001
Shigenobu S, Watanabe H, Hattori M, Sakaki Y, Ishikawa H: Genome sequence of the endocellular bacterial symbiont of aphids Buchnera sp. APS. Nature 2000, 407: 81–6. 10.1038/35024074
Goodner B, Hinkle G, Gattung S, Miller N, Blanchard M, Qurollo B, Goldman BS, Cao Y, Askenazi M, Halling C, Mullin L, Houmiel K, Gordon J, Vaudin M, Iartchouk O, Epp A, Liu F, Wollam C, Allinger M, Doughty D, Scott C, Lappas C, Markelz B, Flanagan C, Crowell C, Gurson J, Lomo C, Sear C, Strub G, Cielo C, Slater S: Genome sequence of the plant pathogen and biotechnology agent Agrobacterium tumefaciens C58. Science 2001, 294: 2323–8. 10.1126/science.1066803
Galibert F, Finan TM, Long SR, Puhler A, Abola P, Ampe F, Barloy-Hubler F, Barnett MJ, Becker A, Boistard P, Bothe G, Boutry M, Bowser L, Buhrmester J, Cadieu E, Capela D, Chain P, Cowie A, Davis RW, Dreano S, Federspiel NA, Fisher RF, Gloux S, Godrie T, Goffeau A, Golding B, Gouzy J, Gurjal M, Hernandez-Lucas I, Hong A, Huizar L, Hyman RW, Jones T, Kahn D, Kahn ML, Kalman S, Keating DH, Kiss E, Komp C, Lelaure V, Masuy D, Palm C, Peck MC, Pohl TM, Portetelle D, Purnelle B, Ramsperger U, Surzycki R, Thebault P, Vandenbol M, Vorholter FJ, Weidner S, Wells DH, Wong K, Yeh KC, Batut J: The composite genome of the legume symbiont Sinorhizobium meliloti. Science 2001, 293: 668–72.
Houry WA: Mechanism of substrate recognition by the chaperonin GroEL. Biochem Cell Biol 2001, 79: 569–77. 10.1139/bcb-79-5-569
Kim R, Kim KK, Yokota H, Kim SH: Small heat shock protein of Methanococcus jannaschii, a hyperthermophile. Proc Natl Acad Sci U S A 1998, 95: 9129–33. 10.1073/pnas.95.16.9129
Mogk A, Tomoyasu T, Goloubinoff P, Rudiger S, Roder D, Langen H, Bukau B: Identification of thermolabile Escherichia coli proteins: prevention and reversion of aggregation by DnaK and ClpB. Embo J 1999, 18: 6934–49. 10.1093/emboj/18.24.6934
Kowalski JM, Kelly RM, Konisky J, Clark DS, Wittrup KD: Purification and functional characterization of a chaperone from Methanococcus jannaschii. Syst Appl Microbiol 1998, 21: 173–8.
Bock AK, Glasemacher J, Schmidt R, Schonheit P: Purification and characterization of two extremely thermostable enzymes, phosphate acetyltransferase and acetate kinase, from the hyperthermophilic eubacterium Thermotoga maritima. J Bacteriol 1999, 181: 1861–7.
Russell RJ, Ferguson JM, Hough DW, Danson MJ, Taylor GL: The crystal structure of citrate synthase from the hyperthermophilic archaeon Pyrococcus furiosus at 1.9 A resolution. Biochemistry 1997, 36: 9983–94. 10.1021/bi9705321
Lobry JR: Influence of genomic G+C content on average amino-acid composition of proteins from 59 bacterial species. Gene 1997, 205: 309–16. 10.1016/S0378-1119(97)00403-4
Lynn DJ, Singer GA, Hickey DA: Synonymous codon usage is subject to selection in thermophilic bacteria. Nucleic Acids Res 2002, 30: 4272–7. 10.1093/nar/gkf546
Chakravarty S, Varadarajan R: Elucidation of determinants of protein stability through genome sequence analysis. FEBS Lett 2000, 470: 65–9. 10.1016/S0014-5793(00)01267-9
Chakravarty S, Varadarajan R: Elucidation of factors responsible for enhanced thermal stability of proteins: a structural genomics based study. Biochemistry 2002, 41: 8152–61. 10.1021/bi025523t
Tekaia F, Yeramian E, Dujon B: Amino acid composition of genomes, lifestyles of organisms, and evolutionary trends: a global picture with correspondence analysis. Gene 2002, 297: 51. 10.1016/S0378-1119(02)00871-5
Maes D, Zeelen JP, Thanki N, Beaucamp N, Alvarez M, Thi MH, Backmann J, Martial JA, Wyns L, Jaenicke R, Wierenga RK: The crystal structure of triosephosphate isomerase (TIM) from Thermotoga maritima: a comparative thermostability structural analysis of ten different TIM structures. Proteins 1999, 37: 441–53. 10.1002/(SICI)1097-0134(19991115)37:3<441::AID-PROT11>3.0.CO;2-7
Szilagyi A, Zavodszky P: Structural differences between mesophilic, moderately thermophilic and extremely thermophilic protein subunits: results of a comprehensive survey. Structure Fold Des 2000, 8: 493–504.
Wootton JC, Federhen S: Analysis of compositionally biased regions in sequence databases. Methods Enzymol 1996, 266: 554–71.
Tusnady GE, Simon I: The HMMTOP transmembrane topology prediction server. Bioinformatics 2001, 17: 849–50. 10.1093/bioinformatics/17.9.849
Nielsen H, Engelbrecht J, Brunak S, von Heijne G: Identification of prokaryotic and eukaryotic signal peptides and prediction of their cleavage sites. Protein Eng 1997, 10: 1–6. 10.1093/protein/10.1.1
Lupas A, Van Dyke M, Stock J: Predicting coiled coils from protein sequences. Science 1991, 252: 1162–4.
Rost B, Fariselli P, Casadio R: Topology prediction for helical transmembrane proteins at 86% accuracy. Protein Sci 1996, 5: 1704–18.
Chou KC, Maggiora GM: Domain structural class prediction. Protein Eng 1998, 11: 523–38. 10.1093/protein/11.7.523
Kyte J, Doolittle RF: A simple method for displaying the hydropathic character of a protein. J Mol Biol 1982, 157: 105–32.
Cai YD, Liu XJ, Xu XB, Chou KC: Support vector machines for prediction of protein subcellular location by incorporating quasi-sequence-order effect. J Cell Biochem 2002, 84: 343–8. 10.1002/jcb.10030
Wilkins MR, Pasquali C, Appel RD, Ou K, Golaz O, Sanchez JC, Yan JX, Gooley AA, Hughes G, Humphery-Smith I, Williams KL, Hochstrasser DF: From proteins to proteomes: large scale protein identification by two-dimensional electrophoresis and amino acid analysis. Biotechnology (N Y) 1996, 14: 61–5.
Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 1997, 25: 3389–402. 10.1093/nar/25.17.3389
Burge C, Karlin S: Prediction of complete gene structures in human genomic DNA. J Mol Biol 1997, 268: 78–94. 10.1006/jmbi.1997.0951
Uberbacher EC, Mural RJ: Locating protein-coding regions in human DNA sequences by a multiple sensor-neural network approach. Proc Natl Acad Sci U S A 1991, 88: 11261–5.
Gelfand MS: Prediction of function in DNA sequence analysis. J Comput Biol 1995, 2: 87–115.
Dennis PP, Shimmin LC: Evolutionary divergence and salinity-mediated selection in halophilic archaea. Microbiol Mol Biol Rev 1997, 61: 90–104.
Katinka MD, Duprat S, Cornillot E, Metenier G, Thomarat F, Prensier G, Barbe V, Peyretaillade E, Brottier P, Wincker P, Delbac F, El Alaoui H, Peyret P, Saurin W, Gouy M, Weissenbach J, Vivares CP: Genome sequence and gene compaction of the eukaryote parasite Encephalitozoon cuniculi. Nature 2001, 414: 450–3. 10.1038/35106579
Clarke GD, Beiko RG, Ragan MA, Charlebois RL: Inferring genome trees by using a filter to eliminate phylogenetically discordant sequences and a distance matrix based on mean normalized BLASTP scores. J Bacteriol 2002, 184: 2072–80. 10.1128/JB.184.8.2072-2080.2002
Wolf YI, Rogozin IB, Grishin NV, Tatusov RL, Koonin EV: Genome trees constructed using five different approaches suggest new major bacterial clades. BMC Evol Biol 2001, 1: 8. 10.1186/1471-2148-1-8
Ahern TJ, Klibanov AM: The mechanisms of irreversible enzyme inactivation at 100°C. Science 1985, 228: 1280–4.
Tomazic SJ, Klibanov AM: Mechanisms of irreversible thermal inactivation of Bacillus alpha-amylases. J Biol Chem 1988, 263: 3086–91.
Fukuchi S, Nishikawa K: Protein surface amino acid compositions distinctively differ between thermophilic and mesophilic bacteria. J Mol Biol 2001, 309: 835–43. 10.1006/jmbi.2001.4718
Bryant SH, Lawrence CE: The frequency of ion-pair substructures in proteins is quantitatively related to electrostatic potential: a statistical model for nonbonded interactions. Proteins 1991, 9: 108–19.
Wheeler DL, Church DM, Lash AE, Leipe DD, Madden TL, Pontius JU, Schuler GD, Schriml LM, Tatusova TA, Wagner L, Rapp BA: Database resources of the National Center for Biotechnology Information: 2002 update. Nucleic Acids Res 2002, 30: 13–6. 10.1093/nar/30.1.13
Michalickova K, Bader GD, Dumontier M, Lieu HC, Betel D, Isserlin R, Hogue CW: SeqHound: biological sequence and structure database as a platform for bioinformatics research. BMC Bioinformatics 2002, 3: 32. 10.1186/1471-2105-3-32
Dumontier M, Hogue CWV: NBLAST: a Cluster Variant of BLAST for NxN Comparisons. BMC Bioinformatics 2002, 3: 13. 10.1186/1471-2105-3-13
Wang Y, Bryant S, Tatusov R, Tatusova T: Links from genome proteins to known 3-D structures. Genome Res 2000, 10: 1643–7. 10.1101/gr.143200
Higgins DG, Sharp PM: CLUSTAL: a package for performing multiple sequence alignment on a microcomputer. Gene 1988, 73: 237–44. 10.1016/0378-1119(88)90330-7
Hogue CW, Ohkawa H, Bryant SH: A dynamic look at structures: WWW-Entrez and the Molecular Modeling Database. Trends Biochem Sci 1996, 21: 226–9. 10.1016/0968-0004(96)30017-0
Melo F, Sanchez R, Sali A: Statistical potentials for fold assessment. Protein Sci 2002, 11: 430–48. 10.1110/ps.25502
Marti-Renom MA, Stuart AC, Fiser A, Sanchez R, Melo F, Sali A: Comparative protein structure modeling of genes and genomes. Annu Rev Biophys Biomol Struct 2000, 29: 291–325. 10.1146/annurev.biophys.29.1.291
We would like to thank our colleagues at the Samuel Lunenfeld Research Institute for their support in our work. We would like to thank Gary Bader and Doron Betel for critical reading of the manuscript. We would also like to thank our reviewers for providing excellent feedback that enhanced the quality of our manuscript. This research was supported by grants to C.W.V. Hogue and M. Dumontier by the Natural Sciences and Engineering Research Council of Canada.
KM provided the framework for complete genome analysis with her development of the SeqHound sequence and structure database management system. MD carried out the statistical analysis, derived and tested the scoring functions and drafted the manuscript. MD and CWVH jointly conceived of the study, and participated in its design and coordination.