Species-specific protein sequence and fold optimizations
© Dumontier et al; licensee BioMed Central Ltd. 2002
Received: 11 July 2002
Accepted: 17 December 2002
Published: 17 December 2002
An organism's ability to adapt to its particular environmental niche is of fundamental importance to its survival and proliferation. In the largest study of its kind, we sought to identify and exploit the amino-acid signatures that make species-specific protein adaptation possible across 100 complete genomes.
Environmental niche was determined to be a significant factor in variability from correspondence analysis using the amino acid composition of over 360,000 predicted open reading frames (ORFs) from 17 archaea, 76 bacteria and 7 eukaryote complete genomes. Additionally, amino acid composition clustering revealed clusters of phylogenetically unrelated archaea and bacteria that share similar environments. Composition analyses of conservative, domain-based homology modeling suggested an enrichment of the small hydrophobic residues Ala, Gly and Val and the charged residues Asp, Glu, His and Arg across all genomes. However, the larger aromatic residues Phe, Trp and Tyr are reduced in folds, and these results were not affected by low complexity biases. We derived two simple log-odds scoring functions from ORFs (CG) and folds (CF) for each of the complete genomes. CF achieved an average cross-validation success rate of 85 ± 8% whereas the CG detected 73 ± 9% species-specific sequences when competing against all other non-redundant CG. Continuously updated results are available at http://genome.mshri.on.ca.
Our analysis of amino acid compositions from the complete genomes provides stronger evidence for species-specific and environmental residue preferences in genomic sequences as well as in folds. Scoring functions derived from this work will be useful in future protein engineering experiments and possibly in identifying horizontal transfer events.
An organism may increase its fitness in some range of environmental conditions through evolution. Fundamental to the survival of cells is the ability to modulate fluctuations in external osmotic and atmospheric pressure, temperature and pH via the acquisition or development of advantageous molecular mechanisms [1–4]. These mechanisms include the uptake of small molecules, osmolytes or metals via transporters, as found for the increased iron uptake that allows enhanced growth of Pasteurella multocida and for the accumulation of high concentrations of the stabilizing K+ among halophiles. Other mechanisms include modification of the atomic and residue composition of proteins, or the acquisition of environmental adaptive genes via lateral gene transfer, as was likely the case for the thermophilic bacterium Thermotoga maritima and the archaeon Sulfolobus solfataricus P2. In other cases, gene duplication events augment the ability of an organism to adapt to extreme environments by expanding specific protein families, including additional stress response and damage control genes that provide increased protection for the radiation-resistant bacterium Deinococcus radiodurans [11, 12]. Interestingly, in symbionts such as Buchnera sp. APS, Agrobacterium tumefaciens and Sinorhizobium meliloti, shared genetic material may increase overall fitness, but this effectively results in the loss of redundant genes and imposes host-symbiont dependencies. In other organisms completely new and innovative mechanisms are required for adapting to the most extreme of environments.
In adaptation to the most extreme environments, it is expected that the protein complement also possesses the organism's adaptive property. For instance, hyperthermophilic proteins must not only be functional, but optimized towards the host's extremely hot (>80°C) physical environment. Although in vivo protection factors have been identified that can stabilize proteins in vitro at high temperatures, and chaperone proteins can help refold misfolded proteins and prevent aggregation [16–18], the majority of foreign proteins cloned and expressed in E. coli retain all of the native enzyme's biochemical properties, including proper folding, thermostability and optimal activity consistent with the organism's optimal growth temperatures [19–21]. Thus, it is likely that sequence optimizations are required to ensure protein activity and folding in organisms whose growth conditions might otherwise adversely affect proteins.
Researchers have studied complete or partial genomes using bioinformatics in addition to the traditional comparative sequence-structure and structure-function mutation studies to identify stability factors. Recent studies of complete or partial genomes have identified sequence-based correlations between organisms using amino acid compositions. Lobry demonstrated the correlation between G+C content and codon usage across bacterial sequences, and correlations between G+C content and amino acid composition have been extended to 25 complete genomes. Moreover, codon usage and amino acid preferences for thermophiles are well established and have been extended to complete genomes [23–26]. However, these generalizations do not necessarily agree with comparative sequence-structure studies. Comparative studies often exploit sequence or structure based alignments to determine similarities and differences. Investigation of thermostability factors in triosephosphate isomerase across 10 organisms, ranging from psychrophiles (cold-tolerant) and mesophiles to hyperthermophiles, failed to identify significant correlations of composition with thermostability. Further uncertainty arises from indications that different protein families adapt to temperature conditions by different sets of structural mechanisms. How then to unify amino acid composition preferences with species-specific structural adaptations?
Algorithms have been designed to predict certain protein features primarily from sequence composition, including low complexity regions, transmembrane segments, signal peptides, coiled-coils, secondary structure elements, structural classes, hydrophobicity and sub-cellular location, and have been used to improve remote sequence similarity searching [37, 38]. Moreover, genomic base content has been used to predict open-reading frames and in-site splicing [39–41]. However, no algorithms have been designed to explore adaptation of proteins to their host environment, especially in a species-specific manner.
Species-specific adaptive optimizations might be expected to be subtle and hard to find in any individual sequence, yet sufficiently common across the bulk of genomic proteins that they may be detected using statistical methods. We demonstrate here that such subtle adaptive optimizations do exist in many individual organisms and that these can be extracted. We derive species-specific protein sequence and fold scoring functions from residue preferences found in predicted open reading frames and conservative structural models. The resulting scoring functions are effective at detecting species-specific protein sequences and folds from amino acid composition.
Results and Discussion
Principal Components Analysis
The most significant principal component accounted for 47.5% of the variance and showed a strong correlation to DNA base pair content (94%). The left of this component corresponds to low GC organisms such as Buchnera sp. (~27%), Mpul (~27%), Bbur (29%), Uure (26%) and Wbre (23%), whereas the right of the component corresponds to high GC organisms including Mtub (66%), various plant pathogens (Xanthomonas sp., Mloti), the soil bacterium Scoel (72%) and the radiation-resistant Drad (66.6%) (Figure 1A). Strong correlations also exist between the first component and several of the factor loadings (Figure 1B). The correlated factor loadings have either [G|C] or [A|T] in the first two codon positions for at least one codon. The effect for the standard codon table is that GC rich codons [C|G] [C|G] [X] encode the amino acids Pro, Arg, Gly, Ala and Trp, and GC poor codons [A|T] [A|T] [X] encode Phe, Leu, Ile, Asn, Lys and Tyr (as well as Met and 2 stop codons). This is in agreement with a previous report. Consequently, genomic GC content will to a large extent determine amino acid usage as well as the choice between the small hydrophobic residues Ala/Gly or Ile, the positively charged residues Arg or Lys, and the large hydrophobic residues Trp or Tyr/Phe.
The second largest principal component accounts for 15.5% of the variance and appears to correspond to the environmental niche (Figure 1A). Hyperthermophiles (Mkan, Paby, Pfur, Phor, Aful, Aaeo, Tmar, Tten, and Mjan), the thermophile Mthe, the extreme halophile Halo, thermo-acidophiles (Taci, Tvo, Ssol, Stok), and solventogenic bacteria (Cace, Cper and Fnucl) correspond to component 2 with decreasing strength, from strong to weak in the order listed. Strong correlations to this component exist for Glu and Val, whereas opposite correlations exist for Gln, His, Thr, Ser and Cys, thereby suggesting the preferential usage of Glu and Val by those organisms. A discussion regarding amino acid preferences for hyperthermophiles can be found elsewhere. Eukaryotes (Hsap, Mmus, Scer, Cele and Atha), with the exception of the obligate intracellular eukaryote parasite Ecun, have a strong but opposite correspondence to component 2. These cluster with chlamydias/chlamydophilas (Cmur, Ctra, Cpne), and the inverse correspondence also indicates a significant increase in the genomic amino acid usage of Gln, His, Thr, Ser and Cys, and a decrease of Glu or Val. Interestingly, plant pathogens (Xaxon, Xcamp, Mloti, Rsol, Xfas, AtumC and AtumU), moderate halophiles and alkalophiles (Bsub, Bhal, Linn, Lact, Lmono, Oihey), and most human pathogens do not correspond to this component and have an average composition with regard to these amino acids. The distribution of organisms across this component does not appear to correspond to discrete groupings of organisms that share similar environmental niches, but rather to a 'continuum of lifestyles'. However, unlike previous studies that report correlations of this second principal component with growth temperatures [8, 26], our results seem to indicate that this component is likely to correlate to a more complex phenomenon that incorporates growth temperature as well as other physical factors, possibly pH and solvent.
Components 3 and 4 are also significant factors in this multivariate analysis and account for 10.3% and 7.4% of the variance, respectively. We have not determined a measurable factor that can be directly correlated to these components, but they also appear to correspond to environmental niche. However, we see species-specific preferences for Leu, Cys, Asp, Thr, Ser, and to a lesser extent Glu, Gln, His and Met residues (Figure 1D). Component 3 strongly corresponds to several hyperthermophiles, but inversely corresponds to the extreme halophile Halobacterium, the human pathogen Saur, the gastro-intestinal tract colonizer Blong, and moderate halophiles and alkalophiles. Halobacterium's increased Asp usage is clearly consistent with its adaptation to intracellular and environmental conditions, although it differs from the hyperthermophile preference for the larger, negatively charged Glu. Component 4 has a strong correspondence to the eukaryotes (Ecun, Hsap, Mmus, Atha, Cele) that correlates to Cys and Ser.
Taken together, the results from the principal component analysis suggest that the variation in amino acid usage within and between species is due in large part to environmental conditions.
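As an illustration of the kind of analysis described in this section, the following minimal sketch computes a per-genome amino acid composition vector and extracts the first principal component of a composition matrix by power iteration. It is not the S-PLUS procedure used in this work; the function names, the power-iteration approach and the toy data are all assumptions made for the example.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def composition_vector(sequences):
    """Fraction of each standard residue over all sequences of one genome."""
    counts = {a: 0 for a in AMINO_ACIDS}
    for seq in sequences:
        for ch in seq.upper():
            if ch in counts:
                counts[ch] += 1
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no standard residues found")
    return [counts[a] / total for a in AMINO_ACIDS]

def first_principal_component(rows, iterations=500):
    """Return (loading vector, fraction of variance explained) for PC1.

    rows: one composition vector per genome. Uses power iteration on the
    sample covariance matrix (an illustrative stand-in for a statistics package).
    """
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Non-uniform start vector: compositions sum to 1, so a uniform vector
    # lies in the null space of the covariance matrix.
    v = [float(j + 1) for j in range(d)]
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            raise ValueError("no variance along the starting direction")
        v = [x / norm for x in w]
    eigenvalue = sum(v[i] * sum(cov[i][j] * v[j] for j in range(d))
                     for i in range(d))
    total_variance = sum(cov[i][i] for i in range(d))
    return v, eigenvalue / total_variance

# Toy example with three made-up "residue" columns instead of twenty.
toy_rows = [[0.5, 0.3, 0.2], [0.4, 0.4, 0.2], [0.3, 0.5, 0.2], [0.2, 0.6, 0.2]]
loading, explained = first_principal_component(toy_rows)
```

In the toy matrix all variation lies along a single direction (the first column falls as the second rises), so the first component should account for essentially all of the variance; real genome data would spread variance over several components, as reported above.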
Amino acid composition dendrogram
Fold residue preferences
Composition-based scoring functions
Since there exists significant amino acid variability between protein sequences from different organisms, we sought to generate a scoring function that would allow species-specific identification of protein sequences. Two scoring functions indicating the log probability of amino acid occurrence were generated for each organism. The first scoring function, CG, is based on genomic composition and was derived by taking the log of the amino acid frequency across all genomic ORFs for the given organism over the average amino acid frequency of all the genomes. The second scoring function, CF, was generated from fold composition of the aligned sequences and was derived by taking the log of the amino acid frequencies from the aligned residues of the genomic sequence divided by the template residues. In this fashion the reference state for these scoring functions is what we have termed the 'random organism' since it represents a collection of amino acid compositions from a variety of organisms. This then provides the noise of the scoring function from which we are trying to extract a meaningful, species-specific signal. Log-odds potentials of protein substructures are considered additive, and in the evaluation of a sequence, the overall score for a sequence is calculated from the sum of the species-specific log-odds scores for each of its residues.
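The construction of the CG scoring function can be sketched as follows. This is a minimal illustration under stated assumptions, not the ANSI C implementation used in this work: the function names, the pseudocount (added so that unseen residues do not produce log(0)), the natural logarithm and the toy sequences are all hypothetical. A CF table would be built the same way, with model residue frequencies in the numerator and template residue frequencies in the denominator.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def composition(sequences, pseudocount=1.0):
    """Relative frequency of each standard residue (pseudocount is an assumption)."""
    counts = Counter()
    for seq in sequences:
        counts.update(ch for ch in seq.upper() if ch in AMINO_ACIDS)
    total = sum(counts[a] for a in AMINO_ACIDS) + pseudocount * len(AMINO_ACIDS)
    return {a: (counts[a] + pseudocount) / total for a in AMINO_ACIDS}

def average_composition(freq_tables):
    """Unweighted mean frequency over genomes: the 'random organism' reference."""
    n = len(freq_tables)
    return {a: sum(f[a] for f in freq_tables) / n for a in AMINO_ACIDS}

def log_odds(target_freq, reference_freq):
    """Per-residue log-odds of the target composition against the reference."""
    return {a: math.log(target_freq[a] / reference_freq[a]) for a in AMINO_ACIDS}

def score(sequence, table):
    """Additive score: sum of per-residue log-odds (non-standard residues score 0)."""
    return sum(table.get(ch, 0.0) for ch in sequence.upper())

# Toy genomes (made-up sequences, for illustration only).
genomes = {
    "gc_rich": ["GAPRGAPRAG"],
    "at_rich": ["KNIFYKNIKL"],
}
freqs = {name: composition(seqs) for name, seqs in genomes.items()}
reference = average_composition(list(freqs.values()))
cg_tables = {name: log_odds(f, reference) for name, f in freqs.items()}

# A query sequence is assigned to the organism whose table scores it highest.
query = "GAPR"
best = max(cg_tables, key=lambda name: score(query, cg_tables[name]))
```

Because the score is a plain sum over residues, longer sequences accumulate larger absolute scores; comparisons are therefore meaningful between tables for the same sequence rather than between sequences of different lengths.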
As a preliminary test, we evaluated the performance of the CF scoring functions for their ability to detect folds in a species-specific manner. That is, a successful scoring function should identify fold sequences of the parent taxonomy from which the scoring function was derived. The performance of the scoring function was evaluated via a jackknife method in which 10% of the model-template pairs were excluded in generating the scoring function. These excluded pairs were then scored with the resulting scoring function, and success was achieved when the score obtained from the model fold was greater than that of the template fold. The binary species detection ability of the CF scoring functions to select the model over the template ranged from 65% to 99%, with an average of 85 ± 8% of model sequences being detected from the species-specific fold database (random = 50%). The best CF detections (>95%) were made with scoring functions derived from those organisms found to vary the most in composition, including Mpul (99.4%), Buch, Bbur, Halob, Hpylo, Mjan, Mgen and Uure (96.2%). In contrast, the poorest CF detections were made by common bacteria and pathogen scoring functions from Ecoli variants, Cele, Hsap, Nmeni and Sent. The poor results from these scoring functions reflect the similar model-template composition. In fact, the Ecoli variants obtained ~50% of their template structures from E. coli, Cele obtained ~40% of template structures from human, Hsap obtained ~25% of its structures from mouse and 15% from rat, and Mmus obtained 46% of template structures from human. Excluding, or at least limiting, these structure templates would increase the difference in model-template composition and likely generate a more useful scoring function. Thus, these results indicate the admirable species-specific detection ability of the CF scoring functions on short species-specific domain sequences. Cross-validation was not performed for the CG scoring functions.
In the largest study of its kind, we have identified species-specific amino acid composition differences across the predicted open-reading frames of 100 complete genomes. Continuously updated results are available at http://genome.mshri.on.ca. Our principal components analysis supports the idea that environmental niche is a major factor for the amino acid composition differences found between species. However, our results raise the possibility that the second principal component corresponds more to a complex mixture of environmental influences such as pH, pressure, salt and solute concentrations and to some lesser extent, growth temperature [8, 26].
We observed an increased preference for small hydrophobic and charged residues over larger aromatic residues across all species after conservatively modeling 57,840 folds. Moreover, these fold composition biases also illustrate species-specific residue preferences. These biases provided an opportunity for the first time to derive and test simple yet effective species-specific scoring functions. We found that the fold scoring functions are 85 ± 8% effective in detecting a species-specific fold sequence. Moreover, we found that the genomic composition scoring function successfully identified sequences from its parent organism with a surprising 73 ± 9% overall accuracy.
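The jackknife evaluation of a CF scoring function can be sketched as below. This is an illustration only: a single random 10% hold-out is assumed (whether the exclusion was repeated over several partitions is not stated here), and the function names, pseudocount, random seed and toy model-template pairs are hypothetical.

```python
import math
import random
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def residue_freqs(sequences, pseudocount=1.0):
    """Relative residue frequencies (pseudocount is an assumption)."""
    counts = Counter()
    for seq in sequences:
        counts.update(ch for ch in seq.upper() if ch in AMINO_ACIDS)
    total = sum(counts[a] for a in AMINO_ACIDS) + pseudocount * len(AMINO_ACIDS)
    return {a: (counts[a] + pseudocount) / total for a in AMINO_ACIDS}

def fold_log_odds(pairs):
    """CF-style table: log(model residue frequency / template residue frequency)."""
    model = residue_freqs(m for m, _ in pairs)
    template = residue_freqs(t for _, t in pairs)
    return {a: math.log(model[a] / template[a]) for a in AMINO_ACIDS}

def score(sequence, table):
    """Sum of per-residue log-odds scores."""
    return sum(table.get(ch, 0.0) for ch in sequence.upper())

def jackknife_success_rate(pairs, holdout_fraction=0.10, seed=0):
    """Hold out a fraction of (model, template) pairs, build the table from
    the rest, and count how often the held-out model outscores its template."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    n_holdout = max(1, round(len(shuffled) * holdout_fraction))
    held_out, training = shuffled[:n_holdout], shuffled[n_holdout:]
    table = fold_log_odds(training)
    wins = sum(score(m, table) > score(t, table) for m, t in held_out)
    return wins / len(held_out)

# Toy aligned pairs: models enriched in Glu/Val, templates in Gln/Ser.
toy_pairs = [("EEVVKEEV", "QQSSKQQS")] * 20
rate = jackknife_success_rate(toy_pairs)
```

With a random scoring function the expected success rate is 50%, which is the baseline quoted above; the toy pairs are deliberately extreme, so every held-out model should outscore its template.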
The species-specific composition bias suggests that the variable amino acids are available for structural and/or environmental optimization aspects of proteins. We are currently investigating the usefulness of the species-specific composition-based scoring functions in identifying variable composition regions of protein structures and whether they correspond to structural/functional regions. We are also investigating the possibility of using these scoring functions to find proteins that are non-native to an organism, possibly indicating horizontal transfer. Scoring functions derived from this work can be used in future species-specific protein and fold identification and sequence optimization experiments.
Non-redundant protein sequences determined from each of the complete genomes (Table 1) were obtained from the National Center for Biotechnology Information (NCBI – http://www.ncbi.nlm.nih.gov) via SeqHound, our integrated sequence and structure database manager http://seqhound.mshri.on.ca. Amino acid compositions were computed using all protein coding regions for each complete genome by software developed in our laboratory. Principal components analysis and the amino acid composition dendrogram were generated using the S-PLUS statistics package. Two-tailed paired t-tests were performed to test the null hypothesis that the ORF vs fold mean compositions for each amino acid were the same. All applications were written in ANSI C using the cross-platform NCBI Toolkit http://www.ncbi.nlm.nih.gov/IEB and have been compiled and tested on Windows 98/ME/NT/2000/XP, Mac OS X, Linux, HP-UX, PA-RISC Linux, Compaq Tru64, IRIX, Solaris, QNX, FreeBSD and PowerPC-Linux operating systems. Protein sequence and fold scoring functions are available as additional files.
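For readers unfamiliar with the paired t-test mentioned above, the following minimal sketch computes the test statistic and degrees of freedom for one amino acid, pairing each genome's ORF frequency with its fold frequency. The function name and the numbers are made up for illustration, and the two-tailed p-value would still need to be obtained from a t distribution in a statistics package.

```python
import math
import statistics

def paired_t_statistic(orf_freqs, fold_freqs):
    """Paired t statistic for fold minus ORF frequency of one amino acid.

    Each position in the two lists refers to the same genome.
    Returns (t, degrees_of_freedom).
    """
    if len(orf_freqs) != len(fold_freqs):
        raise ValueError("paired samples must have equal length")
    diffs = [f - o for o, f in zip(orf_freqs, fold_freqs)]
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two pairs")
    mean_diff = statistics.mean(diffs)
    std_err = statistics.stdev(diffs) / math.sqrt(n)  # sample standard error
    return mean_diff / std_err, n - 1

# Hypothetical Ala frequencies for four genomes (ORFs vs. modeled folds).
orf_ala = [0.08, 0.07, 0.09, 0.10]
fold_ala = [0.09, 0.09, 0.10, 0.12]
t_value, dof = paired_t_statistic(orf_ala, fold_ala)
```

A positive statistic under this sign convention indicates enrichment of the residue in folds relative to ORFs; a negative one indicates depletion.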
Conservative fold modeling
Domains are the fundamental unit of a polypeptide chain or part of a polypeptide chain that are thought to independently fold into a stable tertiary structure. Since domains are often units of function and different domains of a protein are often associated with different functions, we evaluated sequence alignments on a structural domain-by-domain basis rather than by the global alignment. This provides a conservative framework to evaluate structural alignments.
Sequence and structure
For each protein sequence from a completely sequenced and annotated genome, herein referred to as a genomic sequence, we identified neighbour sequences, that is, sequences in the non-redundant protein sequence database sharing significant levels of similarity (expect value < 0.01) using NBLAST, a cluster-computer variant of BLAST (Table 1). No efforts were made to minimize a possible bias contributed by paralogous genomic sequences. Neighbour sequences with 3D structures, herein referred to as templates, were identified using SeqHound, in a similar fashion to the NCBI's genome annotation service. These genomic sequences and their corresponding templates are then used to generate hi-fidelity sequence to structure alignments.
Hi-fidelity sequence to structure alignment
To reject poor alignments and enhance the fidelity of the global alignment, both the sequence identity and structural position occupancy are determined over each VAST-identified structural domain. Various threshold levels were tested, although an alignment sequence identity of 25% and domain occupancy of 75% was found to provide an optimal compromise between sensitivity and specificity (data not shown). If less than 25% of the aligned residues are identical or less than 75% of the aligned residues occupy residue positions in the domain, the domain is masked out completely and not used in any further computations (Figure 8). These selection criteria generate relevant domain homologues and provide the ability to discriminate subtle sequence changes that are independent of fold in a statistically observable manner. When an alignment across a domain is found to satisfy the minimum constraints specified above, a structural model is generated for the genomic sequence by virtue of a sequence-to-structure alignment, herein referred to as the model.
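The domain filter can be sketched as follows. This is a hypothetical illustration, not the code used in this work: it assumes that a domain is retained only when both minimum thresholds are met, that identity is measured over columns where both sequences have a residue, and that occupancy is the fraction of template domain positions covered by an aligned genomic residue. The function names and example alignments are made up.

```python
def domain_alignment_stats(query_aln, template_aln, gap="-"):
    """Return (identity, occupancy) for one domain of a pairwise alignment.

    Both strings are the alignment rows restricted to the domain and must
    have equal length; the gap character marks unaligned positions.
    """
    if len(query_aln) != len(template_aln):
        raise ValueError("alignment rows must have equal length")
    domain_len = sum(1 for t in template_aln if t != gap)
    aligned = [(q, t) for q, t in zip(query_aln, template_aln)
               if q != gap and t != gap]
    identical = sum(1 for q, t in aligned if q == t)
    identity = identical / len(aligned) if aligned else 0.0
    occupancy = len(aligned) / domain_len if domain_len else 0.0
    return identity, occupancy

def keep_domain(query_aln, template_aln, min_identity=0.25, min_occupancy=0.75):
    """True if the domain passes both thresholds; otherwise it is masked out."""
    identity, occupancy = domain_alignment_stats(query_aln, template_aln)
    return identity >= min_identity and occupancy >= min_occupancy

# Example: 5 of 7 aligned residues identical, 7 of 8 domain positions occupied.
good = keep_domain("MKV-LAGE", "MRVTLSGE")
# Example: a single aligned residue covers too little of the domain.
poor = keep_domain("M-------", "MRVTLSGE")
```

The second example shows why occupancy is checked alongside identity: a very short aligned stretch can be 100% identical while covering almost none of the domain.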
Since a genomic sequence may make many models using different templates, only the single best model is selected to minimize sampling bias. The selection criterion emphasizes the use of multi-domain structural models by using a scoring function derived from residue length of the non-masked out aligned domain region(s), the fraction of residues that are identical and the fraction of residues occupying domain positions. The model with the best score is then selected to represent that genomic sequence.
For each representative model, the sequence alignments between the genomic sequence and the template, along with the corresponding secondary structure are written to a database, herein known as the species-specific fold database. This fold database is the source of model and template sequences for determining fold composition and deriving species-specific fold scoring functions.
Our method exploits species-specific optimizations at the sequence level by making accurate structure-based alignments for genomic sequences. We generated models for 95 of the 100 genomes, with 5 genomes having been very recent additions. Initially, there are as many 3D neighbours as genomic sequences (Table 1). However, 24 ± 10% of genomic sequences make structural models and only 19 ± 6% settle with a single representative model structure that passes our structural domain alignment criteria. The representative models are 168 ± 10 residues in length and possess 41 ± 5% sequence identity and 96.7 ± 0.3% domain occupancy with the template structure, which is clearly higher than our set minimum requirements. Furthermore, template structures are used 1.4 ± 0.4 times for model building, thereby minimizing structure over-sampling and providing more unique templates. Interestingly, at least 36 to as many as 287 different organisms contribute 5.8 ± 3.5 template structures to each genome modeling exercise. 30 ± 12% of the templates are obtained from E. coli and 23 ± 9% of structures are obtained from thermophilic species. Our models hold properties of 'good' models since they are based on at least 30% sequence identity, are shorter than 200 residues and are aligned along template domains, in agreement with other published criteria.
In general, the number of final models generated for complete genomes reported elsewhere is greater than the number generated with our method. For example, the NCBI provides a substantially larger set of 3D structure neighbours for complete genomes, in which as many as 39% of sequences are reported to have structure neighbours http://www.ncbi.nlm.nih.gov/PMGifs/Genomes/PDB_bact.html. ModBase has on average between 2 and 4 models per sequence, of which roughly 44% are claimed to be reliable http://pipe.rockefeller.edu/modbase. Since our comparative modeling method is more conservative in that it does not attempt to model side-chains, loops, or regions with no template, and our alignments are evaluated over smaller, domain-focused regions, we expect fewer errors.
Table 1 – Summary statistics for complete genomes modeling and scoring function results. Each species is represented by a short abbreviation (Abbr.), a unique GenBank taxonomy identifier (TaxID), Class (A – Archaea, B – Bacteria, C – Eukaryote) and environmental optimization: hyperthermophile (HT), thermophile (T), halophile (H), acidophile (D), ureaphile (U), radiation resistant (R), intracellular pathogen (I), pathogen (P), solventogenic (S), symbiont (Y) and plant pathogen (PP). The modeling statistics include the number of predicted open-reading frames (ORFs), the GC content of the predicted open-reading frames (%GC), the number of sequence neighbours with 3D structures (N3D), the number of sequences with a potential to make a model (PM), the number of representative models selected (SM), the percentage of ORFs modeled (OM), the number of unique structure templates (UT), the number of times a template was used (TU), the average identity (%ID) and domain occupancy (%OCC) in sequence to structure alignments, and the average number of residues per model (Res). The taxonomic contribution is listed by the number of organisms that contributed template structures (OC), the average number of structures contributed by each (OF), and the percentage of templates that were from E. coli (%E) and thermophiles (%T). Finally, the table lists the percentage of correctly identified sequences in jackknifing for the CF scoring function (CF(JK)) and the percentage of correctly identified sequences using the top 15 scores for the CG scoring function (CG(P)) and for the CF scoring function (CF(P)). NA – not available.
We would like to thank our colleagues at the Samuel Lunenfeld Research Institute for their support in our work. We would like to thank Gary Bader and Doron Betel for critical reading of the manuscript. We would also like to thank our reviewers for providing excellent feedback that enhanced the quality of our manuscript. This research was supported by grants to C.W.V. Hogue and M. Dumontier by the Natural Sciences and Engineering Research Council of Canada.
- Martin DD, Ciulla RA, Roberts MF: Osmoadaptation in archaea. Appl Environ Microbiol 1999, 65: 1815–25.PubMed CentralPubMedGoogle Scholar
- Gross M, Jaenicke R: Proteins under pressure. The influence of high hydrostatic pressure on structure, function and assembly of proteins and protein complexes. Eur J Biochem 1994, 221: 617–30.View ArticlePubMedGoogle Scholar
- Vieille C, Zeikus GJ: Hyperthermophilic enzymes: sources, uses, and molecular mechanisms for thermostability. Microbiol Mol Biol Rev 2001, 65: 1–43. 10.1128/MMBR.65.1.1-43.2001PubMed CentralView ArticlePubMedGoogle Scholar
- Audia JP, Webb CC, Foster JW: Breaking through the acid barrier: an orchestrated response to proton stress by enteric bacteria. Int J Med Microbiol 2001, 291: 97–106.View ArticlePubMedGoogle Scholar
- May BJ, Zhang Q, Li LL, Paustian ML, Whittam TS, Kapur V: Complete genomic sequence of Pasteurella multocida, Pm70. Proc Natl Acad Sci U S A 2001, 98: 3460–5. 10.1073/pnas.051634598PubMed CentralView ArticlePubMedGoogle Scholar
- Oren A: Bioenergetic aspects of halophilism. Microbiol Mol Biol Rev 1999, 63: 334–48.PubMed CentralPubMedGoogle Scholar
- Baudouin-Cornu P, Surdin-Kerjan Y, Marliere P, Thomas D: Molecular evolution of protein atomic composition. Science 2001, 293: 297–300. 10.1126/science.1061052View ArticlePubMedGoogle Scholar
- Kreil DP, Ouzounis CA: Identification of thermophilic species by the amino acid compositions deduced from their genomes. Nucleic Acids Res 2001, 29: 1608–15. 10.1093/nar/29.7.1608PubMed CentralView ArticlePubMedGoogle Scholar
- Nelson KE, Clayton RA, Gill SR, Gwinn ML, Dodson RJ, Haft DH, Hickey EK, Peterson JD, Nelson WC, Ketchum KA, McDonald L, Utterback TR, Malek JA, KD Linher, Garrett MM, Stewart AM, Cotton MD, Pratt MS, Phillips CA, D Richardson, Heidelberg J, Sutton GG, Fleischmann RD, Eisen JA, Fraser CM, et al.: Evidence for lateral gene transfer between Archaea and bacteria from genome sequence of Thermotoga maritima. Nature 1999, 399: 323–9. 10.1038/20601View ArticlePubMedGoogle Scholar
- She Q, Singh RK, Confalonieri F, Zivanovic Y, Allard G, Awayez MJ, Chan-Weiher CC, Clausen IG, Curtis BA, De Moors A, Erauso G, Fletcher C, Gordon PM, Heikamp-de Jong I, Jeffries AC, Kozera CJ, Medina N, Peng X, Thi-Ngoc HP, Redder P, Schenk ME, Theriault C, Tolstrup N, Charlebois RL, Doolittle WF, Duguet M, Gaasterland T, Garrett RA, Ragan MA, Sensen CW, Van der Oost J: The complete genome of the crenarchaeon Sulfolobus solfataricus P2. Proc Natl Acad Sci U S A 2001, 98: 7835–40. 10.1073/pnas.141222098PubMed CentralView ArticlePubMedGoogle Scholar
- White O, Eisen JA, Heidelberg JF, Hickey EK, Peterson JD, Dodson RJ, Haft DH, Gwinn ML, Nelson WC, Richardson DL, Moffat KS, Qin H, Jiang L, Pamphile W, Crosby M, Shen M, Vamathevan JJ, Lam P, McDonald L, Utterback T, Zalewski C, Makarova KS, Aravind L, Daly MJ, Fraser CM, et al.: Genome sequence of the radioresistant bacterium Deinococcus radiodurans R1. Science 1999, 286: 1571–7. 10.1126/science.286.5444.1571PubMed CentralView ArticlePubMedGoogle Scholar
- Makarova KS, Aravind L, Wolf YI, Tatusov RL, Minton KW, Koonin EV, Daly MJ: Genome of the extremely radiation-resistant bacterium Deinococcus radiodurans viewed from the perspective of comparative genomics. Microbiol Mol Biol Rev 2001, 65: 44–79. 10.1128/MMBR.65.1.44-79.2001PubMed CentralView ArticlePubMedGoogle Scholar
- Shigenobu S, Watanabe H, Hattori M, Sakaki Y, Ishikawa H: Genome sequence of the endocellular bacterial symbiont of aphids Buchnera sp. APS. Nature 2000, 407: 81–6. 10.1038/35024074View ArticlePubMedGoogle Scholar
- Goodner B, Hinkle G, Gattung S, Miller N, Blanchard M, Qurollo B, Goldman BS, Cao Y, Askenazi M, Halling C, Mullin L, Houmiel K, Gordon J, Vaudin M, Iartchouk O, Epp A, Liu F, Wollam C, Allinger M, Doughty D, Scott C, Lappas C, Markelz B, Flanagan C, Crowell C, Gurson J, Lomo C, Sear C, Strub G, Cielo C, Slater S: Genome sequence of the plant pathogen and biotechnology agent Agrobacterium tumefaciens C58. Science 2001, 294: 2323–8. 10.1126/science.1066803View ArticlePubMedGoogle Scholar
- Galibert F, Finan TM, Long SR, Puhler A, Abola P, Ampe F, Barloy-Hubler F, Barnett MJ, Becker A, Boistard P, Bothe G, Boutry M, Bowser L, Buhrmester J, Cadieu E, Capela D, Chain P, Cowie A, Davis RW, Dreano S, Federspiel NA, Fisher RF, Gloux S, Godrie T, Goffeau A, Golding B, Gouzy J, Gurjal M, Hernandez-Lucas I, Hong A, Huizar L, Hyman RW, Jones T, Kahn D, Kahn ML, Kalman S, Keating DH, Kiss E, Komp C, Lelaure V, Masuy D, Palm C, Peck MC, Pohl TM, Portetelle D, Purnelle B, Ramsperger U, Surzycki R, Thebault P, Vandenbol M, Vorholter FJ, Weidner S, Wells DH, Wong K, Yeh KC, Batut J: The composite genome of the legume symbiont Sinorhizobium meliloti. Science 2001, 293: 668–72.View ArticlePubMedGoogle Scholar
- Houry WA: Mechanism of substrate recognition by the chaperonin GroEL. Biochem Cell Biol 2001, 79: 569–77. 10.1139/bcb-79-5-569View ArticlePubMedGoogle Scholar
- Kim R, Kim KK, Yokota H, Kim SH: Small heat shock protein of Methanococcus jannaschii, a hyperthermophile. Proc Natl Acad Sci U S A 1998, 95: 9129–33. 10.1073/pnas.95.16.9129PubMed CentralView ArticlePubMedGoogle Scholar
- Mogk A, Tomoyasu T, Goloubinoff P, Rudiger S, Roder D, Langen H, Bukau B: Identification of thermolabile Escherichia coli proteins: prevention and reversion of aggregation by DnaK and ClpB. Embo J 1999, 18: 6934–49. 10.1093/emboj/18.24.6934PubMed CentralView ArticlePubMedGoogle Scholar
- Kowalski JM, Kelly RM, Konisky J, Clark DS, Wittrup KD: Purification and functional characterization of a chaperone from Methanococcus jannaschii. Syst Appl Microbiol 1998, 21: 173–8.View ArticlePubMedGoogle Scholar
- Bock AK, Glasemacher J, Schmidt R, Schonheit P: Purification and characterization of two extremely thermostable enzymes, phosphate acetyltransferase and acetate kinase, from the hyperthermophilic eubacterium Thermotoga maritima. J Bacteriol 1999, 181: 1861–7.PubMed CentralPubMedGoogle Scholar
- Russell RJ, Ferguson JM, Hough DW, Danson MJ, Taylor GL: The crystal structure of citrate synthase from the hyperthermophilic archaeon pyrococcus furiosus at 1.9 A resolution. Biochemistry 1997, 36: 9983–94. 10.1021/bi9705321View ArticlePubMedGoogle Scholar
- Lobry JR: Influence of genomic G+C content on average amino-acid composition of proteins from 59 bacterial species. Gene 1997, 205: 309–16. 10.1016/S0378-1119(97)00403-4
- Lynn DJ, Singer GA, Hickey DA: Synonymous codon usage is subject to selection in thermophilic bacteria. Nucleic Acids Res 2002, 30: 4272–7. 10.1093/nar/gkf546
- Chakravarty S, Varadarajan R: Elucidation of determinants of protein stability through genome sequence analysis. FEBS Lett 2000, 470: 65–9. 10.1016/S0014-5793(00)01267-9
- Chakravarty S, Varadarajan R: Elucidation of factors responsible for enhanced thermal stability of proteins: a structural genomics based study. Biochemistry 2002, 41: 8152–61. 10.1021/bi025523t
- Tekaia F, Yeramian E, Dujon B: Amino acid composition of genomes, lifestyles of organisms, and evolutionary trends: a global picture with correspondence analysis. Gene 2002, 297: 51. 10.1016/S0378-1119(02)00871-5
- Maes D, Zeelen JP, Thanki N, Beaucamp N, Alvarez M, Thi MH, Backmann J, Martial JA, Wyns L, Jaenicke R, Wierenga RK: The crystal structure of triosephosphate isomerase (TIM) from Thermotoga maritima: a comparative thermostability structural analysis of ten different TIM structures. Proteins 1999, 37: 441–53. 10.1002/(SICI)1097-0134(19991115)37:3<441::AID-PROT11>3.0.CO;2-7
- Szilagyi A, Zavodszky P: Structural differences between mesophilic, moderately thermophilic and extremely thermophilic protein subunits: results of a comprehensive survey. Structure Fold Des 2000, 8: 493–504.
- Wootton JC, Federhen S: Analysis of compositionally biased regions in sequence databases. Methods Enzymol 1996, 266: 554–71.
- Tusnady GE, Simon I: The HMMTOP transmembrane topology prediction server. Bioinformatics 2001, 17: 849–50. 10.1093/bioinformatics/17.9.849
- Nielsen H, Engelbrecht J, Brunak S, von Heijne G: Identification of prokaryotic and eukaryotic signal peptides and prediction of their cleavage sites. Protein Eng 1997, 10: 1–6. 10.1093/protein/10.1.1
- Lupas A, Van Dyke M, Stock J: Predicting coiled coils from protein sequences. Science 1991, 252: 1162–4.
- Rost B, Fariselli P, Casadio R: Topology prediction for helical transmembrane proteins at 86% accuracy. Protein Sci 1996, 5: 1704–18.
- Chou KC, Maggiora GM: Domain structural class prediction. Protein Eng 1998, 11: 523–38. 10.1093/protein/11.7.523
- Kyte J, Doolittle RF: A simple method for displaying the hydropathic character of a protein. J Mol Biol 1982, 157: 105–32.
- Cai YD, Liu XJ, Xu XB, Chou KC: Support vector machines for prediction of protein subcellular location by incorporating quasi-sequence-order effect. J Cell Biochem 2002, 84: 343–8. 10.1002/jcb.10030
- Wilkins MR, Pasquali C, Appel RD, Ou K, Golaz O, Sanchez JC, Yan JX, Gooley AA, Hughes G, Humphery-Smith I, Williams KL, Hochstrasser DF: From proteins to proteomes: large scale protein identification by two-dimensional electrophoresis and amino acid analysis. Biotechnology (N Y) 1996, 14: 61–5.
- Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 1997, 25: 3389–402. 10.1093/nar/25.17.3389
- Burge C, Karlin S: Prediction of complete gene structures in human genomic DNA. J Mol Biol 1997, 268: 78–94. 10.1006/jmbi.1997.0951
- Uberbacher EC, Mural RJ: Locating protein-coding regions in human DNA sequences by a multiple sensor-neural network approach. Proc Natl Acad Sci U S A 1991, 88: 11261–5.
- Gelfand MS: Prediction of function in DNA sequence analysis. J Comput Biol 1995, 2: 87–115.
- Dennis PP, Shimmin LC: Evolutionary divergence and salinity-mediated selection in halophilic archaea. Microbiol Mol Biol Rev 1997, 61: 90–104.
- Katinka MD, Duprat S, Cornillot E, Metenier G, Thomarat F, Prensier G, Barbe V, Peyretaillade E, Brottier P, Wincker P, Delbac F, El Alaoui H, Peyret P, Saurin W, Gouy M, Weissenbach J, Vivares CP: Genome sequence and gene compaction of the eukaryote parasite Encephalitozoon cuniculi. Nature 2001, 414: 450–3. 10.1038/35106579
- Clarke GD, Beiko RG, Ragan MA, Charlebois RL: Inferring genome trees by using a filter to eliminate phylogenetically discordant sequences and a distance matrix based on mean normalized BLASTP scores. J Bacteriol 2002, 184: 2072–80. 10.1128/JB.184.8.2072-2080.2002
- Wolf YI, Rogozin IB, Grishin NV, Tatusov RL, Koonin EV: Genome trees constructed using five different approaches suggest new major bacterial clades. BMC Evol Biol 2001, 1: 8. 10.1186/1471-2148-1-8
- Ahern TJ, Klibanov AM: The mechanisms of irreversible enzyme inactivation at 100°C. Science 1985, 228: 1280–4.
- Tomazic SJ, Klibanov AM: Mechanisms of irreversible thermal inactivation of Bacillus alpha-amylases. J Biol Chem 1988, 263: 3086–91.
- Fukuchi S, Nishikawa K: Protein surface amino acid compositions distinctively differ between thermophilic and mesophilic bacteria. J Mol Biol 2001, 309: 835–43. 10.1006/jmbi.2001.4718
- Bryant SH, Lawrence CE: The frequency of ion-pair substructures in proteins is quantitatively related to electrostatic potential: a statistical model for nonbonded interactions. Proteins 1991, 9: 108–19.
- Wheeler DL, Church DM, Lash AE, Leipe DD, Madden TL, Pontius JU, Schuler GD, Schriml LM, Tatusova TA, Wagner L, Rapp BA: Database resources of the National Center for Biotechnology Information: 2002 update. Nucleic Acids Res 2002, 30: 13–6. 10.1093/nar/30.1.13
- Michalickova K, Bader GD, Dumontier M, Lieu HC, Betel D, Isserlin R, Hogue CW: SeqHound: biological sequence and structure database as a platform for bioinformatics research. BMC Bioinformatics 2002, 3: 32. 10.1186/1471-2105-3-32
- Dumontier M, Hogue CWV: NBLAST: a Cluster Variant of BLAST for NxN Comparisons. BMC Bioinformatics 2002, 3: 13. 10.1186/1471-2105-3-13
- Wang Y, Bryant S, Tatusov R, Tatusova T: Links from genome proteins to known 3-D structures. Genome Res 2000, 10: 1643–7. 10.1101/gr.143200
- Higgins DG, Sharp PM: CLUSTAL: a package for performing multiple sequence alignment on a microcomputer. Gene 1988, 73: 237–44. 10.1016/0378-1119(88)90330-7
- Hogue CW, Ohkawa H, Bryant SH: A dynamic look at structures: WWW-Entrez and the Molecular Modeling Database. Trends Biochem Sci 1996, 21: 226–9. 10.1016/0968-0004(96)30017-0
- Melo F, Sanchez R, Sali A: Statistical potentials for fold assessment. Protein Sci 2002, 11: 430–48. 10.1110/ps.25502
- Marti-Renom MA, Stuart AC, Fiser A, Sanchez R, Melo F, Sali A: Comparative protein structure modeling of genes and genomes. Annu Rev Biophys Biomol Struct 2000, 29: 291–325. 10.1146/annurev.biophys.29.1.291
This article is published under license to BioMed Central Ltd. This is an Open Access article: verbatim copying and redistribution of this article are permitted in all media for any purpose, provided this notice is preserved along with the article's original URL.