Insertions and the emergence of novel protein structure: a structure-based phylogenetic study of insertions
© Jiang and Blouin; licensee BioMed Central Ltd. 2007
Received: 30 August 2007
Accepted: 15 November 2007
Published: 15 November 2007
In protein evolution, the mechanism by which novel protein domains emerge is still an open question. The incremental growth of protein variable regions, produced by stochastic insertions, has the potential to generate large and complex sub-structures. In this study, a deterministic methodology is proposed to reconstruct phylogenies from protein structures and to infer insertion events in protein evolution. The analysis was performed on a broad range of SCOP domain families.
Phylogenies were reconstructed from protein 3D structural data. The phylogenetic trees were used to infer ancestral structures with a consensus method. From these ancestral reconstructions, 42.7% of the observed insertions are nested insertions, which are located within earlier insert regions. The average size of inserts tends to increase with the insert rank, that is, with the total number of insertions in the variable region. We found that the structures of some nested inserts show complex or even domain-like fold patterns with helices, strands and loops. Furthermore, a basal level of structural innovation was found in inserts which displayed a significant structural similarity exclusively to themselves. The β-Lactamase/D-ala carboxypeptidase domain family is provided as an example to illustrate the inference of insertion events, and how the incremental growth of a variable region is capable of generating novel structural patterns.
We proposed a method to reconstruct phylogenies from 3D structural data. We applied the method to reconstruct the sequences of insertion events leading to the emergence of potentially novel structural elements within existing protein domains. The results suggest that structural innovation is possible via the stochastic process of insertions and rapid evolution within variable regions where inserts tend to be nested. We also demonstrate that the structure-based phylogeny enables the study of new questions relating to the evolution of protein domains and biological function.
The majority of protein folds descend from a relatively small set of ancestral domains through divergent evolution [1–4]. The mechanism by which new structures emerge or evolve from existing proteins is still an open question. Unlike sequence evolution, drift in the core of a domain structure is unlikely to leave the protein stable and functional. Therefore, it is reasonable to postulate that structural innovation is more likely to be the result of evolution at the periphery of the conserved core of domains. A recent study of indels in protein sequences found that the fraction of domains and individual unaligned regions increasing in size is almost twofold larger than the fraction decreasing in size. Ancient domain families show a bias towards insertions in variable regions, which grow in size. In comparison with deletions, a succession of insertions and rapid evolution appears to be a reasonable process that could lead to the emergence of novel protein architectures [2, 6, 7]. Determining the extent and the mechanism of emergence of protein structure is difficult because observations are limited to extant protein folds. The evolution of sub-structures over time must then be inferred from this limited information.
Because of a high degree of inter-residue dependence, protein structures evolve much more slowly than their sequences. These structural constraints are relaxed in some parts of protein structures; these are typically surface loops, where mutations, insertions, and deletions can occur with lesser consequences to the biological function for which a gene is selected. The probability of observing insertions and deletions in a pairwise alignment of protein sequences has been found to correlate with their evolutionary distance [9, 10]. A study of the structural similarity of loop regions in homologous proteins also observed a linear correlation between sequence and structural similarities. These observations are consistent with the assumption that insertion and deletion events are continuous processes that parallel the better characterized process of substitution.
Incremental change in loops through insertion or deletion is a possible mechanism for generating new polypeptide folds. The proposed model of emergence assumes that regions with faster evolutionary rates, such as surface loops, are the most probable locations for the occurrence of rare events. As proteins evolve, insertions can accumulate in these loop regions without affecting the folding of the core structure. Unless an insert is eliminated by purifying selection, multiple and nested insertions will make surface loops appear to grow over time. These variable regions have the ability to explore the conformational space and thus acquire novel sub-structures. Subsequently, a fraction of the novel structural features generated via this process can be positively selected, and eventually become independently folding units (e.g. domains).
The key element leading to structural emergence under the proposed model is thus the extension of surface loops into nascent substructures. Testing this model requires inferring the sequence of events (phylogenies) leading to structural variability in protein structures. A series of efficient and robust tools was developed to produce structure-based phylogenies of protein domains and high quality structure-based multiple sequence alignments. Phylogenetics usually relies on the signal in biological sequences to infer the evolution of a gene. The tertiary structure of a protein evolves much more slowly than its sequence, and potentially contains a phylogenetic signal which is likely to persist beyond the timeframe where the sequence signal becomes saturated.
The structure-based phylogenetic method utilizes a distance measure Q_H [12, 13] to compute trees using the Neighbor-Joining algorithm. We also developed a method to build the structure-based multiple sequence alignments. This tool is derived from a multiple sequence alignment method proposed by Casbon and Saqi, which generates a multiple structure-based alignment by running T-Coffee to perform hierarchical alignment using information from the pairwise structural alignments. In this work, we used the application Flexible structure AlignmenT by Chaining Aligned fragment pairs allowing Twists (FATCAT) to produce the pairwise alignments. According to the results presented in this work, the trees inferred with Q_H distances are consistent with the results of sequence-based methods. The structure-based phylogenetic method is efficient and robust. Because the sequence identity amongst domain families is often low, the structure-based phylogenies are also more suitable for this study than sequence-based phylogenetic methods.
This work used the structure-based phylogenies to infer a possible sequence of insertion events leading to the extant domain structures. The objective was to assess whether complex and novel protein structures can arise through the incremental growth of variable regions in protein domains. The analyses were performed on a large test set of homologous proteins built from the Structural Classification of Proteins (SCOP) database. The study revealed that a large portion of insertions are bounded by earlier insertions in the variable regions of protein domains. We demonstrate that the average size of inserts created by nested insertions is substantially larger than that of block inserts. We analyzed the conformations of inserts, and found that some structures of nested inserts show complex or even domain-like fold patterns including helices, strands and loops. The β-Lactamase/D-ala carboxypeptidase domain family was used as an example to illustrate the inference of insertion events, and how the incremental growth of a variable region is capable of generating novel structures.
Statistics of insertions
Table: Statistics of insertions and nested insertions in the test set. Columns: N_SF, N_FA, N_domain, N_FA_NI, N_I, N_NI.
Our method detected 7356 insert-containing regions, comprising 5555 block inserts and 1801 nested inserts. A total of 9691 insertions with size >1 were observed in the 7356 inserts, of which 5555 are block insertions and 4136 are nested insertions. A total of 211 out of 447 families contained lineages with nested insertions. Consequently, our analysis of SCOP family test sets found that 24.5% of all insert-containing regions have at least one nested insertion, while the remainder are inserts that appear to have been inserted in a single step (block insertion). Overall, 42.7% of the inferred insertion events are nested into an existing insert region. The addition of more homologous domains in a given family is likely to divide some of the block insertions into nested insertions. This, in turn, will increase the number of observations of nested insertions, although it is impossible to determine how much this would affect the net proportion of these two types of inserts.
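The proportions quoted above follow directly from the reported counts. The short Python check below recomputes them; the variable names are ours and only the counts come from the text.

```python
# Counts reported in the text.
insert_regions = 7356      # insert-containing regions
nested_regions = 1801      # regions with at least one nested insertion
block_insertions = 5555    # insertions inferred as a single step
nested_insertions = 4136   # insertions nested into an earlier insert
families_total = 447       # families with at least 3 domains
families_nested = 211      # families with lineages containing nested insertions

total_insertions = block_insertions + nested_insertions

# Fraction of insert-containing regions with at least one nested insertion.
pct_nested_regions = 100.0 * nested_regions / insert_regions
# Fraction of inferred insertion events that are nested.
pct_nested_events = 100.0 * nested_insertions / total_insertions
# Fraction of families in which nested insertions were observed.
pct_families = 100.0 * families_nested / families_total

print(f"{pct_nested_regions:.1f}% of regions, "
      f"{pct_nested_events:.1f}% of events, "
      f"{pct_families:.1f}% of families")
```

The block and nested region counts sum to the 7356 regions, and the block and nested insertion counts sum to the 9691 insertions, so the two percentages are computed over different denominators.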
Insertions produce complex sub-structure
Properties of the complex block inserts and the complex nested Inserts
It is much more difficult to demonstrate that existing domains have emerged via this mechanism. Evidence that a given domain in a gene is significantly similar to an insert-nesting variable region in an unrelated gene would constitute a demonstration that these novel folds can be propagated through the proteome. However, it is fair to assume that most common domains are relatively ancient and that structural intermediates are thus unlikely to exist to this day. The analysis of distant but evolutionarily related protein folds may provide further insight into this possibility.
An example of the growth of variable region via stepwise nested insertions
Phylogenetic consistency of signal
EstB is structurally homologous to β-lactamase, but shows no β-lactamase activity even though the nature and arrangement of active-site residues are very similar between EstB and the homologous β-lactamase. Modeling studies suggested that steric factors account for the enzyme's selectivity for ester hydrolysis versus β-lactam cleavage. One of the steric factors comes from the nested insert. The insert hairpin covers part of the active site entrance in EstB, which may affect the enzyme's selectivity by narrowing the access path to the active site tunnel. The stepwise nested insertions observed in the evolution of EstB thus illustrate how such insertions can create a novel, complex sub-structure and affect the function of a protein.
Applicability of the structure-based phylogenetic method
For sequence-based phylogeny, manual editing of alignments is necessary to remove variable/gapped regions, and most methods cannot provide a reliable tree when the similarity amongst sequences is low. In contrast with sequence phylogenies, the substructures of a protein which are not suitable for sequence-based phylogeny actually constitute a source of phylogenetic signal. The structure-based phylogeny is thus expected to be more robust than sequence-based methods when domains in a family have low to very-low sequence similarities and regions which cannot be unambiguously aligned.
The SCOP database contains families with low sequence identity (<30%) despite having very close structures and functions. These families are appropriate for tertiary structure phylogeny because more distant homologs usually have more insertions and structural variations. Several works have utilized structure-based methods to study these SCOP families with low sequence homology, including the β-Lactamase/D-ala carboxypeptidase family, the Class II aminoacyl-tRNA synthetase (aaRS)-like family, and the short-chain alcohol dehydrogenases family.
The structure-based distance metrics used in this work provide a reasonable estimate of phylogenetic distance. Several protein structural distance measures, including Root Mean Square Distance (RMSD), the Hausdorff distance of loops [11, 24], and Q_H [12, 13], have been applied to phylogenetic analysis. The distance metric Q_H adopted in this work, which considers the differences of the aligned segments and the non-aligned gap regions simultaneously, has been successfully applied to study the evolution of structures in aminoacyl-tRNA synthetases, and has been built into the molecular modeling software VMD since version 1.8.3. Based on previous work and our results, the topologies of trees inferred with the Q_H distance are generally consistent with the results of sequence-based methods. However, it is important to note that consistency with sequence-based phylogeny in difficult cases does not constitute definitive proof of the suitability of Q_H for capturing phylogenetic distances.
Effect of flexible structural alignment
Many programs for the alignment of protein structures have been developed. Most structural alignment methods, such as CE and DALI, assume that protein molecules are rigid bodies. This assumption is made despite the knowledge that many proteins are flexible molecules. The validity of the rigid body assumption is even more questionable for variable regions, since they are usually more flexible than the conserved core of the structure. When flexible molecules in different conformations are compared to each other as rigid bodies, even strong structural similarities can be missed. Several flexible structural alignment algorithms, including FlexProt and FATCAT, have been developed to solve this problem. In this study, the flexible structural alignment method FATCAT was used to generate the pairwise structural alignments.
Accuracy of structure-based multiple sequence alignment
The accuracy of the structure-based multiple sequence alignments produced with the alignment tool developed for this study was tested against the difficult benchmark SABmark 1.65. The average developer scores of our method for the superfamily and twilight sets in SABmark are 82.6 and 64.3, respectively. The average developer scores of T-Coffee are 57.7 and 27.4, respectively. The accuracy of the method proposed in this paper is thus sufficient to perform subsequent phylogenetic analyses on more difficult, sequence-based problems. The effect of using more accurate alignment methods on the phylogeny of distantly related sequences will be tested in the future.
There are a few databases providing protein structure-based alignments for homologous families, including HOMSTRAD and PALI. The structure alignment methods used by HOMSTRAD and PALI were COMPARER [32, 33] and STAMP, respectively. In general, structure-based sequence alignment can improve the accuracy of sequence alignment, especially when sequences are distantly related. Both COMPARER and STAMP are rigid-body superposition programs. In this study, the multiple structure-based alignments of homologous families were produced with the flexible alignment algorithm FATCAT. As discussed in the "Effect of flexible structural alignment" section, the flexible alignment method is very important for improving the accuracy of alignments.
The signal provided by the tertiary structure of protein domains was used to infer phylogenies. It is important to contrast the nature of sequence and structural signals. The signal from structures is generated by the similarity of the shared regions (core) and the presence and magnitude of differences of their variable regions (loops). It is unclear whether the signal provided by the distance between the structures of the conserved cores reflects evolutionary distances. As cores are assumed to remain constant over time, the difference between sequences may be restricted to the presence and structure of the variable regions. An important caveat is that our reconstruction of sequences of insertions did not account for the influence of deletions. A deletion in one lineage would appear in this context as simultaneous insertion events in all other lineages. This would systematically increase the distance to all other related structures that conserve the deleted segment. The effect of deletions on phylogenies would, therefore, be to bias this lineage toward the root of the tree.
It is clear that the underlying processes of sequence and structure evolution are different: homologous positions in sequences are subjected to a substitution process while their geometry is assumed to remain constant. The relative stability of structure over time, however, suggests that some tertiary structures contain phylogenetic signals that persist beyond the saturation of the substitution process of their coding sequence. A consequence is that the 3D signal can be used to tackle questions spanning evolutionary depths that were unreachable until now.
One of these questions is whether a parsimonious process of innovation can be inferred from existing domain structures. As structural innovation is expected to be infrequent and to span evolutionary distances which are not tractable at the sequence level, we constructed tertiary structure-based phylogenies. We reported that inserts which appear to have evolved iteratively are longer and more complex, and that some appear to be novel. For this reason, we propose that the methodology presented in this work is an early step toward using tertiary structure phylogenetic analysis to study the evolution of structures and functional diversification.
Construction of test set
The test set of protein domain families was built from the ASTRAL SCOP 1.69 domain subset (<95% sequence homology). The domains were sampled from the first six SCOP classes, including all alpha proteins, all beta proteins, alpha and beta proteins (a/b), alpha and beta proteins (a+b), multi-domain proteins, and the class of membrane and cell surface proteins and peptides. The other classes, which either contain smaller peptides or are not true classes, were excluded. The protein structure files in the domain list were then downloaded from the Protein Data Bank. Domain structures were produced by extracting the polypeptide chains according to the domain definition of SCOP. When SCOP defined a domain as discontinuous on one chain, the segments lying between the domain regions were kept.
Furthermore, domains with missing residues were excluded. In the final domain structure file, the amino acid sequence in the SEQRES records linearly corresponds to the ATOM records. To generate a rooted phylogenetic tree for the quantitative study of insertion history, a final filtering was performed on the domain set by selecting superfamilies that contained at least two families and at least one family with three or more domains. The immunoglobulin superfamily, which has 782 domains, was discarded because it is difficult to generate a multiple sequence alignment for such a large number of domains. After the filtering, the test set includes 3716 domains belonging to 222 superfamilies and 975 families, of which 447 families have at least 3 domains.
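The final filtering step can be sketched in a few lines of Python. This is a minimal illustration under our own assumptions, not the code used in the study: the nested-dictionary layout, the function name `select_superfamilies` and the toy identifiers are hypothetical, and the removal of domains with missing residues and of the immunoglobulin superfamily is assumed to have been done beforehand.

```python
def select_superfamilies(superfamilies, min_families=2, min_domains=3):
    """Keep superfamilies with >= min_families families, at least one of
    which has >= min_domains domains.

    superfamilies: dict mapping superfamily id -> dict mapping
    family id -> list of domain ids (hypothetical layout).
    """
    selected = {}
    for sf_id, families in superfamilies.items():
        if len(families) < min_families:
            continue  # no adjacent family available to supply an outgroup
        if not any(len(domains) >= min_domains for domains in families.values()):
            continue  # no family large enough to analyze
        selected[sf_id] = families
    return selected


# Toy input with made-up identifiers.
toy = {
    "sf1": {"fa1": ["d1", "d2", "d3"], "fa2": ["d4"]},   # kept
    "sf2": {"fa3": ["d5", "d6", "d7", "d8"]},            # one family only
    "sf3": {"fa4": ["d9", "d10"], "fa5": ["d11"]},       # families too small
}
kept = select_superfamilies(toy)
print(sorted(kept))
```

Requiring two families per superfamily guarantees that every analyzed family has an adjacent family from which an outgroup can be drawn.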
Flexible structural alignment was performed on every pair of domains in the superfamily using FATCAT. From these pairs of aligned structures, a pairwise distance matrix was derived for each family using the structural distance measure Q_H [12, 13], which considers the structural distances of both the aligned core structures and the unaligned variable regions. The structure-based tree was then generated using the Neighbor-Joining algorithm. To root the phylogenetic tree of each family, an outgroup domain was chosen from an adjacent structural family within the same superfamily.
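For readers unfamiliar with the tree-building step, the following is a compact Python sketch of the standard Neighbor-Joining procedure applied to a pairwise distance matrix. It is an illustration only: the function name and the toy distances are ours, the distances are not Q_H values, and rooting with an outgroup is not shown.

```python
def neighbor_joining(names, dist):
    """Return a Newick string for the unrooted NJ tree.

    names: list of taxon labels; dist: symmetric matrix (list of lists)
    of pairwise distances in the same order.
    """
    # Active nodes are stored as Newick fragments; d is a dict-of-dicts table.
    nodes = list(names)
    d = {a: {b: dist[i][j] for j, b in enumerate(names)}
         for i, a in enumerate(names)}
    while len(nodes) > 2:
        n = len(nodes)
        # Net divergence of each active node.
        r = {a: sum(d[a][b] for b in nodes if b != a) for a in nodes}
        # Choose the pair minimizing Q(a, b) = (n - 2) d(a, b) - r(a) - r(b).
        best = None
        for i in range(n):
            for j in range(i + 1, n):
                a, b = nodes[i], nodes[j]
                q = (n - 2) * d[a][b] - r[a] - r[b]
                if best is None or q < best[0]:
                    best = (q, a, b)
        _, a, b = best
        # Branch lengths from the joined pair to the new internal node.
        len_a = 0.5 * d[a][b] + (r[a] - r[b]) / (2 * (n - 2))
        len_b = d[a][b] - len_a
        new = f"({a}:{len_a:.3f},{b}:{len_b:.3f})"
        # Distances from the new node to every remaining node.
        d[new] = {}
        for c in nodes:
            if c not in (a, b):
                d[new][c] = d[c][new] = 0.5 * (d[a][c] + d[b][c] - d[a][b])
        nodes = [c for c in nodes if c not in (a, b)] + [new]
    a, b = nodes
    return f"({a}:{d[a][b]:.3f},{b});"


# Toy additive distances (not Q_H values) for four hypothetical domains.
labels = ["A", "B", "C", "D"]
matrix = [
    [0, 3, 9, 10],
    [3, 0, 10, 11],
    [9, 10, 0, 7],
    [10, 11, 7, 0],
]
tree = neighbor_joining(labels, matrix)
print(tree)  # ((A:1.000,B:2.000):5.000,(C:3.000,D:4.000));
```

Because the toy matrix is additive, the procedure recovers the generating tree exactly; with real structural distances the branch lengths are only estimates.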
Structure-based multiple alignment
To produce an accurate multiple sequence alignment of a domain family for pinpointing insertion events, the structure-based multiple sequence alignment of each family was constructed with a method similar to the one proposed by Casbon and Saqi for building S4, a database of structure-based sequence alignments of SCOP superfamilies. This method generates a high quality multiple structure-based alignment by running T-Coffee to perform hierarchical alignment using information from the pairwise structural alignments. T-Coffee has been successfully applied to incorporate sequence and structural information in building structure-based multiple sequence alignments [15, 37, 38]. In order to align the flexible regions of domains together, we used FATCAT instead of SAP. Our method consists of the following steps: 1. Run FATCAT to generate a pairwise structural alignment for each pair of domains in a family; 2. Generate a T-Coffee library file from the pairwise structural alignments using the formula introduced by Casbon and Saqi. A T-Coffee library file consists of the weights of equivalent residues in each pairwise structural alignment of all pairs of domains; 3. Run T-Coffee (version 2.66) to produce a multiple sequence alignment from the library file.
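The core of step 2 is the conversion of each pairwise structural alignment into a list of equivalent residue pairs with weights. The Python sketch below illustrates that conversion only. The function name, the use of 1-based residue indices and the constant weight are our assumptions: the weighting formula of Casbon and Saqi is not reproduced here, and the output is a plain list of tuples rather than an actual T-Coffee library file.

```python
def equivalent_pairs(aln_a, aln_b, weight=100):
    """List the residue pairs aligned to each other in a pairwise alignment.

    aln_a, aln_b: gapped sequences of equal length ('-' marks a gap).
    Returns tuples (index_in_a, index_in_b, weight) with 1-based indices.
    The constant weight is a placeholder, not the published formula.
    """
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have the same length")
    pairs = []
    i = j = 0  # residues consumed so far in each sequence
    for x, y in zip(aln_a, aln_b):
        if x != "-":
            i += 1
        if y != "-":
            j += 1
        if x != "-" and y != "-":
            pairs.append((i, j, weight))
    return pairs


# Toy alignment: residue 2 of the first sequence and residue 2 of the
# second are each aligned to a gap, so they contribute no pair.
pairs = equivalent_pairs("AC-DE", "A-GDE")
print(pairs)  # [(1, 1, 100), (3, 3, 100), (4, 4, 100)]
```

Residues facing a gap, which is where inserts show up, produce no equivalence and therefore place no constraint on the multiple alignment.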
Detect insertions and nested insertions
The insert rank is a value that defines the nesting depth of insertions in a variable region. The insert rank of a nested insert is greater than 1. The insert rank of the nested insert var3 in Figure 8(b) is 3, which indicates that three insertions are observed. For the non-nested inserts var1 and var2, the insert rank is 1.
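The notion of insert rank can be illustrated with a simplified Python sketch. It assumes, hypothetically, that each inferred insertion event is available as the set of alignment columns it introduced, and it treats every maximal run of contiguous inserted columns as one variable region whose rank is the number of events contributing to it. This is our simplification for illustration, not the detection procedure used in the study.

```python
def insert_ranks(events):
    """Group inserted columns into regions and count events per region.

    events: list of sets of alignment column indices, one set per
    inferred insertion event (hypothetical input format).
    Returns a list of (first_column, last_column, rank) tuples.
    """
    # Record which event introduced each column.
    owner = {}
    for k, cols in enumerate(events):
        for c in cols:
            owner[c] = k
    regions = []
    run, run_events = [], set()
    for c in sorted(owner):
        # A break in contiguity closes the current region.
        if run and c != run[-1] + 1:
            regions.append((run[0], run[-1], len(run_events)))
            run, run_events = [], set()
        run.append(c)
        run_events.add(owner[c])
    if run:
        regions.append((run[0], run[-1], len(run_events)))
    return regions


# Toy example: an insert at columns 10-16 built from three events (the
# second lies inside the first, the third inside the second), and an
# unrelated block insert at columns 30-32.
events = [{10, 11, 15, 16}, {12, 14}, {13}, {30, 31, 32}]
print(insert_ranks(events))  # [(10, 16, 3), (30, 32, 1)]
```

In this toy case the first region has rank 3 and the second rank 1, mirroring the distinction between a nested insert and a block insert.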
Mapping of insertions into structures
Insertions were mapped onto each domain's original 3D coordinates for visualization. The temperature factors (B-factors) of insert residues in the PDB format file were replaced with values scaled according to the insert rank. After mapping, the color of residues in the insert region shifts from blue (rank 1) to red (the highest rank).
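A minimal Python sketch of this mapping is given below. It relies on the fixed-column PDB format, in which the residue number occupies columns 23-26 and the temperature factor columns 61-66 of ATOM records. The linear 0-100 scaling, the function names and the example residue numbers are our assumptions, since the exact scaling is not specified here.

```python
def set_rank_bfactors(pdb_lines, rank_by_resnum, max_rank):
    """Overwrite the B-factor column of ATOM/HETATM records.

    rank_by_resnum: dict residue number -> insert rank (hypothetical
    input); residues not listed are treated as rank 0 (core).
    The linear scaling to 0-100 is an assumption.
    """
    out = []
    for line in pdb_lines:
        if line.startswith(("ATOM", "HETATM")) and len(line) >= 66:
            resnum = int(line[22:26])             # columns 23-26
            rank = rank_by_resnum.get(resnum, 0)
            b = 100.0 * rank / max_rank
            line = line[:60] + f"{b:6.2f}" + line[66:]  # columns 61-66
        out.append(line)
    return out


def make_atom_line(serial, resnum, x, y, z):
    """Build a fixed-column ATOM record for the example (backbone N of MET)."""
    return ("ATOM  " + f"{serial:5d}" + " " + " N  " + " " + "MET" + " "
            + "A" + f"{resnum:4d}" + " " + "   "
            + f"{x:8.3f}{y:8.3f}{z:8.3f}" + f"{1.0:6.2f}{9.67:6.2f}")


lines = [make_atom_line(1, 12, 27.340, 24.430, 2.614),
         make_atom_line(2, 13, 26.266, 25.413, 2.842)]
# Residue 12 belongs to a rank-2 insert; residue 13 is part of the core.
recolored = set_rank_bfactors(lines, {12: 2}, max_rank=3)
print(recolored[0][60:66], recolored[1][60:66])
```

Molecular viewers such as VMD and PyMOL can then color residues by the B-factor field, so a blue-to-red scale reflects the rank stored there.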
Determine complex and novel inserted sub-structure
To study the potential of an insert to be a novel fold unit, we define structural complexity as follows: a sub-structure is complex if its similarity to itself is considered significant. Sub-structures whose self-similarity is not significant are too simple to be considered (e.g. a small segment of helix, a very short polypeptide, etc.). We restricted the evaluation of complexity to block inserts with rank = 1 and length ≥ 10, and to nested inserts with rank > 2 and length ≥ 10. Every insert extracted from a SCOP domain was aligned against its source domain using FATCAT. An insert was then considered complex if the P-value of the structural alignment was <0.05. The P-value determined by FATCAT provides a reasonable threshold for assigning significance to structural similarity. The secondary structure assignments of the complex inserts were analyzed with Stride [40, 41].
To investigate whether there was any direct evidence that insertions created novel structures, all the complex inserts were compared to domains from other superfamilies in the ASTRAL SCOP 1.69 domain subset (<95% sequence homology) using FATCAT. A significant match is defined as an alignment with P-value < 0.05 and an aligned length >80% of the insert size. An insert is considered a novel structural unit if it has no significant match in the SCOP domain subset.
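The selection and classification criteria above can be summarized as a small decision procedure. The Python sketch below encodes only the thresholds stated in the text; the function names and the input format (P-values and aligned lengths assumed to have been parsed from FATCAT output beforehand) are hypothetical.

```python
P_CUTOFF = 0.05       # significance threshold on the FATCAT P-value
MIN_LENGTH = 10       # minimum insert length evaluated
MIN_COVERAGE = 0.80   # aligned length must exceed this fraction of the insert


def is_candidate(rank, length):
    """Block inserts (rank 1) or nested inserts with rank > 2, length >= 10."""
    return length >= MIN_LENGTH and (rank == 1 or rank > 2)


def is_complex(self_p_value):
    """Complex if the insert aligns significantly to its own source domain."""
    return self_p_value < P_CUTOFF


def is_novel(insert_length, hits):
    """Novel if no domain from another superfamily matches significantly.

    hits: iterable of (p_value, aligned_length) tuples, one per comparison.
    """
    for p_value, aligned_length in hits:
        if p_value < P_CUTOFF and aligned_length > MIN_COVERAGE * insert_length:
            return False
    return True


# Toy example: a rank-3 insert of 40 residues with made-up alignment results.
rank, length, self_p = 3, 40, 0.001
hits = [(0.20, 38), (0.01, 25)]  # neither hit meets both criteria
if is_candidate(rank, length) and is_complex(self_p):
    print("novel" if is_novel(length, hits) else "matches a known domain")
```

Both criteria must hold for a hit to count: a low P-value over a short stretch, or a long alignment that is not significant, does not disqualify an insert from being considered novel.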
The statistical distributions of inserts (Figure 1 and Figure 2) were generated using R. The phylogenetic trees were rendered with TreeView 1.6.6. The illustrations of protein structures were prepared with VMD and PyMOL. Sequence-based phylogenetic analysis was performed with both the distance-based method provided by PHYLIP and the maximum likelihood method PHYML [47, 48]. The distance-based phylogenies were built with the PROTDIST program and the JTT substitution matrix. The programs SEQBOOT and CONSENSE were used to estimate the confidence of branches from 1000 bootstrap replicates. The maximum likelihood phylogeny was built on the PHYML web server using the JTT substitution matrix, with all other options kept at their default settings. The annotated multiple sequence alignment of Figure 4(a) was generated using JalView.
This work was supported by Genome Atlantic under the Prokaryotic Evolution and Diversity grant and the NSERC Discovery grant 298397-04 (CB). The authors thank Yuzhen Ye and Adam Godzik for providing the flexible structural alignment program FATCAT.
- Koonin EV, Wolf YI, Karev GP: The structure of the protein universe and genome evolution. Nature 2002, 420: 218–223. 10.1038/nature01256
- Aravind L, Mazumder R, Vasudevan S, Koonin EV: Trends in protein evolution inferred from sequence and structure analysis. Curr Opin Struct Biol 2002, 12: 392–399. 10.1016/S0959-440X(02)00334-2
- Dokholyan NV, Shakhnovich B, Shakhnovich EI: Expanding protein universe and its origin from the biological Big Bang. Proc Natl Acad Sci USA 2002, 99: 14132–14136. 10.1073/pnas.202497999
- Karev GP, Wolf YI, Rzhetsky AY, Berezovskaya FS, Koonin EV: Birth and death of protein domains: a simple model of evolution explains power law behavior. BMC Evol Biol 2002, 2: 18. 10.1186/1471-2148-2-18
- Wolf Y, Madej T, Babenko V, Shoemaker B, Panchenko AR: Long-term trends in evolution of indels in protein sequences. BMC Evol Biol 2007, 7: 19. 10.1186/1471-2148-7-19
- Blouin C, Butt D, Roger AJ: Rapid evolution in conformational space: a study of loop regions in a ubiquitous GTP binding domain. Protein Sci 2004, 13: 608–616. 10.1110/ps.03299804
- Grishin NV: Fold change in evolution of protein structures. J Struct Biol 2001, 134: 167–185. 10.1006/jsbi.2001.4335
- Chothia C, Lesk AM: The relation between the divergence of sequence and structure in proteins. EMBO J 1986, 5: 823–826.
- Benner SA, Cohen MA, Gonnet GH: Empirical and structural models for insertions and deletions in the divergent evolution of protein. J Mol Biol 1993, 229: 1065–1082. 10.1006/jmbi.1993.1105
- Pascarella S, Argos P: Analysis of insertions/deletions in protein structures. J Mol Biol 1992, 224: 461–471. 10.1016/0022-2836(92)91008-D
- Panchenko AR, Madej T: Structural similarity of loops in protein families: toward the understanding of protein evolution. BMC Evol Biol 2005, 5: 10. 10.1186/1471-2148-5-10
- O'Donoghue P, Luthey-Schulten Z: On the evolution of structure in aminoacyl-tRNA synthetases. Microbiol Mol Biol Rev 2003, 67: 550–573. 10.1128/MMBR.67.4.550-573.2003
- O'Donoghue P, Luthey-Schulten Z: Evolutionary profiles derived from the QR factorization of multiple structural alignments gives an economy of information. J Mol Biol 2005, 346: 875–894. 10.1016/j.jmb.2004.11.053
- Saitou N, Nei M: The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol Biol Evol 1987, 4: 406–425.
- Casbon J, Saqi MA: S4: structure-based sequence alignments of SCOP superfamilies. Nucleic Acids Res 2005, 33: D219-D222. 10.1093/nar/gki043
- Notredame C, Higgins DG, Heringa J: T-Coffee: A novel method for fast and accurate multiple sequence alignment. J Mol Biol 2000, 302: 205–217. 10.1006/jmbi.2000.4042
- Ye Y, Godzik A: Flexible structure alignment by chaining aligned fragment pairs allowing twists. Bioinformatics 2003, 19(Suppl 2):ii246-ii255.
- Murzin AG, Brenner SE, Hubbard T, Chothia C: SCOP: a structural classification of proteins database for the investigation of sequences and structures. J Mol Biol 1995, 247: 536–540. 10.1006/jmbi.1995.0159
- Hall BG, Barlow M: Structure-based phylogenies of the serine β-Lactamases. J Mol Evol 2003, 57: 255–260. 10.1007/s00239-003-2473-y
- Petersen EI, Valinger G, Solkner B, Stubenrauch G, Schwab H: A novel esterase from Burkholderia gladioli which shows high deacetylation activity on cephalosporins is related to beta-lactamases and DD-peptidases. J Biotechnol 2001, 89: 11–25. 10.1016/S0168-1656(01)00284-X
- Wagner UG, Petersen EI, Schwab H, Kratky C: EstB from Burkholderia gladioli: a novel esterase with a beta-lactamase fold reveals steric factors to discriminate between esterolytic and beta-lactam cleaving activity. Protein Sci 2002, 11: 467–478. 10.1110/ps.33002
- Ribas De Pouplana L, Brown JR, Schimmel P: Structure-based phylogeny of Class IIa tRNA synthetases in relation to an unusual biochemistry. J Mol Evol 2001, 53: 261–268. 10.1007/s002390010216
- Breitling R, Laubner D, Adamski J: Structure-based phylogenetic analysis of short-chain alcohol dehydrogenases and reclassification of the 17beta-hydroxysteroid dehydrogenases family. Mol Biol Evol 2001, 18: 2154–2161.
- Panchenko AR, Madej T: Analysis of protein homology by assessing the (dis)similarity in protein loop regions. Proteins 2004, 57: 539–547. 10.1002/prot.20237
- Shindyalov IN, Bourne PE: Protein structure alignment by incremental combinatorial extension (CE) of the optimal path. Protein Engng 1998, 11: 739–747. 10.1093/protein/11.9.739
- Holm L, Sander C: Protein structure comparison by alignment of distance matrices. J Mol Biol 1993, 233: 123–138. 10.1006/jmbi.1993.1489
- Shatsky M, Nussinov R, Wolfson HJ: FlexProt: Alignment of flexible protein structures without a predefinition of hinge regions. J Comput Biol 2004, 11: 83–106. 10.1089/106652704773416902
- Ye Y, Godzik A: Database search by flexible protein structure alignment. Protein Sci 2004, 13: 1841–1850. 10.1110/ps.03602304
- Van Walle I, Lasters I, Wyns L: SABmark – a benchmark for sequence alignment that covers the entire known fold space. Bioinformatics 2005, 21: 1267–1268. 10.1093/bioinformatics/bth493
- Mizuguchi K, Deane CM, Blundell TL, Overington JP: HOMSTRAD: a database of protein structure alignments for homologous families. Protein Sci 1998, 7: 2469–71.
- Balaji S, Sujatha S, Kumar SS, Srinivasan N: PALI-a database of Phylogeny and ALIgnment of homologous protein structures. Nucleic Acids Res 2001, 29: 61–5. 10.1093/nar/29.1.61
- Sali A, Blundell TL: The definition of topological equivalence in homologous and analogous structures: A procedure involving comparison of local properties and structural relationships through dynamic programming and simulated annealing. J Mol Biol 1990, 212: 403–428. 10.1016/0022-2836(90)90134-8
- Zhu ZY, Sali A, Blundell TL: A variable gap penalty function and feature weights for protein 3-D structure comparisons. Protein Eng 1992, 5: 43–51. 10.1093/protein/5.1.43
- Russell RB, Barton GJ: Multiple protein sequence alignment from tertiary structure comparison: assignment of global and residue confidence levels. Proteins 1992, 14: 309–23. 10.1002/prot.340140216
- Chandonia JM, Hon G, Walker NS, Lo Conte L, Koehl P, Levitt M, Brenner SE: The ASTRAL compendium in 2004. Nucl Acids Res 2004, 32: D189-D192. 10.1093/nar/gkh034
- Berman HM, Bhat TN, Bourne PE, Feng Z, Gilliland G, Weissig H, Westbrook J: The Protein Data Bank and the challenge of structural genomics. Nat Struct Biol 2000, (Suppl 7):957–959. 10.1038/80734
- O'Sullivan O, Suhre K, Abergel C, Higgins DG, Notredame C: 3DCoffee: Combining protein sequences and structures within multiple sequence alignments. J Mol Biol 2004, 340: 385–395. 10.1016/j.jmb.2004.04.058
- Poirot O, Suhre K, Abergel C, O'Toole E, Notredame C: 3DCoffee@igs: a web server for combining sequences and structures into a multiple sequence alignment. Nucleic Acids Res 2004, 32: W37-W40. 10.1093/nar/gkh382
- Taylor WR: Protein structure comparison using SAP. Methods Mol Biol 2000, 143: 19–32.
- Frishman D, Argos P: Knowledge-based protein secondary structure assignment. Proteins 1995, 23: 566–579. 10.1002/prot.340230412
- Heinig M, Frishman D: STRIDE: a web server for secondary structure assignment from known atomic coordinates of proteins. Nucleic Acids Res 2004, 32: W500–2. 10.1093/nar/gkh429
- R Development Core Team: R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria; 2005. [http://www.R-project.org] ISBN 3-900051-07-0
- Eisen MB, Spellman PT, Brown PO, Botstein D: Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci USA 1998, 95: 14863–14868. 10.1073/pnas.95.25.14863
- Humphrey W, Dalke A, Schulten K: VMD-visual molecular dynamics. J Mol Graph 1996, 14: 33–38. 10.1016/0263-7855(96)00018-5
- DeLano WL: The PyMOL Molecular Graphics System. DeLano Scientific, San Carlos, CA, USA; 2002. [http://www.pymol.org]
- Felsenstein J: Phylogenies from molecular sequences: inference and reliability. Annu Rev Genet 1988, 22: 521–565. 10.1146/annurev.ge.22.120188.002513
- Guindon S, Gascuel O: A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood. Systematic Biology 2003, 52: 696–704. 10.1080/10635150390235520
- Guindon S, Lethiec F, Duroux P, Gascuel O: PHYML Online – a web server for fast maximum likelihood-based phylogenetic inference. Nucleic Acids Res 2005, 33: W557–9. 10.1093/nar/gki352
- Clamp M, Cuff J, Searle SM, Barton GJ: The Jalview Java Alignment Editor. Bioinformatics 2004, 20: 426–7. 10.1093/bioinformatics/btg430
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.