- Methodology article
- Open Access
Clustering of protein families into functional subtypes using Relative Complexity Measure with reduced amino acid alphabets
© Albayrak et al; licensee BioMed Central Ltd. 2010
Received: 21 January 2010
Accepted: 18 August 2010
Published: 18 August 2010
Phylogenetic analysis can be used to divide a protein family into subfamilies in the absence of experimental information. Most phylogenetic analysis methods utilize multiple alignment of sequences and are based on an evolutionary model. However, multiple alignment is not an automated procedure and requires human intervention to maintain alignment integrity and to produce phylogenies consistent with the functional splits in underlying sequences. To address this problem, we propose to use the alignment-free Relative Complexity Measure (RCM) combined with reduced amino acid alphabets to cluster protein families into functional subtypes purely on sequence criteria. Comparison with an alignment-based approach was also carried out to test the quality of the clustering.
We demonstrate the robustness of RCM with reduced alphabets in clustering of protein sequences into families in a simulated dataset and seven well-characterized protein datasets. On protein datasets, crotonases, mandelate racemases, nucleotidyl cyclases and glycoside hydrolase family 2 were clustered into subfamilies with 100% accuracy whereas acyl transferase domains, haloacid dehalogenases, and vicinal oxygen chelates could be assigned to subfamilies with 97.2%, 96.9% and 92.2% accuracies, respectively.
The overall combination of methods in this paper is useful for clustering protein families into subtypes based solely on protein sequence information. The method is also flexible and computationally fast because it does not require multiple alignment of sequences.
Proteins that evolve from a common ancestor can change functionality over time and produce highly divergent protein families that can be divided into subfamilies with similar but distinct functions (i.e., functional subfamilies or subtypes). Identification of subfamilies using protein sequence information can be carried out using phylogenetic methods that can reveal the evolutionary relationship between proteins by clustering similar proteins together in a phylogenetic tree [3–5]. The most common method for identifying similarities in sequences through phylogenetic analysis starts with the construction of a multiple alignment of homologous sequences using a substitution matrix. Multiple alignment scores are then transformed into a distance matrix to construct a phylogenetic tree. Often the branching order of a phylogenetic tree exactly matches the known functional split between proteins, and branch lengths are proportional to the extent of evolutionary changes since the last common ancestor.
Multiple sequence alignment (MSA) is constructed using a scoring scheme that rewards or penalizes each substitution, insertion and deletion to obtain an optimum alignment of the given sequences. The quality of an MSA depends on the chosen parameters, which are entered manually, and expert handling is almost always required to maintain alignment integrity by observing general trends in each protein family. As such, different alignment parameters may yield different phylogenetic trees, and the trees are only as good as the MSA from which they are derived [6, 7].
Phylogenetic analysis is broadly divided into two groups of methods. Algorithms in the first group calculate a matrix representing the distance between each pair of sequences and then transform this matrix into a tree using a tree-clustering algorithm. Algorithms in the first group utilize various distance measures with different models to account for nucleotide or amino acid substitutions. In the second group, the tree that can best explain the observed sequences under the chosen evolutionary model is found by evaluating the fitness of different tree topologies [6, 8]. The second group can further be divided into two subgroups based on the optimality criterion used in tree evaluation: maximum parsimony and maximum likelihood. Under maximum parsimony, the preferred phylogenetic tree is the tree that requires the least evolutionary change to explain the observed data, whereas under maximum likelihood [9, 10], it is the most probable tree under the chosen evolutionary assumption.
The prediction of subfamilies from protein MSAs has been carried out previously by comparing subfamily hidden Markov models, using subfamily-specific sequence profiles, analyzing positional entropies in an alignment, and applying an ascending hierarchical method [4, 5, 11, 12]. All of these methods require an alignment of biological sequences that assumes some sort of evolutionary model. Computational complexity and the inherent ambiguity of the alignment cost criteria are two major problems in MSA, along with the controversial evolutionary models that are used to explain the alignments.
A novel approach for phylogenetic analysis based on the Relative Complexity Measure (RCM) of whole genomic sequences has been previously proposed by Otu et al.; it eliminates the need for MSA and produces successful phylogenies on real and simulated datasets. The algorithm employs Lempel-Ziv (LZ) complexity and produces a score for each sequence pair that can be interpreted as the "closeness" of the pair. Unequal sequence length or different positioning of similar regions along sequences (such as different gene order in genomes) is not an issue, as the method has been shown to handle both cases naturally. Moreover, RCM does not use any approximations or assumptions in calculating the distance between sequences. Therefore, RCM utilizes the information contained in sequences and requires no human intervention.
Application of RCM to genomic sequences for phylogenetic analysis was successfully carried out on various datasets [8, 14]. Moreover, Liu et al. extended this method to integrate the hydropathy profile and a different LZ-based distance measure for phylogenetic analysis of protein sequences. Russell et al. used a merged amino acid alphabet of 11 characters to represent all amino acids and reduce complexity before calculating a pairwise distance measure; this measure served as the pairwise scoring function that determines the order in which sequences are joined in a multiple sequence alignment.
Application of RCM to genomic sequences is relatively straightforward, since RCM based on Lempel-Ziv complexity scores can capture each mutation in DNA sequences and register it as an increase in the complexity scores of the compared sequences. However, substitution of one residue for another in proteins is tolerable as long as the substituted residue is not highly conserved and the physicochemical and structural properties of the substituted and the native residues are not fundamentally different [17–19]. Hydropathy-index-based grouping of residues is one preprocessing option for capturing only the mutations that would not be tolerated in a protein sequence, since the LZ algorithm cannot account for amino acid substitution frequencies and similarity scores. Hence, any application that uses RCM to generate a distance matrix of protein sequences should first recode the sequences with a reduced amino acid alphabet (RAAA) before calculating their RCMs.
In this paper, we utilize RCM with different reduced amino acid alphabets and assess RCM's potential in clustering protein families into functional subtypes based solely on sequence data. This method clustered seven well-characterized protein families into their functional subtypes with 92% - 100% accuracy.
JTT-dcmut was chosen as the amino acid substitution model.
A power law insertion/deletion length distribution model with a = 1.7 and a maximum allowed insertion/deletion length of 500 was used.
Both insertion and deletion rates were set to the default parameter of 0.1 relative to average substitution rate of 1%.
The length of the root protein sequence was set to 500.
The rooted tree with 10 taxa that reflects the true phylogenetic evolution of the sequences was generated along with the true MSA from which the true tree was inferred.
The true MSA was then input into ClustalW2 and the bootstrap tree was generated (1000 bootstrap trials, including positions with gaps, and correcting for multiple substitutions).
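The insertion/deletion length model in the settings above can be illustrated with a short sketch. It assumes that the power law model assigns each length L between 1 and the maximum a probability proportional to L^(-a); the function names are hypothetical and this is not the simulator's own code.

```python
import random

def power_law_pmf(a=1.7, max_len=500):
    """Return P(L) for L = 1..max_len, with P(L) proportional to L**(-a)."""
    weights = [length ** (-a) for length in range(1, max_len + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def sample_indel_lengths(n, a=1.7, max_len=500, seed=0):
    """Draw n indel lengths from the truncated power law distribution."""
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    pmf = power_law_pmf(a, max_len)
    return rng.choices(range(1, max_len + 1), weights=pmf, k=n)

pmf = power_law_pmf()
lengths = sample_indel_lengths(1000)
```

Under this assumption short indels dominate: a length of 1 is 2^1.7 (about 3.2) times more likely than a length of 2.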
General Properties of the Datasets (table; only the column headers "# of sequences" and "# of subfamilies" and the row label "Vicinal oxygen chelates" survive; table body not recovered)
Reduced Amino Acid Alphabets
Sequence space of proteins is redundant and generates only a limited number of folds, domains, and structures. Various strategies have been devised that take a coarse-grained approach to account for the degeneracy of sequences by grouping similar amino acids together [17–19, 27–30]. Grouping is usually carried out based on structural and physicochemical similarities of amino acids. Grouping of amino acids in sequence space can help develop prediction methods for various sequence determinants and decrease the amount of search space in procedures employed in directed evolution experiments [26, 31]. One of the finest examples is the reduction of the amino acid alphabet into a binary code composed of characters representing polar and non-polar amino acid residues. Grouping of amino acid residues has also been used extensively in the Hydrophobic-Polar (HP) lattice model to explain the hydrophobic collapse theory of protein folding.
A recent study by Peterson et al. tested the performance of over 150 RAAAs on the sequence library from the DALIpdb90 database and showed that RAAAs improve sensitivity and specificity in fold prediction between protein sequence pairs with high structural similarity and low sequence identity.
Amino acids that are within the same group in a RAAA are considered identical . Substitution matrices that assign the same similarity score to each amino acid within the same group were obtained from reference . For those RAAAs in the EB scheme and the three random RAAAs, new substitution matrices were created from BLOSUM62 frequency counts using the same procedure outlined in reference .
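Treating all residues of a group as identical amounts to recoding each sequence before any comparison. The sketch below shows one way to do this; the two-group hydrophobic/polar alphabet is a hypothetical illustration and is not one of the RAAAs evaluated in this paper, and the function names are assumptions.

```python
def build_recoding_table(groups):
    """Map every residue to the first letter of the group that contains it."""
    table = {}
    for group in groups:
        representative = group[0]
        for residue in group:
            if residue in table:
                raise ValueError(f"residue {residue!r} appears in two groups")
            table[residue] = representative
    return table

def recode(sequence, table):
    """Rewrite a sequence in the reduced alphabet.

    Symbols that are not in the table (for example X) are left unchanged.
    """
    return "".join(table.get(residue, residue) for residue in sequence.upper())

# Hypothetical binary alphabet: hydrophobic residues vs. all others.
HP_GROUPS = ["AVLIMFWC", "GSTPYNQDEKRH"]
HP_TABLE = build_recoding_table(HP_GROUPS)

reduced = recode("AAILNAIIANNL", HP_TABLE)  # "AAAAGAAAAGGA"
```

After recoding, substitutions within a group no longer change the sequence, so only between-group substitutions can affect a downstream distance.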
In this paper, a normalized distance measure that was previously used for phylogenetic tree construction of whole genome sequences was employed. The distance measure is based on Lempel-Ziv complexity and is known to accurately cluster all related genomic sequences under one branch of the tree.
Example: for the sequence X = AAILNAIIANNL, the exhaustive history H_E(X) contains seven components, so c(X) = 7.
where c(XY) and c(YX) are the LZ complexities of the concatenated sequences XY (X followed by Y) and YX (Y followed by X), respectively. The remaining four LZ-based distance measures defined in Otu et al. performed slightly worse than the above distance (data not shown). Although the differences in performance between the five measures were not significant, we adopted the aforementioned distance for its ability to account for length variance.
Distance Matrix & Phylogenetic Tree
The relative complexity measure (RCM) was used to create the distance matrix as previously described. Phylogenetic trees were generated from distance matrices using the neighbor-joining program of the phylogeny inference package, PHYLIP 3.68. Unrooted trees were rooted with midpoint rooting by placing the root halfway between the two most distant taxa. Midpoint-rooted trees were converted to cladograms (i.e., branch lengths were discarded) using the Retree program of the PHYLIP package. Phylogenetic trees for all protein families and RAAAs are provided in the supplementary materials (Additional File 2) in Newick format and can be visualized with a tree-drawing program.
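Handing a distance matrix to a PHYLIP distance program requires a plain-text file in PHYLIP's square matrix layout: a first line with the number of taxa, then one row per taxon that begins with a name occupying ten characters. The helper below is a minimal sketch of that step; the function name and the six-decimal formatting are assumptions.

```python
def to_phylip_matrix(names, matrix):
    """Format a square distance matrix in PHYLIP's square layout."""
    labels = [name[:10].ljust(10) for name in names]  # names use 10 characters
    if len(set(labels)) != len(labels):
        raise ValueError("names are not unique within the first 10 characters")
    if any(len(row) != len(names) for row in matrix) or len(matrix) != len(names):
        raise ValueError("matrix must be square and match the number of names")
    lines = [f"{len(names):5d}"]
    for label, row in zip(labels, matrix):
        values = " ".join(f"{distance:.6f}" for distance in row)
        lines.append(f"{label} {values}")
    return "\n".join(lines) + "\n"

text = to_phylip_matrix(["seqA", "seqB"], [[0.0, 0.5], [0.5, 0.0]])
```

Truncating names to ten characters can make two taxa indistinguishable, which is why the sketch rejects duplicate labels instead of writing an ambiguous file.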
Protein sequences in each family were aligned using ClustalW2 for comparison with RCM. MSAs were performed using updated substitution matrices with the gap extension and gap opening penalties provided in Table 2. Bootstrap analyses were carried out with 100 replicates, and trees containing bootstrap values were created using ClustalW2 with the neighbor-joining clustering algorithm. For convenience, MSAs that were carried out using ClustalW2 will be referred to as the MSA or the MSA method for the rest of the article.
Tree Based Classification (TBC)
The TBC algorithm was used to check the accuracy of each tree in separating protein families into subfamilies. TBC divides a tree into disjoint subtrees and assigns a protein subfamily to the subtree that maximizes the number of true positives when the proportions fp/(tp+fp) and fn/(tp+fn) are both equal to 0.5 for a given subtree, where fp is the number of false positives, fn is the number of false negatives and tp is the number of true positives. These proportions correspond to the "maximal allowed contamination" level that minimizes the TBC error over the whole tree.
TBC requires a bifurcating tree of the sequences in a protein family and an attribute file that contains the expert-curated assignment of each sequence to a particular subfamily. TBC accuracy (i.e., the percentage of correctly classified sequences) is the primary performance measure used to evaluate the division of protein families into subtypes with the TBC algorithm. TBC accuracy is equal to 1 − %TBC error, where %TBC error is the total number of fp, fn, and unclassified sequences divided by the total number of sequences. For a detailed analysis of the TBC algorithm, refer to reference .
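The accuracy definition above is simple arithmetic and can be checked with a few lines of code. The function name is hypothetical, and the example counts are illustrative: a family of 75 sequences with a single misassigned sequence, where the choice of error type (fp here) is arbitrary.

```python
def tbc_accuracy(n_sequences, fp=0, fn=0, unclassified=0):
    """TBC accuracy = 1 - (fp + fn + unclassified) / total number of sequences."""
    errors = fp + fn + unclassified
    if n_sequences <= 0 or not 0 <= errors <= n_sequences:
        raise ValueError("error count must lie between 0 and n_sequences")
    return 1.0 - errors / n_sequences

# One misassigned sequence out of 75 (illustrative counts).
accuracy_percent = round(100 * tbc_accuracy(75, fp=1), 1)  # 98.7
```

The same arithmetic gives 97.2% for 5 errors among 177 sequences.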
The proposed algorithm operates on a set of sequences in FASTA format. After one of the alphabets given in Table 2 is applied to all the sequences in the dataset, RCMs are calculated and used to obtain the distance between each pair of sequences, and neighbor-joining clustering is applied to these distances to create a phylogenetic tree. For each RAAA, a single tree based on RCM is generated and analyzed using the TBC algorithm to determine how well it clusters different subfamilies under different branches of the tree.
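The first steps of this workflow, reading FASTA records and filling a symmetric pairwise distance matrix, can be sketched as below. The distance function is passed in as a parameter; the toy distance used in the example is only a stand-in for the RCM-based distance, and all names are hypothetical.

```python
from itertools import combinations

def read_fasta(text):
    """Parse FASTA-formatted text into an ordered {name: sequence} dictionary."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].strip()
            # Use the first word of the header as the name.
            name = header.split()[0] if header else f"seq{len(records) + 1}"
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {key: "".join(parts) for key, parts in records.items()}

def pairwise_matrix(sequences, distance):
    """Evaluate distance once per unordered pair and mirror it in the matrix."""
    names = list(sequences)
    n = len(names)
    matrix = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        d = distance(sequences[names[i]], sequences[names[j]])
        matrix[i][j] = matrix[j][i] = d
    return names, matrix

fasta_text = ">s1 first\nAAIL\nNA\n>s2\nAAILNAII\n>s3\nAA\n"
seqs = read_fasta(fasta_text)
# Toy stand-in distance: absolute difference in sequence length.
names, matrix = pairwise_matrix(seqs, lambda a, b: float(abs(len(a) - len(b))))
```

Keeping the distance as a parameter lets the same loop serve both the original 20-letter sequences and any recoded version of them.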
For the simulated dataset, three phylogenetic trees were compared: the true tree generated by INDELible, the bootstrap tree and the RCM tree. INDELible creates a true MSA of the simulated protein sequences. This alignment was used in ClustalW2 and bootstrapped 1000 times, and the resulting tree was called the bootstrap tree. The third tree is the RCM tree that was generated by the proposed approach.
For the seven protein datasets, first, the original FASTA sequences were used to calculate RCMs and their associated RCM trees. Second, the original FASTA sequences were recoded using different RAAAs (Table 2) and the reduced sequences were used to calculate their RCMs and the associated RCM trees.
A similar procedure was applied to the phylogenetic trees using the MSA method. For each protein family, MSA was carried out using the corresponding substitution matrices and gap penalties provided in Table 2. MSA-based trees were created following bootstrap analysis (100 replicates) with ClustalW2.
Results and Discussion
Performance of the RCM approach
The crotonase family dataset contains 467 protein sequences from 13 different subfamilies. Its members catalyze diverse metabolic reactions, with certain family members displaying dehalogenase, hydratase, and isomerase activities. TBC accuracy varied between 96.4% and 100% for RCM. The top performing RAAA with the smallest size was GBMR4, which resulted in 100% TBC accuracy. TBC accuracy was 100% for all RAAAs tested with MSA.
The mandelate racemase dataset contains 184 sequences that are assigned to 8 expert-curated subfamilies. All mandelate racemases contain a conserved histidine, presumably acting as an active site base. When the RCM approach was tested on mandelate racemases, all resulting trees showed correct assignment of functional subfamilies into 8 different clusters with 100% accuracy using all alphabets except GBMR4, which resulted in 96.7% TBC accuracy.
Vicinal oxygen chelates (VOC)
The VOC family contains 309 sequences from 18 different subfamilies. TBC accuracy varied between 77.7% and 92.2% for RCM and between 81.9% and 91.3% for MSA. Members of VOC have an average sequence length of 294 amino acids and a mean PID of 14% (Table 1). The low PID and the highly divergent nature of this family make its subfamilies more susceptible to misclassification than those of other families when only sequence information is used. In this dataset, EB8 performed better than the 20-letter alphabet (92.2% vs. 91.3%) with RCM, while the GBMR4, ML4, EB8, EB, EB13 and 20-letter alphabets resulted in 91.3% TBC accuracy with MSA.
TBC errors for top performing RAAA (table; only the header fragments "Statistics for top performing" and "Top performing RAAAs" survive; table body not recovered)
The nucleotidyl cyclase family has two functional subfamilies, adenylate and guanylate cyclases, which correspond to use of the substrates ATP and GTP, respectively. The nucleotidyl cyclase family with 33 adenylate cyclases and 42 guanylate cyclases was clustered into two distinct subfamilies with 100% accuracy using both methods and all RAAAs except EB5 and EB8 for RCM and ML4 and EB5 for MSA, all of which resulted in 98.7% accuracy (Table 4). Moreover, the clustering result for the nucleotidyl cyclases is in agreement with the result obtained previously by the MSA-dependent clustering algorithm that uses the residues with the highest evolutionary split statistic to split protein families into functional subfamilies.
Acyl transferases (AT)
The AT domains of Type I modular polyketide synthases are responsible for substrate selection. Most incorporate either a C2 unit (malonyl-CoA substrate) or a C3 unit (methylmalonyl-CoA substrate). The choice of substrate can be deduced from the chemical structure of the polyketide product. In the acyl transferase dataset, 99 of the 177 sequences use C2 units whereas 78 use C3 units as substrate.
Previously, Goldstein et al. used the evolutionary split statistic and clustered the AT domains into 2 subfamilies with 2 false assignments for the 5-residue-long motif. The number of false assignments increased to 5 with increasing motif length (up to 30 residues long), suggesting that a larger motif increases the noise and error rate. As such, inclusion of only 5 residues (less noise) with high split statistics increases the assignment accuracy (2 vs. 5 false assignments).
A similar trend is observed in the case of RCM. While the TBC accuracy for AT domains was only 91% (15 false assignments) with the 20-letter alphabet (Table 4), the accuracy increased to 97% (5 false assignments) with the ML4, ML8, EB9, ML10, EB11, SDM12, EB13, and HSDM17 alphabets. Furthermore, 4 of the 5 sequences misclassified using the above reduced alphabets are contained in the 2, 3 and 4 false assignments produced by the approach of Goldstein et al. using the 5, 10 and 15 residue-long motifs, respectively. Although the accuracy was higher previously, it should be noted that the RCM approach required neither an MSA of the sequences nor any other sequence-based statistics. The accuracy was 97.2% for MSA using the top performing RAAAs. There was no immediate evidence suggesting a specific characteristic for incorrectly classified sequences.
Glycoside hydrolase family 2 (GH2)
The final dataset contains 33 members of the GH2 family with a (β/α)8 fold. The subfamilies and the number of sequences from each subfamily are β-galactosidases (6), β-mannosidases (12), β-glucuronidases (7) and exo-β-D-glucosaminidases (8). This dataset was used previously and chosen because it was cited as a "hard-to-align" dataset by classical alignment approaches. The GH2 family was clustered into 4 functional subfamilies with 100% accuracy using ML4 and GBMR4 - the two top performing RAAAs - with RCM (Table 4). TBC accuracy was 100% for all RAAAs tested with MSA.
The effect of the size of the RAAA on clustering performance
The comparison of RCM with MSA in terms of TBC accuracy and the percentage of TBC error is summarized in Table 4 for the 20-letter alphabet and the top performing RAAA with the minimum size. In cases where two RAAAs of the same size give identical TBC results, both of them are reported. Three trends can be observed from the data in Table 4.
First, for five of the seven families (crotonases, mandelate racemases, nucleotidyl cyclases, acyl transferases, and GH2 hydrolases), both methods perform comparably. For VOC, RCM outperforms MSA, while for haloacid dehalogenases, MSA slightly outperforms RCM. It is important to note that VOCs and dehalogenases have the two lowest mean PIDs (14% and 12%, respectively) and low mean sequence lengths with large standard deviations. Low PID and low sequence length are two features that make it difficult to infer relationships from sequence information alone. Nonetheless, TBC accuracies of both families with their respective top performing RAAAs are comparable to the results obtained from the protein families with higher mean PIDs and longer mean sequence lengths.
Second, either ML4 or GBMR4 is sufficient to obtain high TBC accuracy for all datasets except VOCs and haloacid dehalogenases. Indeed, apart from these two families, ML4 and GBMR4 produce results that are identical to or better than those of all other alphabets using either RCM or MSA. This implies that an alphabet size as small as 4 would be sufficient to capture most of the sequence information needed to infer relationships when both the mean PID and the length of the aligned regions in an MSA are above a certain threshold.
Third, for the datasets with low mean PIDs and average sequence lengths, a larger RAAA size may be required to obtain identical or better results than the 20-letter alphabet using both RCM and MSA. This is especially evident with the RCM approach. While the minimum RAAA size of the top performer was 4 for the 5 datasets that have relatively higher average sequence lengths and mean PIDs, it increases to 8 (EB8) for VOCs and 15 (ML15) for haloacid dehalogenases, which have mean PIDs of 14% and 12%, respectively. Moreover, a subtle but similar trend is also evident in the case of MSA. While the alphabet size of the top performer was 4 (GBMR4, ML4) for VOCs, it increased to 8 (ML8) for haloacid dehalogenases, implying that a larger RAAA size may perform better on sequences with lower sequence identities.
It is also interesting to note that the average TBC error for mandelate racemases, nucleotidyl cyclases and hydrolases with three random alphabets of size 4 varied between 0% and 15.6% for the MSA method. While the groupings of amino acids in the random alphabets do not have any physicochemical or structural significance that can justify this overall performance, the low percent TBC error may suggest that some subfamilies of these protein families are very tight, with small distances between their sequences and larger distances between different subfamilies. This scenario, coupled with the relatively longer sequences within these families (the top three families in terms of mean sequence length), may generate sufficiently long aligned regions with enough informative sites to produce a tree that correctly assigns subfamilies even when the reduced alphabet groupings do not have any structural or biological meaning.
However, the trend of low TBC error is not apparent using RCM with random alphabets. TBC errors of different protein families using random RAAAs (average of three random alphabets) were significantly higher than TBC errors using biologically meaningful reduced alphabets for all the families except racemases and nucleotidyl cyclases, both of which overlap with the results obtained with MSA.
The RCM approach with different RAAAs performs well in clustering protein families into functional subfamilies. Yet, it must be noted that there is no uniformly superior algorithm for tree-based subfamily clustering and that simple protein similarity measures combined with hierarchical clustering produce trees with reasonable and often high accuracy. Furthermore, if much time has passed since the evolution of different subfamilies, then sequences may have diverged beyond the point where simple phylogenetic analysis can give a clear distinction of subfamilies.
The application of RCM in generating meaningful phylogenetic trees has been previously tested on genomic sequences and made RCM a good alternative to MSA-based phylogenetic analysis. However, using RCM to measure the closeness of protein sequences was problematic because of the difficulty of accounting for amino acid substitutions. In this paper, we introduced an RAAA-based approach as a preprocessing step for protein sequences prior to calculating pairwise RCMs. Utilization of an RAAA that is consistent with the structure and function of the proteins, or an RAAA that reflects the general trends in the specific protein families under study, can result in successful phylogenies that cluster each protein superfamily into functional subfamilies.
In finding functional subtypes of a protein family, it is often of interest to find out whether the mechanisms that drive a certain clustering are of evolutionary or functional origin. Although these two signals may overlap and be hard to separate, RCM could be used to address this issue by finding differences in the exhaustive histories of two sequences when they are concatenated. The "words" that result in an observed difference can then be analyzed and correlated to a functional and/or evolutionary origin. We believe future work can focus on this direction, building on the current approach, which does not attempt to trace back the origin of differentiating sequence signals but provides a powerful method for clustering protein families into functional subtypes without using multiple sequence alignment.
The authors would like to thank Cem Meydan and Ozgur Gul for helpful discussions, and Eric Peterson for supplying the Perl script for the generation of substitution matrices. HHO is partially supported by a grant from The Dubai Harvard Foundation for Medical Research.
References
1. Wallace IM, Higgins DG: Supervised multivariate analysis of sequence groups to identify specificity determining residues. BMC Bioinformatics 2007, 8: 135. 10.1186/1471-2105-8-135
2. Georgi B, Schultz J, Schliep A: Partially-supervised protein subclass discovery with simultaneous annotation of functional residues. BMC Struct Biol 2009, 9: 68. 10.1186/1472-6807-9-68
3. Kelil A, Wang S, Brzezinski R, Fleury A: CLUSS: clustering of protein sequences based on a new similarity measure. BMC Bioinformatics 2007, 8: 286. 10.1186/1471-2105-8-286
4. Lazareva-Ulitsky B, Diemer K, Thomas PD: On the quality of tree-based protein classification. Bioinformatics 2005, 21(9):1876–1890. 10.1093/bioinformatics/bti244
5. Wicker N, Perrin GR, Thierry JC, Poch O: Secator: a program for inferring protein subfamilies from phylogenetic trees. Mol Biol Evol 2001, 18(8):1435–1441.
6. Brocchieri L: Phylogenetic inferences from molecular sequences: review and critique. Theor Popul Biol 2001, 59(1):27–40. 10.1006/tpbi.2000.1485
7. Baldauf SL: Phylogeny for the faint of heart: a tutorial. Trends Genet 2003, 19(6):345–351. 10.1016/S0168-9525(03)00112-4
8. Otu HH, Sayood K: A new sequence distance measure for phylogenetic tree construction. Bioinformatics 2003, 19(16):2122–2130. 10.1093/bioinformatics/btg295
9. Felsenstein J: Evolutionary trees from DNA sequences: a maximum likelihood approach. J Mol Evol 1981, 17(6):368–376. 10.1007/BF01734359
10. Nei M: Phylogenetic analysis in molecular evolutionary genetics. Annu Rev Genet 1996, 30: 371–403. 10.1146/annurev.genet.30.1.371
11. Hannenhalli SS, Russell RB: Analysis and prediction of functional sub-types from protein sequence alignments. J Mol Biol 2000, 303(1):61–76. 10.1006/jmbi.2000.4036
12. Brown DP, Krishnamurthy N, Sjolander K: Automated protein subfamily identification and classification. PLoS Comput Biol 2007, 3(8):e160. 10.1371/journal.pcbi.0030160
13. Ziv J, Lempel A: A universal algorithm for sequential data compression. IEEE Trans Inf Theory 1977, 23: 337–343. 10.1109/TIT.1977.1055714
14. Bastola DR, Otu HH, Doukas SE, Sayood K, Hinrichs SH, Iwen PC: Utilization of the relative complexity measure to construct a phylogenetic tree for fungi. Mycol Res 2004, 108(Pt 2):117–125. 10.1017/S0953756203009079
15. Liu N, Wang T: Protein-based phylogenetic analysis by using hydropathy profile of amino acids. FEBS Lett 2006, 580(22):5321–5327. 10.1016/j.febslet.2006.08.086
16. Russell DJ, Otu HH, Sayood K: Grammar-based distance in progressive multiple sequence alignment. BMC Bioinformatics 2008, 9: 306. 10.1186/1471-2105-9-306
17. Wang J, Wang W: A computational approach to simplifying the protein folding alphabet. Nat Struct Biol 1999, 6(11):1033–1038. 10.1038/14918
18. Etchebest C, Benros C, Bornot A, Camproux AC, de Brevern AG: A reduced amino acid alphabet for understanding and designing protein adaptation to mutation. Eur Biophys J 2007, 36(8):1059–1069. 10.1007/s00249-007-0188-5
19. Li T, Fan K, Wang J, Wang W: Reduction of protein sequence complexity by residue grouping. Protein Eng 2003, 16(5):323–330. 10.1093/protein/gzg044
20. Fletcher W, Yang Z: INDELible: a flexible simulator of biological sequence evolution. Mol Biol Evol 2009, 26(8):1879–1888. 10.1093/molbev/msp098
21. Kosiol C, Goldman N: Different versions of the Dayhoff rate matrix. Mol Biol Evol 2005, 22(2):193–199. 10.1093/molbev/msi005
22. Larkin MA, Blackshields G, Brown NP, Chenna R, McGettigan PA, McWilliam H, Valentin F, Wallace IM, Wilm A, Lopez R, et al.: Clustal W and Clustal X version 2.0. Bioinformatics 2007, 23(21):2947–2948. 10.1093/bioinformatics/btm404
23. Eddy SR: Profile hidden Markov models. Bioinformatics 1998, 14(9):755–763. 10.1093/bioinformatics/14.9.755
24. Pegg SC, Brown SD, Ojha S, Seffernick J, Meng EC, Morris JH, Chang PJ, Huang CC, Ferrin TE, Babbitt PC: Leveraging enzyme structure-function relationships for functional inference and experimental design: the structure-function linkage database. Biochemistry (Mosc) 2006, 45(8):2545–2555. 10.1021/bi052101l
25. Goldstein P, Zucko J, Vujaklija D, Krisko A, Hranueli D, Long PF, Etchebest C, Basrak B, Cullum J: Clustering of protein domains for functional and evolutionary studies. BMC Bioinformatics 2009, 10: 335. 10.1186/1471-2105-10-335
26. Strelets VB, Shindyalov IN, Lim HA: Analysis of peptides from known proteins: clusterization in sequence space. J Mol Evol 1994, 39(6):625–630. 10.1007/BF00160408
27. Dill KA: Theory for the folding and stability of globular proteins. Biochemistry (Mosc) 1985, 24(6):1501–1509. 10.1021/bi00327a032
28. Murphy LR, Wallqvist A, Levy RM: Simplified amino acid alphabets for protein fold recognition and implications for folding. Protein Eng 2000, 13(3):149–152. 10.1093/protein/13.3.149
29. Prlic A, Domingues FS, Sippl MJ: Structure-derived substitution matrices for alignment of distantly related sequences. Protein Eng 2000, 13(8):545–550. 10.1093/protein/13.8.545
30. Solis AD, Rackovsky S: Optimized representations and maximal information in proteins. Proteins 2000, 38(2):149–164. 10.1002/(SICI)1097-0134(20000201)38:2<149::AID-PROT4>3.0.CO;2-#
31. Munoz E, Deem MW: Amino acid alphabet size in protein evolution experiments: better to search a small library thoroughly or a large library sparsely? Protein Eng Des Sel 2008, 21(5):311–317. 10.1093/protein/gzn007
32. Lau KF, Dill KA: A lattice statistical mechanics model of the conformational and sequence spaces of proteins. Macromolecules 1989, 22(10):3986–3997. 10.1021/ma00200a030
33. Peterson EL, Kondev J, Theriot JA, Phillips R: Reduced amino acid alphabets exhibit an improved sensitivity and selectivity in fold assignment. Bioinformatics 2009, 25(11):1356–1362. 10.1093/bioinformatics/btp164
34. Lempel A, Ziv J: On the Complexity of Finite Sequences. IEEE Trans Inf Theory 1976, 22(1):75–81. 10.1109/TIT.1976.1055501
35. Saitou N, Nei M: The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol Biol Evol 1987, 4(4):406–425.
36. Felsenstein J: PHYLIP - Phylogeny Inference Package (Version 3.2). Cladistics 1989, 5: 164–166.
37. Holmes S: Bootstrapping Phylogenetic Trees: Theory and Methods. Stat Sci 2003, 18(2):241–255. 10.1214/ss/1063994979
38. Gerlt JA, Babbitt PC: Divergent evolution of enzymatic function: mechanistically diverse superfamilies and functionally distinct suprafamilies. Annu Rev Biochem 2001, 70: 209–246. 10.1146/annurev.biochem.70.1.209
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.