Supervised multivariate analysis of sequence groups to identify specificity determining residues
© Wallace and Higgins. 2007
Received: 14 December 2006
Accepted: 23 April 2007
Published: 23 April 2007
Skip to main content
© Wallace and Higgins. 2007
Received: 14 December 2006
Accepted: 23 April 2007
Published: 23 April 2007
Proteins that evolve from a common ancestor can change functionality over time, and it is important to be able identify residues that cause this change. In this paper we show how a supervised multivariate statistical method, Between Group Analysis (BGA), can be used to identify these residues from families of proteins with different substrate specifities using multiple sequence alignments.
We demonstrate the usefulness of this method on three different test cases. Two of these test cases, the Lactate/Malate dehydrogenase family and Nucleotidyl Cyclases, consist of two functional groups. The other family, Serine Proteases consists of three groups. BGA was used to analyse and visualise these three families using two different encoding schemes for the amino acids.
This overall combination of methods in this paper is powerful and flexible while being computationally very fast and simple. BGA is especially useful because it can be used to analyse any number of functional classes. In the examples we used in this paper, we have only used 2 or 3 classes for demonstration purposes but any number can be used and visualised.
Proteins that evolve from a common ancestor can change functionality over time. For example, different related enzymes can bind to different substrates. Identifying the residues that cause these changes in specificity is important as this could allow one to alter the substrate specificity  or to predict the effect of mutations on the protein. The most common method for identifying SDP (specificity determining positions) starts with the construction of a multiple sequence alignment of the homologous proteins. Often the branching order of a phylogenetic tree exactly matches the known functional split between the proteins. Several methods have been used to identify the SDP's using only the tree and alignment [2, 3], or combined with prediction of ancestral sequences  or with analysis of structures [5–7]. The evolutionary trace (ET) method, for example, identifies SDPs responsible for subgroup specificity by partitioning the tree or dendrogram into an increasing number of subgroups of similar sequences with subsequent analysis of their three-dimensional (3D) structures 
In cases when the functional split does not correspond to a clear phylogenetic split in the tree, other methods for identifying the SDPs have to be used. This situation can arise in highly divergent families, as proteins have multiple features that co-evolve along with specificity, such as the sub-cellar location or interaction with other proteins which may give a larger phylogenetic signal than the functional difference . These methods normally require a multiple alignment, and a classification for the sequences.
Livingstone and Barton implemented a method called AMAPS that highlighted positions in an alignment that had different amino acid properties conserved between different sub-groups . A similar idea is implemented in the Conserved Property Difference Locator method  which is available as a web server . Mutual information has also been used as a measure to identify residues that confer specificity . It is used to measure of how often a particular position in a sequence is conserved in one sub-family, and varies in another. Improvements were made to this algorithm which included taking into account the non-uniformity of amino acid substitutions via amino acid substitution matrices and a method for automatically setting cut-off thresholds . The method of Hannenhalli and Russell  identifies the functional residues by comparing hidden Markov model profiles generated for each subgroup and calculating the relative entropy for each position. Positions with high relative entropy have very different amino acid distributions between the subgroups, and as such are considered possible SDPs. Pirovano et al.  showed that relative entropy exhibited undesirable asymptotic behaviour, and then reformulated it as "sequence harmony". Sequence logos  have also been used to visualise differences between two subfamilies [16, 17].
A variety of multivariate statistical approaches have been applied to sequence analysis over the years. Principal coordinates analysis  was used to plot protein sequences in a low dimensional space that preserved the distances between them . Later, principal component analysis (PCA) was used on multiple sequence alignments to identify possible functional residues . The columns in the alignment were converted into a vector of binary variables of length 20, which represented the absence/presence of an amino acid at this position. PCA analysis of this matrix was then used to project the sequences onto 2 or 3 dimensions, which allowed identification of possible sequence clusters. The residue variables were also projected onto the same space, so that group specific residues could be identified. This method was implemented as a package called SequenceSpace. More recently this method has been automated so that the user does not have to manually identify the sequence groups  and made available as a web server . Correspondence analysis (CA) has also been used to identify SDPs . CA is used to decompose a chi-squared table between residues and sequences and has the attraction of automatically plotting both in the same space. The correspondences between residues and sequences (or groups of sequences) can then be seen from their proximity on the scatter plots.
CA and PCA are usually considered to be "unsupervised" techniques, to use the terminology of the field of machine learning and classification. This means that groups of objects that are of a-priori interest are not specified in advance. The axes are derived as those that account for maximum variance among the objects when these are plotted on them. Often, this is sufficient to obtain the information that one is interested in. The groups of interest may separate out in a useful manner along the first 2 or 3 axes or they may not; it will depend on the data set. Particular splits in the data may not show up in a useful manner on the plots. The most interesting splits can be masked by stronger trends of a less interesting nature such as amino acid composition. The analysis can be "supervised" by specifying groups and forcing the analysis to focus on these. In the case of Pazos et al , the relationships between the groups of sequences and residues are determined by an analysis of CA plots. Groups of sequences are represented as mean vectors and the residues that are closest to each vector are the ones that are most related to that group of sequences.
In this paper, we demonstrate the use of Between Group Analysis (BGA) for identifying SDPs from a sequence alignment. BGA is a supervised multivariate analysis method for sample discrimination and class prediction  and has been recently used to identify genes of interest from microarray data sets . BGA is supervised by labelling each object (sequence) in advance as belonging to one of a small number of groups and forcing the axes to be those that split these groups as much as possible. Technically, the analysis is accomplished by finding those axes that maximise the between group variances. Informally, this is accomplished by treating the group centroids as the objects to be analysed and by carrying out a CA or PCA on these centroids. This produces results that are similar in appearance to those derived from multiple discriminant analysis but with the bonus of being robust when the number of variables exceeds the number of objects. BGA carried out using PCA is very similar to partial least squares discriminant analysis (PLS DA). In practice, BGA is carried out in two stages. These are automatically done using the MADE4 package . Firstly, an ordination is computed using either PCA or CA on a data set in which the different groups are defined. BGA then finds linear combinations of the axes that maximise between-group variances and minimise within-group variances. This combination is very flexible as it can be used to examine any number of pre-specified groups. It has a further advantage in that it can be combined with a variety of data types from binary to continuous because of the way in which it can use PCA or CA.
In both sets of results there is a very clear separation of the two groups of sequences. This is a useful indicator to tell if the method was able to successfully separate the two groups. The inter-group separation is much more than the intra-group separation. There are no obvious outliers in either analysis.
The positions at the end of the axis are either strongly conserved in one group but not the other, or are strongly conserved in both groups but different from each other. If each group has a different strongly conserved residue the position will show up at both ends of the axis. This is true for the top position in the AAP encoding scheme, 152, which is a fully conserved D in the LDH group and N in MDH. The top position in the Binary encoding scheme is position 14, a fully conserved M in LDH and Q in MDH. Positions that are strongly conserved in one group and not the other will show up with a residue and position on the end of the axis corresponding to the group of sequences that they are conserved in, but there won't be a corresponding variable at the other end. One example of this is alignment position 106 which is an Arginine (R) in the lactate sequences but can be either A or P in the Malate case.
The five factors for each of the amino acids as calculated by Altchely et al 
Results from SDPpred web server  for the Lactate/Malate dehydrogenase sequences ranked by Z-Score
Results from SDPpred web server  for the Nucleotidyl cyclases sequences ranked by Z-Score.
The most striking aspect of the results is that many residues are strongly associated with the elastase group. There are 7 residues plotted in the same position, furthest along axis 1. They are 130V, 137I, 145Y, 168Q, 219G, 229H, 255V, and they are all fully conserved for the elastases and not for the other two groups. The residue most associated with trypsin is D in position 218, which was also identified by Hannenhalli and Russell  as is the next residue, A in position 254. These two residues are defined as part of the binding pocket . The other residue in the pocket, G, in position 246 in the alignment is located at the negative end of axis 1, on the opposite end of the elastase group. This is because it is column 246, in this alignment, for the elastase is blank, whereas the other two groups have a G in this position. This residue highlights the critical importance of the alignment for this type of analysis, as this position didn't show up as significant in the Hannenhalli and Russell  analysis. In the alignment used by Hannenhalli and Russell  this position isn't blank for the elastases. It in fact contains 50% of G, and as a result didn't show up as significant. Position 254 is also strongly associated with the chymotrypsin group, with a C at that position. The elastases have a fully conserved N in this position too, however this doesn't show up strongly in this analysis, as there is also an N present in one of the chymotrypsins.
The inter relationship between amino acid residues and the functional properties of a protein is of great importance in understanding how that protein acts. Investigating how amino acids vary or are conserved at particular positions, depending on the function of the protein provides valuable insight into this relationship. One approach is to take a collection of sequences from homologous proteins and to represent them as a multiple alignment. This is followed by clustering the sequences into functional or phylogenetic groups and a search for residue/property correlations . Unfortunately, it is then challenging to disentangle patterns of residue occurrence that are due to functional differences between the groups and patterns that are merely due to the process of amino acid substitution over time down different phylogenetic lineages. The fact that phylogenetic and functional groupings often disagree makes the situation even more difficult.
Multivariate analysis provides a logical alternative to traditional hierarchical clustering and has been used a number of times to analyse residue/property relationships [20, 22]. This has proven to be useful for informal data exploration and visualisation. Most recently, Pazos et al  showed how to use Correspondence Analysis for this purpose, as an alternative to the more traditional PCA. They also showed how to analyse specific trends, of a priori interest, in the data. In this paper, we took this a stage further and showed how to use BGA as a general-purpose technique for protein alignment data exploration. We can combine it with CA in which case it becomes a way of carrying out a supervised CA, for any number of groups. With two groups, you get a single discriminating axis, which can be used like a discriminant function for classifying new sequences. This also ranks the variables (amino acids at each position) in terms of their discriminating power.
BGA can also be combined with PCA in which case there is a choice of how to encode the alignment information and measure amino acid occurrence in columns. Strict binary coding with CA enforces a strict view of presence or absence and is useful for emphasising clear trends in the data. It is, however, vulnerable to being misled by outlier amino acid occurrences. For example, if one sequence has one unusual amino acid in one position, the analysis will interpret this as potentially useful discriminating information. This is greatly improved by the use of pseudocounts. PCA has the advantage of allowing quantitative amino acid information to be used, such as amino acid physico-chemical properties . In both case, sequence weighting is easy to apply to BGA using the MADE4 and ADE4 packages. This overall combination is powerful and flexible while being computationally very fast and simple. Finally, BGA is especially useful because it can be used to analyse any number of functional classes. In the examples we used in this paper, we have only used 2 or 3 classes for demonstration purposes but any number can be used and visualised.
In summary the method presented here is complementary to the other methods for identifying SDP's. It produces similar results, but provides an alternative method for viewing the results. This makes the method very suitable for data exploration.
A combination of R packages from the Bioconductor package  were used to read in a sequence alignment, transform it into a data matrix, and to carry out the Between Group Analysis. They were:
A general purpose data analysis package for ecological data sets . It contains a very large variety of clustering, ordination and discriminant data analysis methods.
A package that facilitates the analysis of microarray data using the ADE4 package . It also includes graphical and visualisation tools which, for example, are used to display the results of a BGA analysis.
A package for reading different types of sequence files including Fasta and ClustalW format into R 
Two different encoding schemes were used to represent the multiple alignments for BGA.
Each column in the sequence alignment is represented by a binary vector of length 20 which represents the presence or absence of a particular amino acid at this position . This results in a data matrix of dimension N×L×20 where N is the number of sequences and L is the length of the alignment. Pseudocounts can also be included.
Each residue in the alignment is represented by a vector of five continuous variables as given by Atchley et al . They applied a multivariate statistic approach to reduce the information in 494 amino acid attributes into a set of five factors for each amino acid. These factors are shown in Table 1
Factor A is termed the polarity index. It correlates well with a large variety of descriptors including the number of hydrogen bond donors, polarity versus nonpolarity, and hydrophobicity versus hydrophilicity. Factor B is a secondary structure index. It represents the propensity of an amino acid to be in a particular type of secondary structure, such as a coil, turn or bend versus the frequency of it in an α-helix. Factor C is correlated with molecular size, volume and molecular weight. Factor D reflects the number of codons coding for an amino acid and amino acid composition. These attributes are related to various physical properties including refractivity and heat capacity. Factor E is related to the electrostatic charge.
Columns in the alignment containing more than 80% gaps were removed. When using the AAP encoding, the remaining gaps were replaced with the average value for the subfamily in that column.
PCA is the most suitable ordination technique for the AAP encoding as it can be used to ordinate data that contains continuous variables. CA, on the other hand, can only be used to ordinate data that contains positive integer values, such as the binary representation of a sequence alignment.
A sequence-weighting scheme is useful to minimize the redundancy and emphasize the diversity each of the sub-family alignments. Position-based sequence weights  were calculated for the sequences in each of the sub-families and used as weights in the BGA calculation. These weights were used, as they are intuitive, easy to calculate and have previously been shown to perform well for profile searches.
In this analysis it was very useful to add the pseudocounts to the binary encoding. This helps reduce the impact of small sample sizes if one group has very few sequences. Without pseudocounts, if a residue, at one position, is present once in one group and not present at all in a second group of sequences then the analysis will conclude this is an important residue at this position.
The BGA was carried out in R using the MADE4 package. Firstly, the alignment file was read into R using the seqinr commands. The sequence weights were then calculated, and the groups defined.
The sequences were converted into a matrix using either the binary or AAP encoding schemes. Columns with more than 80% gaps were excluded. For the AAP encoding the remaining gaps were filled in with the average of the subgroup. Pseudocounts were added to the matrix calculated with the binary encoding.
The matrix, group definition, and type of ordination technique were then passed to the BGA function in Made4. If this ordination technique is CA then the matrix is pre-processed by multiplication with the sequence weights, but if the ordination technique is PCA then the sequence weights are passed in as row weights. The results are plotted using the MADE4 function.
Three different test cases were chosen to demonstrate the method. They are Lactate/Malate dehydrogenases, Nucleotidyl cyclases and Serine Proteases. They have been used as examples by Hannenhalli and Russell  and Pazos et al  as well as others.
Lactate/Malate dehydrogenases share a similar substrate-binding domain. PFAM accession number, PF00056, "Lactate/Malate dehydrogenases, NAD binding domain". They are highly divergent and as such it is difficult to distinguish between lactate and malate subtypes. A single mutation Gln-Arg (position 102 in pdb 9ldta) is known to switch specifity from lactate to malate . This example has been used by Hannenhalli and Russell  and Pazos et al . In this study the alignment generated by Pazos et al was obtained from their website . In this alignment there are 35 malate and 8 lactate sequences. The phylogenetic tree of the Lactate/Malate dehydrogenase sequences used by Pazos et al is shown in Figure 1. There is no simple phylogenetic separation between the two groups of sequences. There is a group of malate sequences that are more closely related to the lactate sequences than the rest of the malate sequences.
Nucleotidyl cyclases are a family of cytosolic proteins that catalyse the reaction that transforms a nucleotide triphosphate into a cyclic nucleotide monophosphate. The cyclases act on either guanylate (GUC) or adenylate (ADC). The alignment used in this example is the same one used by Hannenhalli and Russell  and contains 41 adenylate and 29 guanylate sequences. The phylogenetic tree in Figure 2 was calculated from the alignment using the Neighbor-Joining method  implemented in ClustalW . The tree was rooted using the add_root programme supplied by Manolo Gouy. Two point mutations, Glu-Lys and Cys-Asp, are sufficient to change the specificity of the enzyme from GUC to ADC . These are positions 938 and 1018 of the protein sequence corresponding to the 3D structure of adenylate cyclase, 1ab8 .
Trypsin-like serine proteases are a large family of enzymes that cleave peptide bonds . They have similar catalytic mechanisms but have different preferences for the bonds that they preferentially cleave. Trypsins cleave C-terminal to the positively charged amino acid residues, arginine and lysine. Chymotrypsins cleave bond that are flanked by large aromatic residues. Elastases cleave peptide bonds that are next to small-uncharged amino acid residues. The difference in specifity is caused by changes in the binding pocket . An aspartic acid found in trypsin (Asp189) is usually replaced by a small residue in chymotrypsins (Ser) and elastases (Gly). Glycine at positions 216 and 226 in trypsin (also in chymotrypsins) is usually a valine or threonine in elastases . In this study all of the sequences with EC numbers corresponding to trypsin, chymotrypsin, and elastase were extracted from the full alignment of PF00089 from PFAM . This consisted of 117 trypsins, 17 chymotrypsins, and 7 elastases and these were realigned using ClustalW. Figure 3 gives the tree for this alignment, which was calculated using ClustalW and the Neighbor-Joining method. The elastase sequences all group together, while there is a set of chymotrypsin sequences, which are group closer to trypsin sequences than other chymotrypsins
We would like to thank Alfonso Valencia, Florencio Pazos, and Antonio Rausell for helpful conversations about correspondence analysis and identifying functional residues in general. We would also like to thank Rob Russell for the cyclase example alignment used, and for helpful conversations concerning this project. We would like to thank Guy Perrière, Ian Jeffery and Ailís Fagan for all their help with BGA and R. This work was funded by Science Foundation Ireland.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.