An interactive visualization tool to explore the biophysical properties of amino acids and their contribution to substitution matrices
© Bulka et al; licensee BioMed Central Ltd. 2006
Received: 12 December 2005
Accepted: 03 July 2006
Published: 03 July 2006
Quantitative descriptions of amino acid similarity, expressed as probabilistic models of evolutionary interchangeability, are central to many mainstream bioinformatic procedures such as sequence alignment, homology searching, and protein structural prediction. Here we present a web-based, user-friendly analysis tool that allows any researcher to quickly and easily visualize relationships between these bioinformatic metrics and to explore their relationships to underlying indices of amino acid molecular descriptors.
We demonstrate the three fundamental types of question that our software can address by taking as a specific example the connections between 49 measures of amino acid biophysical properties (e.g., size, charge and hydrophobicity), a generalized model of amino acid substitution (as represented by the PAM74-100 matrix), and the mutational distance that separates amino acids within the standard genetic code (i.e., the number of point mutations required for interconversion during protein evolution). We show that our software allows a user to recapture the insights from several key publications on these topics in just a few minutes.
Our software facilitates rapid, interactive exploration of three interconnected topics: (i) the multidimensional molecular descriptors of the twenty proteinaceous amino acids, (ii) the correlation of these biophysical measurements with observed patterns of amino acid substitution, and (iii) the causal basis for differences between any two observed patterns of amino acid substitution. This software acts as an intuitive bioinformatic exploration tool that can guide more comprehensive statistical analyses relating to a diverse array of specific research questions.
Molecular biology has made great progress in observing and quantifying the patterns by which amino acids exchange for one another within protein sequences over time. A key motivation here has been to create amino acid substitution matrices (such as the PAM and BLOSUM matrix families), which lie at the heart of mainstream bioinformatics procedures, from algorithms that determine whether and how exactly two proteins are homologous, to those that predict protein tertiary structure by comparison with known folds. However, these matrices represent generalized patterns of change "averaged" across all proteins: although they typically encompass the idea that patterns of substitution will vary with evolutionary distance, other systematic sources of variation are overlooked. A growing literature supports the idea that this generalization may compromise the sensitivity of sequence comparison for various specialized subsets of proteins (e.g., for particular protein families [4–8], or for genomes that have evolved under unusual mutation biases or selection regimes [9–11]). Thus a worthy challenge is to seek the underlying ontology that can link individually derived, specialized models of amino acid substitution into a common framework: if we can ultimately replace generalized patterns of observed change with a flexible, quantitative model of amino acid substitution, then this offers significant potential to increase the sophistication of standard bioinformatics procedures. Such research may in fact be viewed as a subset of current efforts to find a general, chemical ontology for bioactivity (e.g., [12–14]) where researchers face the same challenge of unifying diverse observations into a model that predicts molecular interactions from first principles.
In this context, it has long been understood that amino acid substitution matrices reflect a combination of chemical and evolutionary factors: most intuitively the biophysical properties (known within chemical disciplines as "molecular descriptors") of the amino acids [15, 16] and the mutational distance of their encodings within the genetic code [5, 17, 18]. However, establishing accurate, quantitative connections between the outcomes of molecular evolution and amino acids' molecular descriptors remains a complex issue under active research (e.g., [19–21]).
To address this issue, Nakai et al. created an innovative database, the AAindex, comprising both amino acid substitution matrices (20 × 20 matrices in which each element reflects some measure of the exchangeability of a pair of amino acids) and amino acid indices (vectors of 20 elements, each element being a value that describes some physicochemical property, such as size or hydrophobicity, for one of the twenty amino acids encoded by the standard genetic code). In a later publication that expanded this database, Tomii and Kanehisa suggested procedures for correlating any amino acid molecular descriptor with an observed exchange rate (e.g., substitution matrix) and for clustering indices together by similarity.
This latter technique of index clustering is especially useful when exploring the relationship between indices, given that properties of widespread interest have often been measured in many different ways by different researchers. (For example, the latest version of the AAindex database contains 29 different measurements of a property that contains the term "hydrophobicity" in its description.) Moreover, this comparison allows easy visualization of non-intuitive correlations (e.g., hydrophobicity and volume). The authors applied similarity-based methods to their AAindex database to build a minimum spanning tree: a graph-theoretic structure that connects discrete elements together based on similarity, by minimizing the overall sum of the distances of the direct connections. The result is a data structure in which elements are grouped together based on similarity (a detailed description and justification is given in the work of Tomii and Kanehisa, who first applied this methodology to visualizing amino acid similarity). This minimum spanning tree showed the underlying structure (clustering) for the 402 indices of their database. Since this time, numerous further indices and matrices have been developed: some have been incorporated into updates of the AAindex, while others remain isolated in the scientific literature (e.g., [10, 25]).
Building on this work, we have developed free, user-friendly, publicly available web-based software that enables researchers to repeat and extend the ideas of Nakai et al. and of Tomii and Kanehisa using interactive data visualization. We thus present the Amino Acid Explorer, a web tool that facilitates quantitative exploration of similarity between physicochemical properties of amino acids and their evolutionary dynamics. Our tool allows users to explore the similarity between any of the 83 matrices and any subset of the 494 indices housed by AAindex version 6.0, and to include any custom index or matrix (e.g., from recent scientific literature or from unpublished research, such as a matrix derived from an alignment of proteins in a particular functional class, or an index derived by combining several physicochemical properties). We have embedded this analysis tool within a comprehensive web context: both a moderated user forum (http://www.evolvingcode.net/forum/viewforum.php?f=24) in which to discuss problems, findings or questions, and an open wiki (http://www.evolvingcode.net/index.php?page=Amino_Acid_Indices) in which the community of those researching the interface of biochemistry and protein evolution may contribute their knowledge.
User interface and visualization
The user interface of our tool is a Java applet that runs in a user's browser. It allows the user to (i) select any subset of the AAindex indices (or custom indices) to be clustered using the minimum spanning tree method, (ii) choose an appropriate distance calculation method (to be used during the spanning tree computation), and (iii) choose a matrix or matrices to compare with the indices of a spanning tree.
Specifically, having built a spanning tree, the application can compute distances between all the indices in this tree and a user-defined matrix; it displays these distances by shading the elements of the spanning tree with a color-coded scale. Additionally, it can use a second color-coded scale to display which of two user-defined matrices each index of the spanning tree is closest to (in other words, what makes these two matrices different from one another in terms of the indices under consideration?).
Drawing the spanning tree
Graph drawing and visualization are currently open research topics in computer science. Although an agreed method exists for creating the graph (calculating a spanning tree), finding an optimal spatial positioning for nodes and drawing edges in a readable way (e.g., grouping nodes that are directly connected together, while minimizing crossed edges) remain active areas of research. A large number of different software packages implement a variety of state-of-the-art graph drawing methods, which differ significantly in speed, quality of the drawing, and interactivity (i.e., allowing the user to influence the final shape of the graph being drawn). Our visualization tool uses a slightly modified form of the open-source package TouchGraph to render the minimum spanning tree that was computed server-side. (Modifications to the original TouchGraph code are limited to changes that redefine the default parameters for flexibility of the edges, and minor modifications required to integrate the code into our applet.) A full description of TouchGraph can be found at its web site; in essence, it uses an iterative "force-based layout" algorithm (in which each node projects a force that repels other nodes, while edges act like springs that can be compressed or stretched) to move, through a series of incremental improvements, from a random graph layout to an optimal representation. The whole incremental process is visible, and the user can intervene at any point by dragging nodes to locations that seem better suited. In our application, this is most likely to be useful when users request a spanning tree for a large set of amino acid indices, under which conditions the force-based layout may become stuck at a local optimum, visible to the user as a representation in which one or a few key edges cross one another.
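The force-based idea described above can be illustrated with a minimal sketch. This is a generic spring-embedder written for illustration only; it is not TouchGraph's code, and the function name `force_layout`, the force laws (inverse-square repulsion, linear springs) and all parameter values are assumptions. Fixed starting positions are used here instead of a random layout so that the result is reproducible.

```python
import math

def force_layout(positions, edges, steps=500, step_size=0.05,
                 repulsion=1.0, stiffness=1.0, rest_length=1.0):
    """Iteratively move nodes under pairwise repulsion and spring-like edges.

    positions: dict mapping node name -> (x, y) starting coordinates
    edges:     list of (node, node) pairs that are joined by a spring
    All force constants are illustrative assumptions, not TouchGraph defaults.
    """
    pos = {name: list(xy) for name, xy in positions.items()}
    nodes = list(pos)
    for _ in range(steps):
        force = {name: [0.0, 0.0] for name in nodes}
        # Every pair of nodes repels with a force proportional to 1 / d^2.
        for i, a in enumerate(nodes):
            for b in nodes[i + 1:]:
                dx = pos[a][0] - pos[b][0]
                dy = pos[a][1] - pos[b][1]
                d = math.hypot(dx, dy) or 1e-9  # guard against division by zero
                f = repulsion / (d * d)
                force[a][0] += f * dx / d
                force[a][1] += f * dy / d
                force[b][0] -= f * dx / d
                force[b][1] -= f * dy / d
        # Each edge acts as a spring: stretched edges pull, compressed edges push.
        for a, b in edges:
            dx = pos[b][0] - pos[a][0]
            dy = pos[b][1] - pos[a][1]
            d = math.hypot(dx, dy) or 1e-9
            f = stiffness * (d - rest_length)
            force[a][0] += f * dx / d
            force[a][1] += f * dy / d
            force[b][0] -= f * dx / d
            force[b][1] -= f * dy / d
        # Take a small step along the net force (one incremental improvement).
        for name in nodes:
            pos[name][0] += step_size * force[name][0]
            pos[name][1] += step_size * force[name][1]
    return {name: tuple(xy) for name, xy in pos.items()}
```

For a three-node path a-b-c, the two end nodes are pushed apart by repulsion while the springs keep each of them near the middle node, so the path straightens out. With many nodes, a procedure of this kind can settle into a local optimum, which is the situation in which manual dragging of nodes helps.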
Visualizing distances between a matrix and a set of indices
Our application represents the distances between matrices and indices in two modes. In the first mode, each node in the spanning tree (representing a single amino acid index) is color-coded to represent its measured similarity to a single, user-defined reference matrix. The color scale runs from blue (most distant) to red (most similar). Distances are measured as described below. The second mode (differential mode) shows how two substitution matrices differ in terms of the amino acid indices of a spanning tree. This mode uses a color-coded scale to denote which of two matrices is closest to each node (index). In the figures shown here, the color scale is green (matrix 1) to brown (matrix 2) so as to avoid any confusion with Mode #1 described above. The degree of color saturation denotes the magnitude of the difference (i.e., strong colors indicate that the two matrices are very different in terms of this index).
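As a rough illustration of the two display modes, the sketch below maps similarity values onto RGB colors by linear interpolation. The exact color mapping used by the applet is not specified in this paper, so the RGB constants, the neutral grey midpoint, and the function names `similarity_color` and `differential_color` are all assumptions made for the example.

```python
# Illustrative RGB constants (assumptions, not the applet's actual palette).
BLUE, RED = (0, 0, 255), (255, 0, 0)
GREEN, BROWN, GREY = (0, 128, 0), (139, 69, 19), (128, 128, 128)

def lerp_color(start, end, t):
    """Linearly interpolate between two RGB colors; t is clamped to [0, 1]."""
    t = max(0.0, min(1.0, t))
    return tuple(round(a + (b - a) * t) for a, b in zip(start, end))

def similarity_color(similarity, lowest, highest):
    """Mode 1: blue for the most distant index, red for the most similar."""
    if highest == lowest:
        return lerp_color(BLUE, RED, 0.5)
    return lerp_color(BLUE, RED, (similarity - lowest) / (highest - lowest))

def differential_color(similarity_1, similarity_2, max_difference):
    """Mode 2: hue says which matrix is closer, saturation says by how much."""
    difference = similarity_1 - similarity_2
    if max_difference == 0:
        strength = 0.0
    else:
        strength = min(1.0, abs(difference) / max_difference)
    target = GREEN if difference > 0 else BROWN
    # Start from neutral grey so small differences appear as weak colors.
    return lerp_color(GREY, target, strength)
```

In this sketch an index that is equally similar to both matrices is drawn in grey, and the color strengthens toward green or brown as the difference grows, which mirrors the saturation rule described above.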
All significant computation for this tool occurs on the server side, because it often involves most or all of the data stored in the database (thus transfer to a client-side applet could take a prohibitively long time for users with low-bandwidth connections).
Computation of a minimum spanning tree
The software calculates a minimum spanning tree using Prim's algorithm, as described by Cormen et al. Since this algorithm minimizes the total sum of distances between directly connected indices, the definition of distance here is of prime importance. Tomii and Kanehisa used a statistical correlation measure between two indices (each is a vector of 20 numbers representing an amino acid property). Our software allows users to employ this metric, but also to explore another notion of distance, namely Euclidean distance (calculating distance between two indices as distance between two points in 20-dimensional space). This approach is often taken to compare normalized vectors in multidimensional spaces. More generally, our software allows users to restrict the set of amino acids that are taken into account when calculating distance (e.g., it is possible to consider only hydrophobic amino acids, or only those encoded by GC-rich codons), whichever metric of distance is being used.
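A minimal sketch of this computation, assuming each index is held as a plain list of numbers, is given below. The function names (`prim_mst`, `euclidean`, `restrict`) are hypothetical and the vectors in the usage note are toy values rather than AAindex data; only the Euclidean metric is shown, because the exact way a correlation coefficient is turned into a distance is not restated here.

```python
import math

def euclidean(u, v):
    """Distance between two indices viewed as points in n-dimensional space."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def restrict(vectors, positions):
    """Keep only the chosen amino acid positions of every index vector."""
    return {name: [vec[i] for i in positions] for name, vec in vectors.items()}

def prim_mst(vectors, distance=euclidean):
    """Build a minimum spanning tree with Prim's algorithm.

    vectors: dict mapping index name -> list of numbers
    Returns a list of (parent, child, distance) edges in the order added.
    """
    names = list(vectors)
    if not names:
        return []
    root = names[0]
    # For each node outside the tree, remember its cheapest link into the tree.
    best = {n: (distance(vectors[root], vectors[n]), root) for n in names[1:]}
    edges = []
    while best:
        child = min(best, key=lambda n: best[n][0])
        dist, parent = best.pop(child)
        edges.append((parent, child, dist))
        # The newly added node may offer a cheaper link to the remaining nodes.
        for other in best:
            candidate = distance(vectors[child], vectors[other])
            if candidate < best[other][0]:
                best[other] = (candidate, child)
    return edges
```

For example, `prim_mst(restrict(vectors, [0, 3, 7]))` would build the tree using only three chosen amino acid positions, which corresponds to the subset restriction described above.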
Computation of distance between a matrix and a set of indices
In order to compute the distance between a matrix and a set of indices, our software uses the correlation method described by Tomii and Kanehisa. This method first converts each index (a vector of 20 values, one for each amino acid) into a matrix by calculating the simple arithmetic distance between each pair of amino acids, as defined by the index. It then calculates the correlation coefficient between this derived matrix and the substitution matrix. While the Euclidean distance method may be used to build a minimum spanning tree of indices, which have been normalized to facilitate direct comparison, it is inappropriate for matrix/index comparisons because matrix values have not been normalized: matrix elements may extend beyond the interval from 0 to 1, and thus the Euclidean distance between any one element of an index and elements of a matrix would be misleading. Linear normalization of matrix elements would itself be inappropriate, since many matrices, such as the PAM series, comprise values that are expressed in logarithmic units. Therefore, our software always uses the Tomii and Kanehisa method of simple correlation to compare a matrix with an index. If the user has selected only a subset of the 20 amino acids for tree building, then calculations of distance between a matrix and the indices of a spanning tree consider only the appropriate subset of matrix elements.
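The two steps of this comparison can be sketched as follows. This is an illustrative reading of the method, not the tool's source code: it assumes a symmetric matrix stored as a dictionary keyed by alphabetically ordered residue pairs, uses only off-diagonal pairs, and takes the absolute difference as the "simple arithmetic distance". The function names are hypothetical.

```python
from itertools import combinations

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (var_x ** 0.5 * var_y ** 0.5)

def matrix_index_correlation(matrix, index, residues):
    """Correlate a substitution matrix with the pairwise differences of an index.

    matrix:   dict mapping (a, b) with a < b -> substitution score
    index:    dict mapping residue -> property value
    residues: the amino acids to include (all 20, or a user-chosen subset)
    """
    pairs = list(combinations(sorted(residues), 2))
    # Step 1: turn the index into pairwise distances between amino acids.
    differences = [abs(index[a] - index[b]) for a, b in pairs]
    # Step 2: correlate those distances with the matching matrix elements.
    scores = [matrix[(a, b)] for a, b in pairs]
    return pearson(differences, scores)
```

Because substitution scores measure similarity while the index differences measure dissimilarity, a property that strongly shapes a matrix would be expected to give a strongly negative coefficient in this sketch. Passing a shorter `residues` list restricts the comparison to the corresponding subset of matrix elements.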
UMBC AAIndex database
We created the UMBC version of the AAIndex database as a local version of the original AAindex data (created by GenomeNet Japan) to facilitate the manipulations required by our interactive software. Specifically, our local implementation converted all data of the original AAindex to XML format, generated interfaces that enable precise local and remote access to all aspects of the database, and normalized all amino acid index data.
XML is a standardized language that is designed to simplify sharing of information among independently created systems. In particular, it is easily readable by machines (there are many code libraries that allow access to XML data by programs written in almost any programming language), and thus facilitates conversions to other languages, both to formats that are intended to be read by humans (e.g., web pages or PDF files) and to other computer formats. Our UMBC AAIndex database allows direct user access via internet either in "raw" form (plain XML data) or transformed to a web page that is designed to be easily read by a human. In the former capacity, our implementation of this database has been designed for simple access by either programs residing on our server, or by simple HTTP requests from remote machines. When bandwidth for data transfer is an issue for some third-party users, our architecture also allows deployment of programs directly at the server for a more direct access. Both of these latter points reflect our aim to facilitate other researchers who would like to expand and improve the functionality we offer for the AAindex data.
The indices in the database have been normalized by linearly scaling all the values of each index from 0 (the smallest value of the original index) to 1 (the greatest value of the original index). This simplifies and makes more intuitive the comparison of values between different indices, which may originally have had values expressed using different units. (Note that this normalization does not influence the results obtained by the correlation coefficient method used by Tomii and Kanehisa, which may be reproduced exactly by our software in a matter of seconds.)
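The normalization, and the reason it leaves correlation-based results unchanged, can be checked with a short sketch. The function names are hypothetical and the numbers in the usage note are arbitrary toy values, not entries from the AAindex.

```python
def normalize_index(values):
    """Linearly rescale values so the smallest maps to 0 and the greatest to 1."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        raise ValueError("a constant index cannot be rescaled")
    return [(v - lowest) / (highest - lowest) for v in values]

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x ** 0.5 * var_y ** 0.5)
```

For instance, with `raw_x = [10.0, 20.0, 15.0, 40.0]` and `raw_y = [3.0, 1.0, 2.5, 0.5]`, `pearson(raw_x, raw_y)` and `pearson(normalize_index(raw_x), normalize_index(raw_y))` agree up to floating-point rounding. This follows because the rescaling is a linear transformation with a positive slope, and the correlation coefficient is invariant under such transformations.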
Here we present three simple, example analyses to illustrate the types of exploration that our software allows. Each illustrates a conceptually different question that the tool reduces to a simple "point and click" exercise. We have chosen to focus on the relationship between biophysical properties of amino acids, patterns of molecular evolution, and the structure of the standard genetic code. However, it would be trivial to find an equivalent set of example analyses that focused on protein folding or homology searching. Indeed, our visualization software can be used to investigate any area of bioinformatics that builds on understanding how amino acids' molecular descriptors influence the patterns by which amino acids substitute for one another during evolution.
This same feature of the AAIndex Explorer tool could equally well be used to quickly visualize which properties (and which amino acids) are responsible for the difference between any two substitution matrices (e.g., between a "generalized" or global model of amino acid substitution, as found in a PAM or BLOSUM matrix, and any observed pattern of interchange within a specific protein family or phyletic lineage).
In this paper, we present software that facilitates rapid, interactive exploration of data pertaining to three interconnected topics: (i) the multidimensional molecular descriptors of biochemical properties for the twenty proteinaceous amino acids, (ii) the correlation of these biophysical measurements with observed patterns of amino acid substitution (i.e., substitution matrices), and (iii) the causal, biochemical basis for differences between any two observed patterns of amino acid substitution. This software acts as an intuitive bioinformatic exploration tool that can guide more comprehensive statistical analyses relating to a diverse array of specific research questions.
Availability and requirements
Project name: Amino Acid Explorer
Project home page: http://www.evolvingcode.net:8080/aaindex/tools/
Operating system(s): Platform independent
Programming language: Java
Use via EvolvingCode's website
○ Web browser (tested with Internet Explorer, Netscape and Mozilla under Windows and Linux, Safari under Mac OS X 10.3.9)
○ Java 1.4.2 plug-in for the web browser (or higher version)
Full installation on an independent server
○ Java 1.4.2 plug-in for the web browser (or higher version) on the client side
○ JDK 1.4.2 environment on the server
○ XML Database compliant with XML:DB API (tested with eXist database)
○ Servlet Web Container matching Servlet API 2.4 specifications (tested with Tomcat 5.0.28)
○ Xalan XSLT processor
License: Apache-style open source license
Any restrictions to use by non-academics: None
The authors would like to thank the members of their research groups (Freeland Lab and MAPLE Lab) for their comments and support. This work was funded in part by NSF grant DBI-0317349-001. The tool described here contains software developed by TouchGraph LLC (http://www.touchgraph.com/).
- Henikoff S, Henikoff JG: Performance evaluation of amino acid substitution matrices. Proteins 1993, 17: 49–61.
- Jeanmougin F, Thompson JD, Gouy M, Higgins DG, Gibson TJ: Multiple sequence alignment with Clustal X. Trends Biochem Sci 1998, 23: 403–405.
- Tress M, Ezkurdia I, Grana O, Lopez G, Valencia A: Assessment of predictions submitted for the CASP6 comparative modelling category. Proteins 2005, in press.
- Vilim RB, Cunningham RM, Lu B, Kheradpour P, Stevens FJ: Fold-specific substitution matrices for protein classification. Bioinformatics 2004, 20: 847–853.
- Teodorescu O, Galor T, Pillardy J, Elber R: Enriching the sequence substitution matrix by structural information. Proteins 2004, 54: 41–48.
- Bastien O, Roy S, Marechal E: Construction of non-symmetric substitution matrices derived from proteomes with biased amino acid distributions. C R Biol 2005, 328: 445–453.
- Jones DT, Taylor WR, Thornton JM: A mutation data matrix for transmembrane proteins. FEBS Letters 1994, 339: 269–275.
- Sutormin RA, Rakhmaninova AB, Gelfand MS: BATMAS30: amino acid substitution matrix for alignment of bacterial transporters. Proteins 2003, 51: 85–95.
- Pacholczyk M, Kimmel M: Analysis of differences in amino acid substitution patterns, using multilevel G-tests. C R Biol 2005, 328: 632–641.
- Yu YK, Altschul SF: The construction of amino acid substitution matrices for the comparison of proteins with non-standard compositions. Bioinformatics 2005, 21: 902–911.
- Adachi J, Hasegawa M: Model of amino acid substitution in proteins encoded by mitochondrial DNA. J Mol Evol 1996, 42: 459–468.
- Feldman HJ, Dumontier M, Ling S, Haider N, Hogue CW: CO: A chemical ontology for identification of functional groups and semantic comparison of small molecules. FEBS Letters 2005, 579: 4685–4691.
- Giaever G, Flaherty P, Kumm J, Proctor M, Nislow C, Jaramillo DF, Chu AM, Jordan MI, Arkin AP, Davis RW: Chemogenomic profiling: Identifying the functional interactions of small molecules in yeast. PNAS 2004, 101: 793–798.
- di Bernardo D, Thompson MJ, Gardner TS, Chobot SE, Eastwood EL, Wojtovich AP, Elliott SJ, Schaus SE, Collins JJ: Chemogenomic profiling on a genome-wide scale using reverse-engineered gene networks. Nature Biotechnology 2005, 23: 377–383.
- Grantham R: Amino acid difference formula to help explain protein evolution. Science 1974, 185: 862–864.
- Benner SA, Cohen MA, Gonnet GH: Amino acid substitution during functionally constrained divergent evolution of protein sequences. Protein Eng 1994, 11: 1323–1332.
- Fitch WM: An improved method of testing for evolutionary homology. J Mol Biol 1966, 16: 9–16.
- Schneider A, Cannarozzi GM, Gonnet GH: Empirical codon substitution matrix. BMC Bioinformatics 2005, 6: 134.
- Fujitsuka Y, Chikenji G, Takada S: SimFold energy function for de novo protein structure prediction: Consensus with Rosetta. Proteins 2005, in press.
- Yampolsky LY, Stoltzfus A: The exchangeability of amino acids in proteins. Genetics 2005, 170: 1459–1472.
- Dosztanyi Z, Torda AE: Amino acid similarity matrices based on force fields. Bioinformatics 2001, 17: 686–699.
- Nakai K, Kidera A, Kanehisa M: Cluster analysis of amino acid indices for prediction of protein structure and function. Protein Eng 1988, 2: 93–100.
- Tomii K, Kanehisa M: Analysis of amino acid indices and mutation matrices for sequence comparison and structure prediction of proteins. Protein Eng 1996, 9: 27–36.
- Kawashima S, Kanehisa M: AAindex: amino acid index database. Nucleic Acids Res 2000, 28: 374.
- Gilis D, Massar S, Cerf NJ, Rooman M: Optimality of the genetic code with respect to protein stability and amino-acid frequencies. Genome Biol 2001, 2: RESEARCH0049.
- Tollis IG, Tamassia R, Eades P, Di Battista G: Graph Drawing: Algorithms for the Visualization of Graphs. Pearson Education; 1998.
- TouchGraph Website [http://www.touchgraph.com]
- Cormen TH, Leiserson CE, Rivest RL, Stein C: Introduction to Algorithms. Second edition. Cambridge, MA, London: The MIT Press; Boston, MA, Burr Ridge, IL, Dubuque, IA, Madison, WI, New York, NY, San Francisco, CA, St. Louis, MO, Montreal, Toronto: McGraw-Hill Book Company; 2001.
- Mitchell TM: Machine Learning. McGraw-Hill Companies; 1997.
- AAindex Website [http://www.genome.ad.jp/dbget/aaindex.html]
- Woese CR: Evolution of the genetic code. Naturwissenschaften 1973, 60: 447–459.
- Haig D, Hurst LD: A quantitative measure of error minimisation within the genetic code. J Mol Evol 1991, 33: 412–417.
- Freeland SJ, Hurst LD: The genetic code is one in a million. J Mol Evol 1998, 47: 238–248.
- Goodarzi H, Shateri Najafabadi H, Torabi N: On the coevolution of genes and genetic code. Gene 2005, 362: 133–140.
- Freeland SJ, Wu T, Keulmann N: The case for an Error Minimizing Standard Genetic Code. Orig Life Evol Biosph 2003, 33: 457–477.
- Woese CR, Dugre DH, Saxinger WC, Dugre SA: On the fundamental nature and evolution of the genetic code. Cold Spring Harb Symp Quant Biol 1966, 31: 723–736.
- Kyte J, Doolittle RF: A simple measure for displaying the hydropathic character of a protein. J Mol Biol 1982, 157: 105–132.
- Di Giulio M: The origin of the genetic code cannot be studied using measurements based on the PAM matrix because this matrix reflects the code itself, making any such analyses tautologous. J Theor Biol 2001, 208: 141–144.
- Szathmary E, Zintzaras E: A statistical test of hypotheses on the organization and origin of the genetic code. J Mol Evol 1992, 35: 185–189.
- Haig D, Hurst LD: A quantitative measure of error minimization in the genetic code. J Mol Evol 1999, 49: 708.
- Ardell DH: On error minimization in a sequential origin of the standard genetic code. J Mol Evol 1998, 47: 1–13.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.