# A model-independent approach to infer hierarchical codon substitution dynamics

- Olof Görnerup
^{1}Email author and - Martin Nilsson Jacobi
^{1}

**11**:201

**DOI: **10.1186/1471-2105-11-201

© Görnerup and Jacobi; licensee BioMed Central Ltd. 2010

**Received: **20 October 2009

**Accepted: **23 April 2010

**Published: **23 April 2010

## Abstract

### Background

Codon substitution constitutes a fundamental process in molecular biology that has been studied extensively. However, prior studies rely on various assumptions, e.g. regarding the relevance of specific biochemical properties, or on conservation criteria for defining substitution groups. Ideally, one would instead like to analyze the substitution process in terms of raw dynamics, independently of underlying system specifics. In this paper we propose a method for doing this by identifying groups of codons and amino acids such that these groups imply closed dynamics. The approach relies on recently developed spectral and agglomerative techniques for identifying hierarchical organization in dynamical systems.

### Results

We have applied the techniques on an empirically derived Markov model of the codon substitution process that is provided in the literature. Without system specific knowledge of the substitution process, the techniques manage to "blindly" identify multiple levels of dynamics; from amino acid substitutions (via the standard genetic code) to higher order dynamics on the level of amino acid groups. We hypothesize that the acquired groups reflect earlier versions of the genetic code.

### Conclusions

The results demonstrate the applicability of the techniques. Due to their generality, we believe that they can be used to coarse grain and identify hierarchical organization in a broad range of other biological systems and processes, such as protein interaction networks, genetic regulatory networks and food webs.

## Background

Ever since its discovery by Nirenberg and Matthaei [1], the structure [2–8] and evolution [9–18] of the genetic code from nucleotide triplets in DNA to amino acid residues in proteins has been studied extensively. In structure-based studies--in terms of a snapshot of what codes to what--similar codons have for instance been found to be associated with amino acids with similar properties [19] and amino acids with simple structures are typically coded by more codons [6]. However, amino acids may be grouped with respect to several different properties, and it is difficult to quantitatively judge the relative and actual relevance of these properties. By studying the effective evolutionary dynamics of codons and amino acids one avoids this problem. In dynamic-based approaches the substitution process is often modeled as a Markov chain, where the distribution of substitutions of a given residue is independent of neighboring residues as well as prior residues at the same site. These assumptions are clearly not strictly correct, but are still meaningful as approximations. Dayhoff and coworkers pioneered Markov modeling by estimating substitution frequencies empirically from alignments of orthologous sequences [20]. From inspection of log odds scores they concluded that amino acids with similar properties indeed tend to form groups that are conserved. In other words, members of a group usually substitute to each other, rather than to external residues. In subsequent work [21–24], conservation has been turned into a criteria used for defining and inferring amino acid substitution groups. An interpretation of these results is that the substitution process hierarchically operates on multiple levels, from nucleotides to codons to groups of codons [4]. However, one relies on strong assumptions when aiming to infer hierarchical levels in terms of certain biochemical properties or explicit conservation criteria. Ideally, one would instead like to "blindly" infer levels purely from the observed dynamics. In this paper we present such an approach, which is based on recently developed methods for identifying hierarchical levels in dynamical systems [25]. The methods are derived from first principles, and only rely on the assumption that the dynamic process can be described as a Markov chain; there are no assumptions regarding for example amino acid conservation or group isolation. In fact, the techniques presented here are not limited to the substitution process, but may also be applied to the broad range of biological systems that can be represented by networks or transition matrices. In this presentation, however, we will concentrate on the substitution process by applying the techniques on an empirically derived codon transition matrix provided by Schneider et al. [26].

In the next section we will present our methodology. The underlying theory is only introduced briefly here, and will have a focus on Markov chains. A more thorough presentation (including proofs) that covers a broader class of systems can be found in Ref. [25]. We will then report on the results when applying our techniques on the codon transition matrix. After discussing the acquired results and their possible relation to the evolution of the genetic code, we conclude the paper with a few closing remarks about the methods relation to biological modeling in general, and possible future directions.

## Methods

*lumping*in the literature, and the Markov chain is said to be

*lumpable*[28]. See Figure 1 for a schematic illustration of Markov chain lumping in the context of the codon substitution process.

*s*

_{1},

*s*

_{2}, ...,

*s*

_{ n }} be the state space of an aggregated process,

*P*

_{ i }a stochastic variable of the preceding state of

*s*

_{ i }, and

*F*

_{ i }a stochastic variable of the subsequent state of

*s*

_{ i }(here we only consider one step--in general the past and future may be of arbitrary length). The mutual information between past and future states, given a current state

*s*

_{ i }is

*P*

_{ i }. H(

*F*

_{ i }) and H(

*P*

_{ i },

*F*

_{ i }) of the joint distribution of

*P*

_{ i }and

*F*

_{ i }are defined analogously. Then

*s*

_{ i }) is probability that the system is in state

*s*

_{ i }. The criterion can be used to test whether or not a given partition defines a lumping, but it is typically not useful for constructing the partitions that define lumpings. Since the number of possible ways to partition a state space of

*N*states is astronomical even for relatively small

*N*it is not feasible to evaluate all partitions. Instead, we employ two novel techniques for identifying aggregations that enables one to analyze systems with a large number of states (on the order of 10

^{3}, or 10

^{4}if the transition matrix is sparse). The first technique is based on the following observation (see [25, 29] for further details): Consider

*n*eigenvectors of a transition matrix. These will define

*N*points in an

*n*-dimensional space, where each point is associated with a state in the Markov chain.

*If the N points form n clusters, these clusters define an aggregation, where aggregates of states are given by corresponding points within clusters*. The task of finding aggregations is then reduced to the problem of finding

*n*eigenvectors that respect the same

*n*clusters of eigenvector elements. We will illustrate this with an example. Consider a Markov chain whose dynamics over some state space {

*a, b, c, d*} is given by a transition matrix

*p*and

*q*are probabilities. We can determine if the states can be aggregated by examining

*P*'s eigenvectors. These are given by the columns in

where *r* = *p*(*q* - 1)/*q* and
. We see that there are two clusters in the second eigenvector (with values 1 and -1). Since the first eigenvector respects the same clusters (trivially so since the first eigenvector forms a single cluster), the first and the second eigenvectors define an aggregation, namely {{*a, c*}, {*b, d*}}. There are also two trivial aggregations: {{*a, b, c, d*}} (due to the first eigenvector alone) and {{*a*}, {*b*}, {*c*}, {*d*}} (due to all eigenvectors, assuming they form a complete base). Due to the conservation of probability in a Markov process, the trivial aggregations where all states are in the same aggregate always exist.

Identifying *n* eigenvector constitutes a constraint satisfaction (SAT) problem. We have implemented a backtracking algorithm that in the typical case identifies aggregates in polynomial time. It is beyond the scope of this paper to describe the algorithm here. Instead we refer to [30], where we specify the algorithm in detail and provide pseudocode.

That is, initially each state is in a separate partition element, and the state space is then successively aggregated by joining the pair of aggregates that result in the lowest mutual information. The agglomeration method gives good results on the first levels in the aggregation hierarchy, but becomes less accurate at coarser levels. Since the spectral method works best in the latter case, the techniques complement each other.

We have applied the two techniques on a Markov chain of codon substitutions, whose transition probabilities have been empirically derived by Schneider et al. [26]. The codon substitution frequencies have been estimated from 17,502 pairwise alignments of orthologous sequences from human, mouse, chicken, frog and zebrafish. For this purpose, they aligned 8.3 million codons, counted the substitutions between codons, and derived the substitution probability matrix from the resulting counts.

## Results

*P*-- resulting in 61 points in a 21 dimensional space--21 distinct clusters are acquired. Since the number of eigenvectors equals the number of clusters, these define a valid aggregation. As exemplified in Figure 2(b) the clusters show as level sets in the individual eigenvectors. The aggregation constitutes the standard genetic code as each cluster constitutes codons that are associated with the same amino acid, with the exception of the codons of serine, which are divided into two clusters ({TCT, TCC, TCA, TCG} and {AGT, AGC}). This unique separation is due to that serine is the only amino acid whose codons are not connected with single point mutations (i.e. some codons are separated by a Hamming distance larger than one on a hypercube).

*A, T*}, {

*I, M, V*}, {

*E, D*} and {

*K, R*}, cf. Figure 4. If we go back to the spectral view, we see that the same aggregation is indicated by three of the eigenvectors in the transition matrix, Figure 5. This exemplifies that one may also identify aggregates by searching for clusters or clear separations of eigenvector elements and then test if these constitute valid aggregates by using Eq. 3. For instance, one known grouping is to separate purine-centred and pyrimidine-centred codons [31, 32]. Inspecting Figure 2(b) and Figure 5, we can see that the two groups indeed are separated if we exclude the rare amino acids

*C*,

*Y*and

*W*, although not forming two distinct meta-clusters. However, a purine-pyrimidine separation is not present in the dendrogram in Figure 3.

## Discussion

We will now compare our results with amino acid groupings that previously have been discussed in the literature. Firstly, Jiménez-Montaño and He have used the same matrix that we employed here to perform hierarchical clustering of codons based on an Euclidian distance measure [33] (supplementary material). In their case *W* also forms its own aggregate and *S* is split up. Other similarities are the grouping of {*E, D*} and {*A, T*}. However, *A* and *T* are also grouped with *P*, *R* and *S*, and so their groupings do not respect the purine-pyrimidine separation. This is also the case in Figure 3, but not on the first aggregation levels. Another difference is that the genetic code is not as distinct in their case. The codons of *F*, for instance, are more separated than some of the amino acid aggregates (e.g. *A* and *T*).

Kosiol et al. [34] have estimated a different empirical codon model than the one used here and perform an aggregation with the Almost Invariant Sets (AIS) algorithm [35], which aims to find groups of elements that are conserved. They group codons into 20 and 7 aggregates and first identify the genetic code. In the 7-element aggregation, the aliphatics {*I, M, V, L*} form one group and the aromatics {*Y, F*} form one group (in our case, this is not as distinct, see Figure 3). Furthermore, half of the amino acids--those that are hydrophilic and basic--form one large group (in this way, *S* is not split). This group, however, does not respect the purine-pyrimidine separation. Kosiol et al. also apply the same algorithm on an empirical amino acid model [36] and acquire very similar results. One may argue that this is expected, since the AIS algorithm identifies the genetic code and since the aggregation of the codon model with respect to genetic code probably is very similar to the amino acid model. If we compare the aggregates acquired by Jiménez-Montaño et al. and Kosiol et al., we see that there is little agreement, with the exception that *A*, *S* and *T* are in the same aggregates in both cases.

Johnson and Overington have compared dendrograms based on twelve different scoring matrices with respect to a distance measure between scoring distributions [37]. In the resulting dendrograms all of our aggregates occur to various degrees (in 7, 4, 6 and 3 out of 12 times for {*K, R*}, {*A, T*}, {*E, D*} and {*I, M, V*}, respectively). Interestingly, all of our aggregates occur both in the dendrograms based on the scoring matrices by Gonnet et al [38] and Jones et al [39], where *W* also forms its own aggregate. Both these matrices are based on empirical sequence comparisons. In contrast, there is much less agreement with respect to matrices based on chemical or physical properties.

The standard genetic code is quite easily identified since member codons within an aggregate are invariant under mutations as they code for the same amino acid. It is not as clear, however, why the most significant amino acid aggregation is given by {*A, T*}, {*I, M, V* }, {*E, D*} and {*K, R*}--one of many other possible ways to group together amino acids. One may hypothesize that the aggregated dynamics of codon substitutions provide information about the origin of the genetic code. There are several theories aiming to address the fundamental question on how the code came to be. See Ref. [15] for a comprehensive comparison. With the exception of the *frozen accident* theory by Crick [40], these theories couple the evolution of the genetic code primarily with physico-chemical properties of the amino acids or evolved biosynthetic pathways. Woese [9], specifically, suggested that the code has evolved by a process of ambiguity reduction. The idea is that a crude primordial version of the code, where groups of codons code for groups of amino acids with resembling properties, evolved into the code's current state by a series of refinements. One may ask if amino acid groups reflect earlier versions of the code. Riddle et al. [41] experimentally searched for a minimum set of amino acids capable of forming complex protein folds. They found that the five amino acids *A*, *G*, *I*, *E* and *K* (underlined in Figure 3) are capable of forming most of the ancient SH3 protein domain. Consider again the most significant amino acid aggregation and note that *A*, *I*, *E* and *K* are all members of separate aggregates and that *G* forms its own aggregate. One could speculate that the aggregates reflect group codons in an earlier version of the code, and that these groups were specialized into present day codons. It is an intriguing hypothesis that is also partly supported by Jiménez-Montaño's hypothesis on the evolutionary history of the code [14]. In the proposed evolutionary tree, which is based on group theory and the thermodynamics of codon-anticodon interactions, amino acids within aggregates {*A, T*}, {*I, M, V*} and {*E, D*} share the same branches up till the two last reassignment of codons, although *K* and *R* are separated earlier than that (four reassignments). However, these observations remain speculative and clearly a more careful analysis would be required in order to conclusively relate acquired aggregates to the evolution of the standard genetic code and its deviates.

## Conclusions

We have employed two techniques to identify multiple levels of substitution dynamics. The techniques only consider the raw dynamics of the system and are therefore independent of system dependent assumptions that may be irrelevant or even false. The techniques therefore manage to blindly identify the amino substitution process via the standard genetic code, as well as higher order substitution dynamics via amino acid groups. The techniques are not limited to the codon substitution process, but may be applied to systems that are specified by a state space and a transition matrix. This is a very broad class of systems that for instance include networks (where vertices constitute the state space, and where the transition matrix is defined by the network Lagrangian). The techniques may therefore be used to identify hierarchical dynamics in seemingly very different biological systems, such as protein interaction networks, genetic regulatory networks, metabolic pathways and food webs. Identifying the hierarchical structure of a system does not only increase our understanding of the system, especially if the levels are intuitively difficult to grasp, but it also enables effective coarse graining in simulations. As soon as one hierarchical level is identified, details of lower levels can be discarded if they are not of relevance when simulating the system at the new level. Due to the increasing size and complexity of current models that owe to the rapid growth of available biological data, such reductions are of particular value.

## Declarations

### Acknowledgements

This work was funded in part by the EU integrated project FP6-IST-FET PACE, by EU FP6-NEST project EMBIO, and by EU STREP project FP6-IST-STREP MORPHEX. The authors thank Rickard Sandberg for helpful discussions, and the anonymous reviewers for their valuable comments.

## Authors’ Affiliations

## References

- Nirenberg MW, Matthaei JH: The Dependence of Cell-Free Protein Synthesis in E. Coli Upon Naturally Occuring or Synthetic Polyribonucleotides.
*Proceedings of the National Academy of Sciences of the USA*1961, 47: 1588–1602. 10.1073/pnas.47.10.1588View ArticlePubMedPubMed CentralGoogle Scholar - Jungck JR: The genetic code as a periodic table.
*Journal of Molecular Evolution*1978, 11(3):211–224. 10.1007/BF01734482View ArticlePubMedGoogle Scholar - Karasev V, Stefanov V: Topological Nature of the Genetic Code.
*Journal of Theoretical Biology*2001, 209(3):303–317. 10.1006/jtbi.2001.2265View ArticlePubMedGoogle Scholar - Chechetkin VR: Block structure and stability of the genetic code.
*Journal of Theoretical Biology*2003, 222(2):177–188. 10.1016/S0022-5193(03)00025-0View ArticlePubMedGoogle Scholar - Wilhelm T, Nikolajewa S: A new classification scheme of the genetic code.
*Journal of molecular evolution*2004, 59(5):598–605. 10.1007/s00239-004-2650-7View ArticlePubMedGoogle Scholar - Di Giulio M: The origin of the genetic code: theories and their relationships, a review.
*BioSystems*2005, 80: 175–184. 10.1016/j.biosystems.2004.11.005View ArticlePubMedGoogle Scholar - Marquez R, Smit S, Knight R: Do universal codon-usage patterns minimize the effects of mutation and translation error?
*Genome Biology*2005, 6(11):R91. 10.1186/gb-2005-6-11-r91View ArticlePubMedPubMed CentralGoogle Scholar - Itzkovitz S, Alon U: The genetic code is nearly optimal for allowing additional information within protein-coding sequences.
*Genome Research*2007, 17: 405–412. 10.1101/gr.5987307View ArticlePubMedPubMed CentralGoogle Scholar - Woese CR: On the evolution of the genetic code.
*Proceedings of the National Academy of Sciences of the USA*1965, 54: 1546–1552. 10.1073/pnas.54.6.1546View ArticlePubMedPubMed CentralGoogle Scholar - Wong JTF: Role of minimization of chemical distances between amino acids in the evolution of the genetic code.
*Proceedings of the National Academy of Sciences of the United States of America*1980, 77(2):1083–1086. 10.1073/pnas.77.2.1083View ArticlePubMedPubMed CentralGoogle Scholar - Hornos JEM, Hornos YMM: Algebraic model for the evolution of the genetic code.
*Physical Review Letters*1993, 71(26):4401–4404. 10.1103/PhysRevLett.71.4401View ArticlePubMedGoogle Scholar - Jiménez-Sánchez A: On the origin and evolution of the genetic code.
*Journal of Molecular Evolution*1995, 41(6):712–716. 10.1007/BF00173149View ArticlePubMedGoogle Scholar - Trifonov E, Bettecken T: Sequence fossils, triplet expansion, and reconstruction of earliest codons.
*Gene*1997, 205(1–2):1–6. 10.1016/S0378-1119(97)00479-4View ArticlePubMedGoogle Scholar - Jiménez-Montaño MA: Protein evolution drives the evolution of the genetic code and vice versa.
*BioSystems*1999, 54: 47–64. 10.1016/S0303-2647(99)00058-1View ArticlePubMedGoogle Scholar - Trifonov EN: Consensus temporal order of amino acids and evolution of the triplet code.
*Gene*2000, 261: 139–151. 10.1016/S0378-1119(00)00476-5View ArticlePubMedGoogle Scholar - Copley SD, Smith E, Morowitz HJ: A mechanism for the association of amino acids with their codons and the origin of the genetic code.
*Proceedings of the National Academy of Sciences of the United States of America*2005, 102(12):4442–4447. 10.1073/pnas.0501049102View ArticlePubMedPubMed CentralGoogle Scholar - Bollenbach T, Vetsigian K, Kishony R: Evolution and multilevel optimization of the genetic code.
*Genome Research*2007, 17(4):401–404. 10.1101/gr.6144007View ArticlePubMedGoogle Scholar - Novozhilov AS, Wolf Y, Koonin EV: Evolution of the genetic code: partial optimization of a random code for robustness to translation error in a rugged fitness landscape.
*Biology Direct*2007, 2.Google Scholar - Woese CR: Order in the genetic code.
*Proceedings of the National Academy of Sciences of the USA*1965, 54: 71–75. 10.1073/pnas.54.1.71View ArticlePubMedPubMed CentralGoogle Scholar - Dayhoff MO, Schwartz RM, Orcutt BC: A model of evolutionary change in proteins. In
*Atlas of protein sequence and structure*.*Volume 5*. National biomedical research foundation, Washington DC.: M. O. Dayhoff; 1978:345–358.Google Scholar - Taylor W: The classification of amino acid conservation.
*Journal of Theoretical Biology*1986, 119(2):205–218. 10.1016/S0022-5193(86)80075-3View ArticlePubMedGoogle Scholar - Wu TD, Brutlag DL: Discovering Empirically Conserved Amino Acid Substitution Groups in Databases of Protein Families. In
*Proceedings of the Fourth International Conference on Intelligent Systems for Molecular Biology*. AAAI Press; 1996:230–240.Google Scholar - Hwang DG, Green P: Bayesian Markov chain Monte Carlo sequence analysis reveals varying neutral substitution patterns in mammalian evolution.
*Proceedings of the National Academy of Sciences*2004, 101(39):13994–14001. 10.1073/pnas.0404142101View ArticleGoogle Scholar - R O, F LW: A collection of amino acid replacement matrices derived from clusters of orthologs.
*Journal of Molecular Evolution*2005, 5(61):659–665.Google Scholar - Jacobi MN, Görnerup O: A Spectral Method For Aggregating Variables In Linear Dynamical Systems With Application To Cellular Automata Renormalization.
*Advances in Complex Systems*2009, 12(02):131–155. 10.1142/S0219525909002155View ArticleGoogle Scholar - Schneider A, Cannarozzi GM, Gonnet GH: Empirical codon substitution matrix. BMC Bioinformatics 2005., 6(13):Google Scholar
- Simon HA: The Architecture of Complexity.
*Proceedings of the American Philosophical Society*1962, 106: 467–482.Google Scholar - Kemeny JG, Snell JL:
*Finite Markov Chains.*2nd edition. New York, NY, USA: Springer; 1976.Google Scholar - Meilă M, Shi J: A random walks view of spectral segmentation.
*In AI and Statistics (AISTATS)*2001.Google Scholar - Görnerup O, Nilsson Jacobi M: A Method for Finding Aggregated Representations of Linear Dynamical Systems.
*Advances in Complex Systems*2010, in press.Google Scholar - Trifonov EN, Kirzhner A, Kirzhner VM, Berezovsky IN: Distinct Stages of Protein Evolution as Suggested by Protein Sequence Analysis.
*Journal of Molecular Evolution*2001, 53(4):394–401. 10.1007/s002390010229View ArticlePubMedGoogle Scholar - Trifonov EN: Theory of Early Molecular Evolution. In
*Discovering Biomolecular Mechanisms with Computational Biology*. Springer US; 2006:107–116. full_textView ArticleGoogle Scholar - Jiménez-Montano MA, He M: Irreplaceable Amino Acids and Reduced Alphabets in Short-Term and Directed Protein Evolution. In
*ISBRA '09: Proceedings of the 5th International Symposium on Bioinformatics Research and Applications*. Berlin, Heidelberg: Springer-Verlag; 2009:297–309.View ArticleGoogle Scholar - Kosiol C, Holmes I, Goldman N: An empirical codon model for protein sequence evolution.
*Molecular biology and evolution*2007, 24(7):1464–1479. 10.1093/molbev/msm064View ArticlePubMedGoogle Scholar - Kosiol C, Goldman N, Buttimore NH: A new criterion and method for amino acid classification.
*Journal of Theoretical biology*2004, 228: 97–106. 10.1016/j.jtbi.2003.12.010View ArticlePubMedGoogle Scholar - Whelan S, Goldman N: A general empirical model of protein evolution derived from multiple protein families using a maximum-likelihood approach.
*Molecular Biology and Evolution*2001, 18(5):691–699.View ArticlePubMedGoogle Scholar - Johnson MS, Overington JP: A Structural Basis for Sequence Comparisons: An Evaluation of Scoring Methodologies.
*Journal of Molecular Biology*1993, 233: 716–738. 10.1006/jmbi.1993.1548View ArticlePubMedGoogle Scholar - Gonnet GH, Cohen MA, Benner SA: Exhaustive Matching of the Entire Protein Sequence Database.
*Science*1992, 256(5003):1443–1445. 10.1126/science.1604319View ArticlePubMedGoogle Scholar - Jones DT, Taylor WR, Thornton JM: The rapid generation of mutation matrices from protein sequences.
*CABIOS*1992, 8: 275–282.PubMedGoogle Scholar - Crick FH: The origin of the genetic code.
*Journal of Molecular Biology*1968, 38(3):367–379. 10.1016/0022-2836(68)90392-6View ArticlePubMedGoogle Scholar - Riddle DS, Santiago JV, Bray-Hall ST, Doshi N, Grantcharova VP, Yi Q, Baker D: Functional rapidly folding proteins from simplified amino acid sequences.
*Nature Structural and molecular biology*1997, 4: 805–809. 10.1038/nsb1097-805View ArticleGoogle Scholar

## Copyright

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.