### Pair-wise sequence alignment scores in information theory

Criteria to measure the variation/conservation balance between proteins should embody as much as possible the structural and functional potentiality within sequences of amino acids. In the absence of explicit physical criteria, amino acid similarity was solved empirically by measuring amino acid substitution frequencies in alignments of homologous sequences [30, 40]. Given two amino acids *i* and *j*, the similarity function *s*(*i*,*j*) was set as:

where

is the observed frequency of substitution of *i* by *j* or *j* by *i*, and *π*_{
i
}and *π*_{
j
}are the frequencies of *i* and *j* in the two aligned sequences. The frequency is the estimate of the probability of substitution of *i* by *j* in real alignments; whereas *π*_{
i
}*π*_{
j
}is the estimate of the probability of substitution under the independency hypothesis. The similarity function gives a 20 × 20 similarity matrix usable to score protein sequence alignments, that can be interpreted in the information theory [37, 38] according to the following proposition.

#### Proposition 1

Amino acid substitution matrix values are estimates of the mutual information between amino acids in the sense of Hartley [37, 38]. Consequently, the optimal alignment score computed between two biological sequences is an estimate of the optimal mutual information between these sequences.

#### Proof

Given a probability law *P* that characterizes a random variable, the Hartley self-information *h* is defined as the amount of information one gains when an event *i* occurred, or equivalently the amount of uncertainty one loses after learning that *i* happened:

*h*(*i*) = -log(*P*(*i*)) (4)

The less likely an event *i*, the more we learn about the system when *i* happens. The mutual information *I* between two events, is the reduction of the uncertainty of one event *i* due to the knowledge of the other *j*:

*I*_{j→i}= *h*(*i*) - *h*(*i*/*j*) (5)

Mutual information is symmetrical, *i.e. I*_{j→i}= *I*_{i→j}, and in the following will be expressed by *I*(*i*;*j*). The self and mutual information of two events *i* and *j* are related:

*h*(*i* ∩ *j*) = *h*(*i*) + *h*(*j*) - *I*(*i*;*j*) (6)

If the occurrence of one of the two events makes the second impossible, then the mutual information is equal to - ∞. If the two events are fully independent, mutual information is null. The empirical measure of the similarity between two amino acids described in equation (3) can therefore be expressed in probabilistic terms:

where *P*_{
ϖ
}is the joint probability to have *i* and *j* aligned in a given alignment and *P*_{
π
}the measure of probability that amino acids occur in a given sequence. From equations (4) and (6), equation (7) becomes:

*s*(*i*, *j*) = *h*(*i*) + *h*(*j*) - *h*(*i* ∩ *j*) (8)

that is

*s*(*i*, *j*) = *I*(*i*; *j*) (9)

As a consequence, the similarity function (or score) is the mutual information between two amino acids. Additionally, the score between sequences (the sum of elementary scores between amino acids, [32, 33, 41, 42]) is, according to the hypothesis of independence of amino acid positions, the estimated mutual information between the two given biological sequences.

Once two sequences are aligned, we pose the question whether the alignment score is sufficient to assess that the proteins are conceivably alike and thus evolutionarily related? The theorem of the upper limit of a sequence alignment score probability (TULIP theorem, [39]), sets the upper bound of an alignment score probability, under a hypothesis less restrictive than the Karlin-Altschul model [43]. Given two real sequences *a* and *b* (*a* = *a*_{1}*a*_{2}...*a*_{
m
}and *b* = *b*_{1}*b*_{2}...*b*_{
n
}), where *s* = *s*(*a*,*b*) the maximal score of a pair-wise alignment obtained with any alignment method, *b** the variable corresponding to the shuffled sequences from *b*, and given *P*{*S*(*a*,*b**)≥*s*} the probability that an alignment by chance between *a* and *b** has a higher score than *s*, then whatever the distribution of the random variable *S*(*a*,*b**) the TULIP theorem states:

with *k* > 1, *μ* the mean of and *σ* its standard deviation. The unique restriction on *S*(*a*,*b**) is that it has a finite mean and a finite variance. A first corollary of the TULIP theorem is that the Z-score is a statistical test for the probability of a sequence alignment score. We additionally state the following new corollary.

#### TULIP corollary 2

Given the TULIP theorem conditions, let

be the Z-score [44]. Then, *z*(*a*,*b**) is the greatest possible value for *k* (*k*∈]1,+∞[), which holds relation (10) true. In consequence, with *k* = *z*(*a*,*b**), then

The best upper bound value for *P*{*S*(*a*,*b**)≥*s*} is termed . From the TULIP theorem and corollaries, the comparison of a protein to a given reference *a*, weighed by an alignment score, is characterized by a bounded probability that the alignment is fortuitous.

### Question of the proximity between protein sequences in the light of information theory

Since the optimized alignment score of two protein sequences allows an access to both the mutual information between proteins and an upper bound that the alignment is not fortuitous, one would expect that it is an accurate way to spatially organize proteins sets. A simple relation would be "the higher the mutual information, the nearest". There are three ways to assess the proximity between two objects *a* and *b* in a given space *E* [41]. The first is dissimilarity, a function *f*(*a*,*b*): *E* × *E* → ^{+} such that *f*(*a*,*b*) = 0 ⇔ *a* = *b* and *f*(*a*,*b*) = *f*(*b*,*a*); the second is the distance *per se*, that is a dissimilarity such that the triangle inequality is respected: ∀ *a*,*b*,*c* ∈ *E*, *f*(*a*,*c*) ≤ *f*(*a*,*b*) + *f*(*b*,*c*); and the third is the similarity defined as a function *f*(*a*,*b*): *E* × *E* → such that and *f*(*a*,*b*) = *f*(*b*,*a*). Representing objects in a space is convenient using the notion of distance. When the optimal alignment is global, *i.e.* requiring that it extends from the beginning to the end of each sequence [32], it is theoretically possible to define a distance *per se*, that is to spatially organize the compared sequences [41]. However, from a biological point of view, global alignment algorithms are not reliable to assess homology of protein domains. Local alignments are better suited, using scoring matrices to find the optimum local alignment and maximizing the sum of the scores of aligned residues [28, 31]. In contrast with global alignments, local alignments do not allow any trivial definition of distances [41].

Although amino acid similarity is a function *f*(*i*, *j*): *E* × *E* → , owing to the local alignment optimization algorithms, the computed score is a function *f*(*a*,*b*): *E* × *E* → ^{+}, requiring the existence of at least one positive score in the similarity matrices. Thus, when constructing an alignment with the Smith and Waterman [33] method, the constraint that *s*(*a*,*b*)>0 (*i.e. I(a;b)* > 0) is imposed. This condition is consistent with proposition 1: if two sequences are homologous, knowledge about the first has to bring information about the second, that is to say, the mutual information between the two sequences cannot decrease below zero: *I(a;b)* > 0 (*i.e. s(a,b)* > 0). As a consequence, in the following geometric construction we sought a refined expression for the proximity of proteins.

### Geometric construction of a configuration space of homologous proteins (CSHP) conserving mutual information

In a set of homologous proteins, any sequence *a* can be selected as a reference, noted *a*_{
ref
}, in respect to which the others are compared. A geometric representation of objects relatively to a fixed frame is known as a configuration space (CS). In physics, a CS is a convenient way to represent systems of particles, defined by their positional vectors in some reference frame. Here, given *n* similar sequences, it is therefore possible to consider *n* references of the CSHP. In a given (CSHP, *a*_{
ref
}), each amino acid position aligned with a position in the *a*_{
ref
}sequence, corresponds to a comparison dimension (CS dimension). Proteins are simply positioned by a vector, the coordinates of which are given by the scores of aligned amino acids. Gaps are additional dimensions of the CS. When considering that local algorithms identify the space of biological interest, *i.e.* a CSHP, the gap penalty is a parameter that maximizes the shared informative dimensions. Thus, given the amino acids mutual information, alignment optimization methods define the relative positions of proteins.

At this point in our construction, a first important property of the CSHP can be deduced. Since mutual information with *a*_{
ref
}is sufficient for the full positioning, then positioning of proteins in a given (CSHP, *a*_{
ref
}) is unambiguous, unique, and is not altered when proteins are added or removed. In other words, a (CSHP, *a*_{
ref
}) is a univocal space.

Given two sequences *a* and *b*, if *b* occurs in (CSHP, *a*_{
ref
}), then *a* also occurs in (CSHP, *b*_{
ref
}). The pair-wise alignment of *a* and *b* having no order (symmetry of the mutual information), the positions of *b* in (CSHP, *a*_{
ref
}) is dependent of the position of *a* in (CSHP, *b*_{
ref
}). Thus, once a (CSHP, *a*_{
ref
}) has been built, ∀*b* ∈ (CSHP, *a*_{
ref
}), part of the geometry of (CSHP, *b*_{
ref
}) is learnt. Thus, in a CSHP, information needed for the position of *n* sequences is totally contained in the geometry of the *n* (CSHP, *a*_{
ref
}). This geometric stability is not observed with multiple alignments, which can be deeply modified by addition or removal of sequences. In the CSHP, protein position is unaltered by additions or removals of other proteins. In practice, the construction of CSHP is therefore completely deduced from any all-by-all protein sequence comparison [45, 46] and can be easily updated.

### The q-dissimilarity, a proximity notion for a geometric representation of the CSHP

In the CSHP, the definition of a distance *per se* based on mutual information is reduced *ad absurdum* (For demonstration, see methods). To define a proximity function i) sharing properties of distance, *i.e.* increasing when objects are further apart, ii) deriving from similarity and iii) relying on mutual information, particularly the property "*f*(*a*,*a*) ≠ *f*(*b*,*b*) is possible", we introduce a fourth notion of proximity. Such proximity was called *q-dissimilarity* (for quasi-dissimilarity), a function *f*(*a*,*b*): *E* × *E* → ^{+} is defined such that

∀ *a* ∈ *E*, ∀ *b* ∈ *E*, *f*(*a*,*b*) = *f*(*b*,*a*) (13)

Let *s* be a similarity, then *q* = *e*^{-s}is a q-dissimilarity, named the 'canonical q-dissimilarity' associated to *s*. Accordingly, the TULIP theorem allows a statistical characterization of *q*(*a*,*b*) the canonical q-dissimilarity between two sequences *a* and *b*.

#### TULIP corollary 3

From the TULIP corollary 2, relation (14) is simply deduced:

with *Q*(*a*,*b**) being the random q-dissimilarity variable associated with *S*(*a*,*b**). Given a (CSHP, *a*_{
ref
}), each sequence *b* aligned with *a* is characterized by a q-dissimilarity *q*(*a*,*b*). In geometric terms, *b* can be represented as a point contained in a hyper-sphere *B* of radius *q*(*a*,*b*).

The representation of a (CSHP, *a*_{
ref
}) shown in Figure 1 is therefore in conformity with all constraints listed earlier and can also serve as a Venn diagram for the setting of events realized following a continuous random variable *Q*(*a*,*b**). When *a* is compared to itself, it is set on a hyper-sphere *A* of radius *q*(*a*,*a*), which is not reduced to one point. In the context of information theory, it is therefore possible to express that the proximity respects the property "*q*(*a*,*a*) ≠ *q*(*b*,*b*) is possible". Considering Figure 1, is the probability for a random sequence b* to be in the hyper-sphere *B*. In conclusion, the *q-dissimilarity* is therefore a proximity notion that allows a rigorous geometric description of the configuration space of homologous proteins, real or simulated, (CSHP, *a*_{
ref
}, *q*).

### Unification of pair-wise alignments theory, information theory, p-distance and q-dissimilarity in the CSHP model

A geometric space is a *topological* space when endowed with characterized *paths* that link its elements. Here, paths can be defined as the underlying evolutionary history separating sequences [11]. Given *u* the common unknown ancestor, then the divergence time *t*(*a*,*b*) is theoretically the summed elapsed times separating *u* to *a* and to *b*. Without any empirical knowledge of *u*, the simplest approximation for *t*(*a*,*b*) was sought as a function of the fraction of identical residues *f*_{
id
}, thus of the p-distance. With the hypothesis of the molecular clock, this function can be given as equation (2), where the transmutation of *a* and *b* is a consequence of a Poisson process. By using relation (9) on the equivalence between score similarity and mutual information, then the fundamental postulate "the closer in the evolution, the more alike and conversely, the more alike, *probably* the closer in the evolution" can be reformulated:

#### Fundamental postulate

Given two homologous proteins *a* and *b*, the closer in the evolution, the greater the mutual information between *a* and *b* (*i.e.* the optimal computed score *s*(*a*,*b*)) and conversely, the greater the mutual information between *a* and *b*, *probably* the closer in the evolution.

Whereas the first part of the postulate is a consequence of the conservational pressure on mutual information, the second assertion founds the historical reconstruction underlying a set of biological sequences on statistical concepts. A corollary is that evolution of two homologous proteins is characterized by a loss of mutual information.

In the CSHP, this formulation of the fundamental postulate allows a novel mathematical formalization of the p-distance in probabilistic terms. Basically, the p-distance is the divergence observed between two sequences *knowing* that they share some features (the observed sequences *a* and *b*) and that they were identical before the speciation event (sequence *u*).

Looking back to equation (1), we can re-formulate *f*_{
id
}in probabilistic terms, considering the fraction of shared features (identical sites) *knowing* the observed data and the existence of a common ancestor. Given two proteins *a* and *b*, let us consider the random variable *Q*(a,b*), defined in TULIP corollary 3. In (CSHP, *a*_{
ref
}), shown in Figure 1, one can define the probability law *P*{*Q*(*a*,*b**)≤*ρ*} as the probability that the q-dissimilarity between *b** and *a*_{
ref
}is lower than *ρ*. The hyper-sphere of radius *ρ* contains therefore the *b** random sequences sharing informative features with *a* accordingly. The probability *p*_{id/a}that *b** shares identity with *a*, *knowing* that the q-dissimilarity between *b** and *a* is lower than that between the real sequences *b* and *a*, is:

*p*_{id/a}(*b**) = *P*{*Q*(*a*,*b**) ≤ *q*(*a*,*a*) / *Q*(*a*,*b**) ≤ *q*(*a*,*b*)} (15)

which is a probabilistic expression of *f*_{
id
}in respect to the reference *a*_{
ref
}. According to the Venn diagram in Figure 1: *p*_{id/a}(*b**) = *P*(*A*/*B*)

Using the Bayes theorem, equation (15) can be expressed as:

In consequence:

which can be expressed as

Assuming that substitution rates are independent of lineages [35], then random sequence models a* and b* are equivalent, that is to say *Q*(*a*,*a**) ≈ *Q*(*a*,*b**) and

Thus *p*_{id/a}, and symmetrically *p*_{id/b}, provide a probabilistic expression of *f*_{
id
}*knowing* the data, *i.e.* the observed mutual information between *a* and *b* expressed as *Q*(*a*,*b*).

Given two homologous sequences *a* and *b*, when their optimal score is *s*(*a*,*b*) ≥ *μ* + *ψ* with *ψ* being a critical threshold value depending on the score distribution law (See Methods for the demonstration for the critical threshold), owing to the TULIP corollary 2, we can state that *p*_{id/a}is bounded above:

This expression can also be developed as:

where *μ*_{1}, *σ*_{1}, *μ*_{2} and *σ*_{2} are the mean and the standard deviation of *S*(*a*,*b**) and *S*(*a*,*a**) respectively. The right term in relation (21) exhibits analogies with *f*_{
id
}given by equation (1), showing that the pragmatic approach by Feng and Doolittle [19] could be supported and generalized in a theoretical elaboration.

Using the Poisson correction, an expression of *t*(*a*,*b*) is given as the linear combination of the two corrections of the p-distance deduced from *p*_{id/a}and *p*_{id/b}:

*t*(*a*,*b*) = -[log(*p*_{id/a}(*b**)) +log(*p*_{id/b}(*a**))] (22)

with *a** and *b** the random variables corresponding to the shuffled sequences of *a* and *b* respectively. The sum of the logarithms corresponds to the product of the two probabilities, an expression of the hypothesis of independence of lineage. Interestingly, equation (22) provides an expression of the symmetric effect of time on the variations that independently affected *a* and *b*.

From relation (20), *t*(*a*,*b*) appears as a function of Z-score ratios. For any set of homologous proteins, it is therefore possible to measure a table of pair-wise divergence times and build phylogenetic trees using distance methods.

### Reconstruction of protein phylogeny: first example, case study of the glucose-6-phosphate isomerase phylogeny

We compared the trees we obtained, called TULIP trees, to phylogenetic trees built using classical methods, for instance the popular PHYLIP [47] or PUZZLE-based [48] methods, termed here MAB trees (for multiple alignment-based trees). Firstly, because MAB trees are constructed from multiple alignments, removals or additions of proteins modify the multiple alignments. Inclusion of sequences is considered as a way to improve the quality of multiple alignments and to increase the sensitivity of the comparison of distant sequences [49, 50]. By contrast, the protein space used to build TULIP trees is not reordered when data sets are incremented or decremented (drawing of the TULIP tree may apparently change due to the tree graphic representation methods; nevertheless the absolute tree topology is not reordered). This remarkable property is due to both the geometrical construction by pair-wise comparison and the convergence of the distance matrix elements estimated by equation (21). Indeed, the estimate of the right-hand term of equation (21) relies on a Monte Carlo method, after randomization of the biological sequences [39, 44, 51] and is therefore dependent on the sequence randomization model [52] and convergent in respect to the weak law of large numbers [53]. Convergence is proportional to , where *numb*_{
rand
}is the number of randomizations. In the case studies presented here, we set *num*_{
rand
}= 2000 (see Methods). By contrast, stability of MAB trees is sought by bootstrapping approaches and consensus tree reconstruction. MAB trees appear as the result of a complex learning process including possible re-adjustment of the multiple alignments after eye inspection pragmatically applied to assist the reconstruction. Alternatively, Bayesian analyses have been recently proposed for phylogenetic inference [54], estimating posterior probability of each clades to assess most likely trees. Still, in a recent comparative study, Suzuki et al. [55] and Simmons et al. [56] provided evidence supporting the use of relatively conservative bootstrap and jacknife approaches rather than the more extreme overestimates provided by the Markov Chain Monte Carlo-based Bayesian methods. In the absence of any decisive methods to assess the validity of the trees obtained after such different approaches, no absolute comparison with the TULIP classification trees can be rigorously provided.

Whenever a TULIP classification was achieved on a dataset that led to a consensual MAB tree, both were always consistent. For example, Figure 2 shows the phylogenetic PHYLIP [47] and TULIP trees obtained for glucose-6-phosphate isomerases (G6PI). Phylogeny of the G6PI enzyme has been studied by Huang et al. [57] in order to demonstrate the horizontal transfer of this enzyme in the apicomplexan phylum due to a past endosymbiosis [57]. Owing to the neighbor-joining analysis used by Huang et al. [57] (see methods) Figure 2A shows that apicomplexan G6PI is "plant-like". The TULIP tree shown in Figure 2B is consistent with this conclusion. Interestingly, differences between the two trees are found only when the bootstrap values on the MAB tree are not strong enough to unambiguously assess branching topology.

### Reconstruction of protein phylogeny: second example, case study of the enolase phylogenic incongruence

TULIP classification tree further helps in solving apparent conflicting results obtained with MAB methods. In a comprehensive study from Keeling and Palmer [36] the PUZZLE-based reconstruction of the enolase phylogeny led to incongruent conclusions. Enolase proteins from a wide spectrum of organisms were examined to understand the evolutionary scenario that might explain that enolases from land plants and alveolates shared two short insertions. Alveolates comprise apicomplexan parasites, known to contain typical plant features as mentioned above, particularly a plastid relic. In this context, the shared insertion in apicomplexan and plant enolases (Figure 3) has been interpreted as a possible signature for some evolutionary relationship between apicomplexans and plants [58, 59] and a likely sign of a lateral transfer. From the distribution of this insertion in enolases from several key eukaryotic groups, Keeling and Palmer [36] postulated that lateral transfer had been an important force in the evolution of eukaryotic enolases, being responsible for their origin in cryptomonads, *Chlorarachnion* and *Arabidopsis*. However, they could not conclude about alveolates, finding a conflict between the distribution of the insertion and the MAB phylogenetic position (Figure 4A). The authors had to admit that lateral gene transfers failed to explain apicomplexa enolases, and were compelled to suppose that the lack of congruence between insertion and phylogeny could be because of a parallel loss of insertions in lineages, or to more complex transfers of gene portions.

Based on our theoretical model, we constructed the corresponding TULIP tree. TULIP trees given with BLOSUM 62 or PAM 250 matrices, Fitch-Margoliash or neighbor-joining methods led indistinctly to a unique tree topology (Figure 4B). Separation of great phyla (Archaebacteria, Eubacteria, Diplomonads, Trypanasomes, Animals, Fungi and Amoeba) is recovered. A plant-like cluster is additionally reconstructed, in which a distinct separation occurred between {Rhodophytes ; Cryptomonads} and {Land Plants ; Charophytes ; Chlorarachnion ; Alveolates} main clusters. It is remarkable that this latter cluster is that characterized by the enolase insertion.

This topology corresponds to the observed distribution of the enolase short insertions and provides therefore a solution to the apparent enolase phylogeny incongruence: the phylogenetic position of alveolates is not in conflict with the distribution of enolase insertion and the apicomplexa enolase is possibly a consequence of a lateral transfer, like in cryptomonads.

### Large scale phylogeny based on a CSHP built from massive genomic pair-wise comparisons

A CSHP containing large sets of protein sequences can be built after any all-by-all massive comparison providing Z-score statistics. Because the space elaboration is explicit, then quality of the mutual information conservation depends on the choice of the scoring matrix, the geometric positioning depends on the local alignment method, the homology assessment depends on the alignment score and probabilistic cutoffs and the phylogenetic topology on the choice of the stochastic law correction. Eventually, genome-scale pair-wise comparisons [39, 36] find in the present CSHP a robust, evolutionary consistent and easily updatable representation.