Evaluation of properties over phylogenetic trees using stochastic logics
 José Ignacio Requeno^{1}Email author and
 José Manuel Colom^{1}
https://doi.org/10.1186/s1285901610777
© Requeno and Colom. 2016
Received: 19 November 2015
Accepted: 7 May 2016
Published: 14 June 2016
Abstract
Background
Model checking has been recently introduced as an integrated framework for extracting information of the phylogenetic trees using temporal logics as a querying language, an extension of modal logics that imposes restrictions of a boolean formula along a path of events. The phylogenetic tree is considered a transition system modeling the evolution as a sequence of genomic mutations (we understand mutation as different ways that DNA can be changed), while this kind of logics are suitable for traversing it in a strict and exhaustive way. Given a biological property that we desire to inspect over the phylogeny, the verifier returns true if the specification is satisfied or a counterexample that falsifies it. However, this approach has been only considered over qualitative aspects of the phylogeny.
Results
In this paper, we repair the limitations of the previous framework for including and handling quantitative information such as explicit time or probability. To this end, we apply current probabilistic continuoustime extensions of model checking to phylogenetics. We reinterpret a catalog of qualitative properties in a numerical way, and we also present new properties that couldn’t be analyzed before. For instance, we obtain the likelihood of a tree topology according to a mutation model. As case of study, we analyze several phylogenies in order to obtain the maximum likelihood with the model checking tool PRISM. In addition, we have adapted the software for optimizing the computation of maximum likelihoods.
Conclusions
We have shown that probabilistic model checking is a competitive framework for describing and analyzing quantitative properties over phylogenetic trees. This formalism adds soundness and readability to the definition of models and specifications. Besides, the existence of model checking tools hides the underlying technology, omitting the extension, upgrade, debugging and maintenance of a software tool to the biologists. A set of benchmarks justify the feasibility of our approach.
Keywords
Background
A phylogenetic tree is a description of the evolution process which is discovered via molecular sequencing data and morphological data matrices [1]. Computer science tools have upgraded the capabilities of biologists for their construction as well as for extracting and analyzing the implicit biological messages embedded in them [2, 3]. Nowadays, more and more applications rely on the existence of a support phylogenetic tree for the confirmation of biological hypotheses that are valuable for the scientific community. For example, a small but representative portion of these researches combine phylogenetic trees (constructed via the mentioned tools using the information of the genome) with fossils, geographical or phenotypical data in order to find any mismatch, to trace the human migrations [4] and to inspect the distribution of endemic diseases [5]. In this sense, they use a phylogenetic tree for testing biological hypotheses about evolution in a similar way to [6]. Indirectly, the evaluation results and counterexamples help to feedback the phylogeny and increase its quality. However, the wide range of diverse methods and tools used by biologists for studying phylogenetic properties recommended the research of a generic framework for heterogeneous hypotheses testing over trees.
Model checking is a paradigm stemming from computer science based on temporal logics which has been successfully applied in industry for system modeling and verification [7]. The basic principle allowing the use of the model checking framework in the context of phylogeny is the interpretation of the phylogenetic tree as a transition system representing a computational model of the evolution process, i.e., a rooted directed acyclic graph describing the potential behavior of a discrete system whose paths indicate a sequence of intermediate states and the transitions are speciation events. The next step consists of formulating the hypothesis that we desire to investigate over a phylogeny using temporal logics as a formal language. Finally, a model checking tool automatically verifies the satisfaction of the property over the model in an exhaustive way.
Model checking has been recently proposed as a generic unifying framework that allows the phylogeneticist to focus on tree structures, biological properties and symbolic manipulation of phylogenies described using temporal logic, instead of on implementation issues concerned with verification algorithms [8, 9]. The model checking framework helps us to uncouple software tools from the formal definition of models and the specification of properties. Besides, it hides the underlying implementation technology to final users, enabling transparent software upgrades of the model checking tool and removing debugging and programming language concerns through different platforms [10]. Standard branchingtime temporal logics such as Computational Tree Logic (CTL [11]) already allow expressing biological properties referred to the structure of the tree or the arrangement of DNA sequences along the paths [8].
In this paper, we repair the limitations of the previous model checking approach using standard temporal logics. Some of the phylogenetic hypotheses require a more powerful language for being described, i.e., it is necessary to extended the models and specification logics with explicit time and probabilities. Evolution is a continuoustime and stochastic process where the branch length of a phylogeny represents time, and the distribution of biological features is nondeterministic along the whole tree. One of the potential applications of probabilistic model checking is the study of the dispersion of lactose intolerance among populations and the temporal point when it appeared in the tree [12]. The objective is to tolerate more flexibility and expressiveness in the specification of properties.
Explicit time and probabilities are not directly supported by qualitative logics, but they can be included in the specifications of a probabilistic transition system using discrete (PCTL [13]) or continuoustime logics (CSL^{TA} [14]). Much more logics have been proposed according to different metrics and time models [15] (e.g., number of clocks in the system or local vs global clocks), although we will focus on CSL^{TA} here. The interpretation of this kind of specifications requires the modification of the traditional transition system and the algorithms that traverse them. The models must be enlarged in order to support these new informations.
More in detail, we present and classify a more complete description of phylogenetic properties that are intrinsically stochastic by nature. Our aim is to show that an extended model checking framework is suitable for solving temporal and probabilistic problems in the domain of phylogenetics. Current probabilistic model checking tools are generic and powerful enough for our proposal. Other tools such as Phycas [16], RevBayes [17] or Beast2 [18] are focused on solving a set of specific problems while our approach can answer a wider range of questions.
The advantages of applying the probabilistic continuoustime extensions of model checking techniques for evaluating phylogenetic specifications are twofold: a) it is possible to verify if a property is true or false according to some temporal requirements and the probabilistic behavior of the model and b) it is also possible to obtain the minimum probability that makes the specification true within an interval of time. In order to show the feasibility of our approach, we apply our framework to a specific problem: the probability of obtaining a phylogeny with a certain topology. Specially, we focus on the computation of maximum likelihood estimations (MLE) for a phylogenetic tree using models of DNA evolution, that simulate the changes in the genome through the time.
The paper is arranged as follows. The next Section “Methodology” introduces the main notions of evolution and explains how it is modeled under the model checking framework. Besides, the logical specification of nontrivial phylogenetic properties is described. The logic presented in this section is used for evaluating stochastic properties and the likelihood of phylogenetic trees in the case of study. Section “Results and discussion” introduces the methods for managing continuoustime and probabilities in model checking and current probabilistic verifiers. After that, the case of study with maximum likelihood of a phylogenetic tree, and the experimentation details are presented. The key steps for implementing phylogenetic trees and biological properties within the scope of the PRISM tool is presented. Finally, Section “Conclusions” gathers the conclusions drawn from this research and details the future work.
Methodology
Analyzing models of evolution with probabilistic model checking
The process of model checking is divided in three phases. The first part consists of a correct modeling of the system. Next, a set of properties are written using a specification logic. Finally, a model checking tool takes the description of the model and the hypotheses, and analyze them in an automatic and exhaustive way. The quality of the analysis is influenced by the identification of the main characteristics of the system during modeling. To this end, the next paragraphs present the principal keystones that we try to capture of the evolution model and how they are translated into a computational model (i.e., transition system) suitable for the study.
Evolution as a transition system

Topological models. Birthdeath Markovian models describe the macroevolution process of speciation [19, 20]. They are usually used for the generation of random trees fulfilling the topologies and structural characteristics that are supposed to appear in correct phylogenies. They provide a skeleton (i.e., a phylogenetic structure) regardless of the internal details of the tree states. These trees would be later enriched by complementary information (e.g., DNA). These trees, or those constructed by inference or maximum parsimony methods [21], are used as initial seed in subsequent problems, for example, for the tree exploration phase during the evaluation of the maximum likelihood estimations [22, 23].

Sequence models. On the other hand, the genome constitutes nowadays one of the most important traits of a species for its study. Models that describe the modification of protein or DNA sequences through time provide finer details about the genetic changes operating behind the macroevolution models. These models nuance the process from which a sequence of characters switches into another one. In short, the sequence models capture how the genomic information is placed in each state of a tree inferred from a speciation model.
In the context of molecular evolution, the phylogenetic tree would indicate a reachability graph or execution trace of the system, that is, a snapshot capturing the order of occurrence of the mutations along the history of the speciation process since the beginning of time until present day. Therefore, a phylogenetic tree will reflect how the mutation model should act, showing the consecutive changes in a longterm run of the model. In other words, the phylogeny represents the behavior of the system, understood as a computational process modeling the hypothetical flow of the DNA during the time.
Markov chain as a data structure
In probabilistic model checking, the computational models are probabilistic transition systems (PTSs). By default, topological and sequence models are Markov chains whose structure and semantics matches directly with PTSs. For example, the simpler models of evolution leading to the introduction of mutations in the genome are a fourstate Markov chain with different probabilities for the transitions among nucleotides [24]. Commonly, they are parameterized templates indicating the ratios and relations between bases. Several versions of DNA mutation models have been proposed in the literature. By now, there are highly optimized models for specific purposes including the special biochemical features of the nucleotides such as physicalchemical stability of the strands.
Markov chains accept either discrete (integer) or continuous (real) time in the modeling, thought continuoustime is more suitable for capturing finer temporal details of the evolution. Instead of describing explicitly the probability of change from one state to another, the transitions are labeled with rates delimiting the time spent in that branch.
Definition 1 (Continuoustime Markov Chain).

S is a finite set of states,

S _{0}⊆S is the set of initial states,

\(\mathbf {R} \subseteq S \times S \rightarrow \mathbb {R}_{\geq 0}\) is the transition rate matrix between states, i.e., for every pair of states s,s ^{′}∈S, a transition occurs only if R(s,s ^{′})>0, and the probability of this transition being triggered in t time units equals \(1e^{\mathbf {R}(s,s')\cdot t}\phantom {\dot {i}\!}\), and

L:S→2^{ A P } is the labeling function that associates each state with the subset of atomic propositions (AP) that are true of it.
As there exist multiple pair candidates with R(s,s ^{′})>0 outgoing from state s, a race condition appears and the selection of the next state determines the history of events. A higher value of R(s,s ^{′}) prioritizes the transition between s and s ^{′}. If all the successors share the same ratio, the transitions are equally probable. The elapsed time before triggering affects the probability of a transition.
The computation of probabilities depends on the paths starting in the initial state. Given an infinite path of successive states π=s _{0} s _{1} s _{2}…, π[i]=s _{ i }, while π(t) returns the state s _{ i } of the path in which the system is found after \(t \in \mathbb {R}_{\geq 0}\) time units since the initial state s _{0} (i.e., π[0]=π(0)=s _{0}). For any set of infinite paths Π starting in the initial state s _{0}, the subset Π(π _{ n }) selects the paths π∈Π whose prefix equals to the finite sequence π _{ n }=s _{0} s _{1} s _{2}…s _{ n } of length n+1 states.
An alternative and more stable standard technique for calculating the probability matrix P _{ t }(s,s ^{′}) is presented in [25].
Assuming that phylogenetic trees are the state spaces produced by the executions of evolution models, they are also accepted as PTSs and Markov chains. Each transition connecting two nodes of the tree has the same information, probability and temporal requirements than the transition connecting an equivalent pair of states of the evolution model. Selfloops with unitary probability are printed in final states of the tree to produce infinite traces of presentday taxa because of uncertainty of future traits and species (Fig. 4). This trick is necessary for evaluating temporal properties and imposing conditions for the mathematical resolution of probabilities. The theoretical concepts, logics and tools will be detailed in the following sections.
Specification of phylogenetic properties over phylogenies
During the last years, new formalisms and tools have started to accept properties expressed in probabilistic temporal logics for capturing and analyzing the underlying randomness of stochastic systems. Probabilistic model checking combines the expressiveness for representing paths of a computational model using temporal logics with the calculation of likelihoods associated to the route. Our objective is to enrich the original applications presented in [8] with probabilities and explicit time in order to obtain a more flexible framework capable of solving complex problems.
Continuoustime probabilistic logic as a specification language
Once the data structure and semantic of the model are defined, the next step involves the presentation of the syntax and semantics for the stochastic temporal logic. The logic proposed here for working with CTMCs is CSL^{TA} [14].
Definition 2 (Continuoustime Stochastic Logic).

\(M, s_{0} \vDash p \Leftrightarrow p \in L \left (s_{0} \right)\),

\(M, s_{0} \vDash \neg \phi \Leftrightarrow M, s_{0} \nvDash \phi \),

\(M, s_{0} \vDash \phi \vee \psi \Leftrightarrow M, s_{0} \vDash \phi \) or \(M, s_{0} \vDash \psi \),

\(M, s_{0} \vDash \mathbb {P}_{\sim \lambda } \left [ \Phi \right ] \Leftrightarrow Prob(M,s_{0},\Phi) \sim \lambda \),

\( M,\pi \vDash \mathbf {X} \phi \Leftrightarrow M,\pi [1] \vDash \phi \)

\(M,\pi \vDash \left [ \phi \mathbf {U}_{I} \psi \right ] \Leftrightarrow \) for some t∈I, \(\exists \pi : M, \pi (t) \vDash \psi \), and \(M, \pi (t') \vDash \phi \) for all 0≤t ^{′}<t with I an interval of reals.
The set \(\{\pi \in \Pi \:\: M, \pi \vDash \Phi \}\) is obtained by the union of finitely many pairwise disjoint subsets Π(π _{ n }) by ([26] Definition 3), each one characterized by the finite prefix π _{ n } of all infinite sequences of the set satisfying Φ. Therefore, the probability function P r o b(M,s _{0},Φ) is computed as the summation of probabilities in all possible prefixes π _{ n } by ([26] Theorem 1). That is, \(Prob(M, s_{0},\Phi) = \Sigma _{\pi _{n}} Pr(\Pi (\pi _{n}))\phantom {\dot {i}\!}\) where Pr is a specialized function for obtaining the probability of the path defined by π _{ n }. The calculation of the probability will be detailed in a further section.
In qualitative branchingtime logics such as CTL, every path formula imposes a reachability relation or patterns among all the states that satisfy some propositional requirements. That is, they define sequences of events that must be found in the model. CTL reinterprets the quantifiers of firstorder logic as path quantifiers, expressing the fulfillment of a property throughout all computation paths (A), or at least exists one computation path (E). These two must be immediately qualified by one of five temporal operators, of which three express the satisfaction of a property eventually in a future state (F), generally at all states (G), or in the next state (X); and two are conditional constructs in which a precedent is verified until a consequent comes into force (U), or until and including the moment when it does, if it does (R). A complete grammar and semantics of CTL formulas can be defined from a minimal subset of logical operators [27].
CSL^{TA} supports timed transitions in the U operator. Timed variants of the modal operators F and G are obtained via U as F _{ I } ϕ=t r u e U _{ I } ϕ and G _{ I } ϕ=¬F _{ I }¬ϕ. Instead of writing probability and intervals explicitly, they can be abbreviated with inequalities. For example, \(\mathbb {P}_{\leq 0.5}\left [\Phi \right ]\) denotes \(\mathbb {P}_{[0,0.5]}\left [\Phi \right ]\). This temporal extension constrains the length of the computational path specified in the logical equation. The operator X implicitly takes the real time elapsed for a consecutive branch of the current state. By now, phylogenetic properties querying about the structure of the tree or mutation traces can be expressed in CSL^{TA}.
Sequence properties
In a nonprobabilistic temporal logic, it is already possible to query a wide range of questions about the arrangement of the nucleic information in the states of the tree [8]. However, what we look for here is the addition of probabilistic information to the specifications. For instance, a preliminary probabilistic study focused on lactose intolerance within humans [12]: it identifies the haplogroups having a higher concentration of SNP’s associated to this disease in the leaves of the phylogenetic tree.
In a similar way to evolution models, phylogenetic properties can be classified as sequence properties or topological properties depending on the information they extract from the phylogeny. In general, sequence properties are based on the content of the states, i.e., those propositions that are evaluated within a node and without need of temporal operators. Usually, they are composed of simple patterns whose application scope is restricted to the surrounding nodes or the entire phylogeny. Such types of state formulas are called patterns(p,t), being p the propositions that must be true and t an optional temporal parameter (t=∞ when omitted). They provide a powerful descriptive formalism for formulating general restrictions without the limitations of ad hoc approaches. Often, these properties may be used not necessarily to forbid patterns, but as queries and alerts to signal unusual, possibly anomalous behavior, and mark it for further study.

Conservation is modeled as a restriction on the symbols that can occur at a given position in a sequence. Commonly, the pattern is codified according to a set of (im)permissible symbols. It is possible to define general families of compatible elements, not bounded to specific positions, and restrict their usage to exactly one of these positions when needed.

Covariation imposes a relation of dependence between two (or more) positions in a sequence due to some physicalchemical constraints or the internal structure of the genome. It represents the set of symbols that may appear in the second place in association to each symbol in the first place.
The variation of the probabilistic parameter determines the scope (i.e., width) of the pattern over the complete state space. Lineagespecific haplogroup verification represents a further step forward, where patterns would be used to define conditional explorations of the sets of states of interest of the phylogeny.
Deleterious mutations. Exceptions to the aforementioned properties may in fact indicate suspicious or potentially deleterious mutations, which are of great interest in applied phylogenetic studies [28]. Furthermore, known or suspected mutations of this kind can be explicitly modeled as patterns and their positioning in a phylogeny assessed. In particular, those affecting important metabolic functions are expected to prevent or hinder the reproduction of the organism, and consequently should be confined in or near terminal leaves.
Back mutations. It is possible to determine whether a given tree is free of back mutations or reversions, which we abbreviate BM. Given an internal node, there is a back mutation in that subtree involving a position j of the alignment if at some point in some descending path from the node we find a different symbol (\(\overline {\sigma }_{\mathit {j}}\)) than that found in the root of the subtree (σ _{ j }). Then, at some point in the subtree hanging from that intermediate the symbol from the root reoccurs.
In these two formulas we present a nontrivial modeling example of a cladistic property with a heavy use of sequence data. The goal is to detect if the tree is free of back mutations (equivalently, it detects those points in the tree where back mutations occur due to counterexamples, if any). In the case of finite domains, such as the set of DNA sequences, the evaluation of logical quantifiers ∀ and ∃ can be substituted by multiple instances of boolean formulas connected by the \(\bigwedge \) and \(\bigvee \) operators. In sum, we have formalized the concept of back mutation in probabilistic temporal logic. The addition of intervals refine the length of the paths in the property.
Topological properties
Identifying if a phylogenetic tree reflects the correct structure of the evolution is one of the most important topological questions. Given a phylogenetic tree inferred from an alignment, it should be compared to the speciation model that we use as reference, for instance, a DNA mutation model. The computation of maximum likelihoods quantifies the fitness or probability that the topology obtained by the alignment follows the same trace imposed by a mutation model [29].
Due to the uncertainty of the true evolution process generating the tree of life, the estimation of maximum likelihoods works as a metric for comparing the topology of two or more trees built over the same data set (the higher value for this score, the better). The selection of other inferred phylogenies, alignments or alternative DNA models modifies the probabilities of the arrangements and returns different scores [30].
Nevertheless, the search for the tree with maximum likelihood requires the evaluation of a considerable number of equations over a great tree space, which converts the searching process using scores in an NPhard problem [1]. Despite the storage of partial solutions (\(L^{l}_{X,i}\)), more heuristics must be introduced in order to gain feasible solutions.
Extra simplifications consist of a) the propagation of the local maximum \(L^{l}_{X,i}\) instead of the four internal values \(L^{l}_{X,i}\) and b) the assignation of a preliminary character to the internal ancestors of the phylogenetic tree (for instance, using a maximum parsimony method when there is no ambiguity in the selection of the nucleotide). The main drawback of these approaches is the extraction of a local maximum instead of the real solution. The selection of the initial seed and the intensive sampling method of the tree space determines the quality of the approximation to the real likelihood value for the phylogeny [33].
In Section “Case of study: maximum likelihood of a phylogenetic tree” we show how a model checking tool can test and compute the reliability of a tree according to the defined mutation model. The mathematical equations of MLE for evaluating the phylogeny are rewritten using probabilistic temporal logics. Later, they are executed over a CTMC chain corresponding to a DNA mutation model. Finally, the model checking procedure queries whether the probability of appearance of a particular mutation in the tree is over or not a predefined threshold. The evaluation of the CSL^{TA} formulas needs the procedure for computing probabilities introduced in the next section.
Results and discussion
Algorithm and tools for model checking
The evaluation of biological hypotheses qualified with probability and time requires the extension of the traditional model checking algorithms [27]. The first point of this section is devoted to the introduction of functions and numerical equations for solving the quantitative part of the formulas, which is the main difference with respect to previous model checking algorithms. The second point focus on the tools implementing this extension.
Algorithm for CSL^{TA}
The value E(s)=0 means that s is an absorbing state or siphon. The path probability for each path operator is now calculated as follows.
\(\mathbb {P}_{\sim \lambda }[ \mathbf {X} \phi ]\) formula.
In CSL^{TA}, the next operator has no sense as in continuoustime there is not an unique next real number. This operator is included for compatibility. It detects if any direct transition between a state s and a successor s ^{′} has probability P _{ e m b(M)}(s,s ^{′})∼λ in the embedded DTMC.
\(\mathbb {P}_{\sim \lambda }[ \psi \mathbf {U}_{I} \phi ]\) formula.

I=[0,t] with \(t\in \mathbb {R}_{\geq 0}\);

I=[t,t ^{′}] with \(t, t'\in \mathbb {R}_{\geq 0}\) and t≤t ^{′};

I=[t,∞] with \(t\in \mathbb {R}_{\geq 0}\).
For the case of I = [0,t], the probability P r o b(M,s,ψ U _{[0,t]} ϕ) is equal to: (i) 1, if s∈S a t(ϕ)(ii) 0, if s∈S a t(¬ψ∧¬ϕ) or t=0(iii) \(\Sigma _{\substack {s \in Sat(\psi) \\ s' \in Sat(\phi)}} \mathbf {P}_{t}(s,s')\phantom {\dot {i}\!}\), otherwise
The embedded DTMC must be free of loops (infinite paths) of intermediate states so as to have a finite and solvable numerical equation system.
Model checking tools
A model checking tool or verifier requires two input files for the verification process: a first file with the description of the model, and a second file with the specification of the properties. Traditionally, it returns if the specification is satisfied in the model or not, and a counterexample that falsifies it if required. In probabilistic model checking, extra information shall be provided such as the minimum probability for which the property succeed.
The modularity and independence of model and specifications allow the evaluation of several properties over the same model and, vice versa, the test of the same phylogenetic requirements over different computational models. The encapsulation of the underlying technological implementation details (e.g., multithreading, data bases, etc.) is one of the most important features of this kind of tools. The main advantages of the model checking framework are the abstract description of model and properties using logics and mathematical formalisms, and the portability of these files between computers that have installed the same model checking tool.
There exists a considerable diversity of verifiers with different performances, qualities and designs [34]. Among all the model checking tools developed for probabilistic systems, PRISM [35], MRMC [36] and CADP [37] are the most important ones.
In our case, we have selected PRISM for the experimentation for several reasons. PRISM is a generic statebased model checking tool capable of handling probabilistic and timed specifications over Markov chains. PRISM offers Java portability, a powerful syntax for handling time and probabilities in models and specifications, and a good scientific community support. Besides, it is open source, which allows the modification and optimization of its code. The real performance depends on the particular structure of the model and specifications. This fact is shown in the example of case of study.
Case of study: maximum likelihood of a phylogenetic tree
Formal methods help for describing mutation models and evaluating properties over phylogenetic trees. Among all the potential applications, we focus on the problem of analyzing the statistical probability of a phylogeny. Maximum likelihood is one of the most common methods for phylogenetic inference. Given a mutation model, it scores the quality of the topology for a inferred tree. However, the incorporation of new models into current software tools requires an expertise in programming languages and a deep knowledge of the source code. This fact limits the extension, upgrade, debugging and maintenance of a software tool with new models defined by the biologists.
To this end, we show how to define a tree and a mutation model using the description language of CSL^{TA} and CTMCs. A phylogenetic tree is formulated as a succession of nucleotide mutations that are checked over a mutation model. The verification process returns the likelihood of obtaining the tree from a particular model. We have analyzed several phylogenies using the PRISM model checking tool. In addition, we have adapted the software for optimizing the computation of maximum likelihoods.
Translation to PRISM
Our framework plays the role of worker for computing the maximum likelihood during the evaluation of the tree topology in current inference algorithms and tools ([38], Fig. 2). The model checking tool returns a confidence value between [0,1], which represents the likelihood of getting a peculiar arrangement of nucleotides in the states of the tree with respect to a model of mutations. The score provided by the result of the verification process is useful for guiding the iterative refinement of the phylogenetic tree. Our tool requires an external tree generator.
The computation of the probability is modeldependent. Some of the historical mutation models are JukesCator (JC), Kimura, Felsenstein or the Generalized timereversible (GTR) [39]. For simplicity in this example, we consider the JC model. The substitution rate nu equals to 1. It expresses the number of ticks needed for the activation of the transition rather than an explicit probability.
Besides asking for a particular bound, PRISM includes iterative methods for calculating maximal (minimal) probabilities of a path of the tree. For example, the operator P=?[F<=5 x1 = A] returns the probability of reaching a nucleotide A at position x1 of the alignment within 5 time steps in the future. It is equivalent to \(\mathbb {P}_{=?}[\mathbf {F}_{<=5} seq[1] = A]\) in CSL^{TA} syntax, but ? is the threshold calculated by the model checking tool instead of an explicit probability.
PRISM allows the specification of the initial state. For example, filter(min, P=? [F<=5 x1 = A], x1 = C) returns the minimum probability taking the sequence x1 = C as the initial state. It ensures that there exists a path starting in x1 = C that eventually reaches x1 = A. The properties are annotated with names for defining the partial likelihoods \(L^{l}_{X,i}\). In case of phylogenetic trees with estimated nucleotides in the ancestors (such as in the case of maximum parsimony), the intermediate nodes are initialized with constant values \(L^{l}_{X,i}=1\) or \(L^{l}_{X,i}=0\) for reducing the initial calculations. The header of the file describes the notation of the variable names. The constant dxy is the distance between states X and Y.
However, the current release of PRISM does not store the partial results \(L^{l}_{X,i}\) in local memory, which implies a reevaluation of \(L^{l}_{X,i}\) every time it is accessed. The inability for caching these values damages the potential optimization caused by the factorization of the probabilities P _{ ij }(·) in Eq. 3. This peculiarity, together with the inherent delay introduced by the Java virtual machine, penalizes the performance in comparison to other specific systems for computing maximum likelihoods [22, 23]. In order to solve this problem, we have modified PRISM for caching the partial results \(L^{l}_{X,i}\). Applying this simple change, the model checker slightly improves its performance.
Experimentation
All tests have been run on a Intel Core 2 Duo E6750 @ 2.66 GHz with 8 GB RAM and Linux. Memory is not a limiting factor because the experiments have been successfully executed with only 4GB assigned to the Java virtual machine. The factorization of the equations and the initialization cost in PRISM are delays that must be considered for the judgment of the results.
The exponential trend observed in the graphic is explained by the exploration method selected by PRISM for calculating the probability function P _{ ij }(d). In this way, PRISM unfolds the DNA mutation model and generates all the possible traces between the bases i and j, applying implicitly an exhaustive searching method. PRISM looks for all the paths of n nodes with distance Σ _{ k=1…n } d _{ k }=d that maximize P _{ ij }(d), being i (j) the initial (final) state and d _{ k } the distance between each pair of intermediate states. The combinatorial exploration of the paths satisfying the previous restrictions of length and probability leads to the exponential cost for calculating P _{ ij }(d).
This framework becomes competitive for obtaining an upperbound approximation for MLE efficiently. Future optimizations should focus on the implementation of more efficient libraries and dynamic programming techniques that solve, or at least ease, the aforementioned limitations and bottlenecks in PRISM.
Conclusions
Model checking is a generic framework that has been applied for qualitative analysis over phylogenies. The main advantages of this approach are the legibility and soundness of formal methods based on temporal logics, the independence of the model from the specifications and the availability of powerful verifiers that automatically analyze the specifications and find counterexamples. Besides, the encapsulation of the implementation in a model checking tool hides and simplifies the access to the underlying technology.
Nevertheless, some phylogenetic properties are beyond the expressiveness of firstorder temporal logics used in that environment. In this paper, we have completed the previous framework including explicit time and probabilistic information. To this end, we have applied stochastic extensions of the model checking framework for querying quantitative properties over phylogenetic trees and mutation models. The inclusion of explicit time and probabilities matches naturally with the speciation models. In this sense, we have introduced a stochastic logic, data structure and methods adapted for manipulating and computing probabilities over continuoustime systems. Several quantitative properties have been specified.
In order to prove the feasibility of our approach, we focused on the problem of analyzing the statistical probability of a phylogeny. We have studied the likelihood of obtaining a phylogenetic tree through the evaluation of maximum likelihood estimations over DNA mutation models. A phylogenetic tree is formulated as a set of paths using temporal logics (i.e., a path is understood as a succession of nucleotide mutations that are checked over a mutation model). The verification process returns the likelihood of obtaining that tree from a mutation model.
We have shown how to translate the model and specifications to the particular notation of this framework. We have analyzed several synthetic (random) phylogenies using the PRISM model checking tool. A lineartime heuristic has been proposed for the calculation of an upper bound of the likelihood score. We have customized the tool in order to slightly improve the verification costs.
This work opens the door for the review of bigger systems with quantitative properties similar to those defined in this paper. The modularity of our framework allows the evaluation of hypotheses and the comparison of results for a set of phylogenetic trees by only changing the tree file (the specification of the property remains constant). Finally, the search for the valuations that verify a certain specification leads to an intensive exploration of the formula space or the solution of linear systems. The introduction of parametric model checking for the automatic discovery and mining of phylogenetic information outlines our future work.
Abbreviations
AP, atomic proposition; BM, back mutation; CADP, construction and analysis of distributed processes; CSL^{TA}, continuous stochastic logic with timed automata; CTL, computational tree logic; CTMC, continuoustime Markov chain; DTMC, discretetime Markov chain; GTR, generalized timereversible; JC, JukesCator; MLE, maximum likelihood estimation; MRMC, Markov reward model checker; PCTL, probabilistic CTL; PTS, probabilistic transition system
Declarations
Acknowledgments
We would like to thank the Spanish Ministry of Economy (SPE) under the program “Programa de I+D+i Estatal de Investigación, Desarrollo e innovación Orientada a los Retos de la Sociedad”, project id TIN201340809R, for supporting this work.
Funding
This work was supported by the Spanish Ministry of Economy (SPE) under the program “Programa de I+D+i Estatal de Investigación, Desarrollo e innovación Orientada a los Retos de la Sociedad”, project id TIN201340809R. SPE had no role in the design of the study and collection, analysis, and interpretation of data. SPE did not contribute in writing the manuscript and the decision to submit the manuscript for publication.
Availability of data and material
The synthetic data set used for the experiments was generated using the Tree::RandomFactory functions of BioPerl 1.6.0. The scripts for translating the maximum likelihood equations into the notation of PRISM model checking tool are publicly available at GitHub with DOI http://dx.doi.org/10.5281/zenodo.50545.
Authors’ contributions
JMC designed the study. JIR devised the algorithms, implemented them and conducted the experiments. JIR wrote the manuscript. Both authors read and approved the final manuscript.
Competing interests
The authors declare that they have no competing interests.
Consent for publication
Not applicable.
Ethics approval and consent to participate
Not applicable.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License(http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
Authors’ Affiliations
References
 Felsenstein J, Vol. 2. Inferring Phylogenies. Sunderland, Massachusetts: Sinauer Associates; 2003.Google Scholar
 Yang Z, Rannala B. Molecular phylogenetics: Principles and practice. Nat Rev Genet. 2012; 13(5):303–14.View ArticlePubMedGoogle Scholar
 Fitch WM. Uses for evolutionary trees. Philos Trans R Soc Lond Series B Biol Sci. 1995; 349(1327):93–102.View ArticleGoogle Scholar
 CavalliSforza LL, Feldman MW. The application of molecular genetic approaches to the study of human evolution. Nat Genet. 2003; 33:266–75.View ArticlePubMedGoogle Scholar
 Holden C, Mace R. Phylogenetic analysis of the evolution of lactose digestion in adults. Hum Biol. 2009; 81(5/6):597–619.View ArticlePubMedGoogle Scholar
 Mooers AO, Heard SB. Inferring evolutionary process from phylogenetic tree shape. Q Rev Biol. 1997; 72(21):31–54.View ArticleGoogle Scholar
 Grumberg O, Veith H. 25 Years of model checking: history, achievements, perspectives. Berlin: Springer; 2008.View ArticleGoogle Scholar
 Requeno JI, de Miguel Casado G, Blanco R, Colom JM. Temporal logics for phylogenetic analysis via model checking. IEEE/ACM Trans Comput Biol Bioinform. 2013; 10(4):1058–70.View ArticlePubMedGoogle Scholar
 Requeno JI. Formal methods applied to the analysis of phylogenies: Phylogenetic Model Checking PhD thesis: School of Engineering and Architecture, University of Zaragoza; 2014.Google Scholar
 Baier C, Katoen JP. Principles of model checking. Cambridge, Massachusetts: The MIT Press; 2008.Google Scholar
 Clarke EM, Emerson EA, Sistla AP. Automatic verification of finitestate concurrent systems using temporal logic specifications. ACM Trans Program Lang Syst (TOPLAS). 1986; 8(2):244–63.View ArticleGoogle Scholar
 Requeno JI, Colom JM. Analyzing phylogenetic trees with timed and probabilistic model checking: The lactose persistence case study. J Integr Bioinform. 2014; 11(3):248.PubMedGoogle Scholar
 Hansson H, Jonsson B. A logic for reasoning about time and reliability. Form Asp Comput. 1994; 6(5):512–35.View ArticleGoogle Scholar
 Donatelli S, Haddad S, Sproston J. Model checking timed and stochastic properties with CSL^{TA}. IEEE Trans Softw Eng. 2009; 35(2):224–40.View ArticleGoogle Scholar
 Konur S. A survey on temporal logics for specifying and verifying realtime systems. Frontiers of Computer Science. 2013; 7(3):370–403. doi:10.1007/s1170401321952. http://dx.doi.org/10.1007/s1170401321952 View ArticleGoogle Scholar
 Lewis P, Holder M, Swofford D. Phycas: software for phylogenetic analysis: Storrs, CT: University of Connecticut; 2008. See www.phycas.org.
 Höhna S, Heath TA, Boussau B, Landis MJ, Ronquist F, Huelsenbeck JP. Probabilistic graphical model representation in phylogenetics. Syst Biol. 2014; 63(5):753–71.View ArticlePubMedPubMed CentralGoogle Scholar
 Bouckaert R, Heled J, Kühnert D, Vaughan T, Wu CH, Xie D, Suchard MA, Rambaut A, Drummond AJ. Beast 2: a software platform for bayesian evolutionary analysis. PLoS Comput Biol. 2014; 10(4):1003537.View ArticleGoogle Scholar
 Stadler T. Evolving trees: Models for speciation and extinction in phylogenetics. PhD thesis. 2008.Google Scholar
 Steel M, McKenzie A. Properties of phylogenetic trees generated by Yuletype speciation models. Math Biosci. 2001; 170(1):91–112.View ArticlePubMedGoogle Scholar
 Fitch WM. Toward defining the course of evolution: Minimum change for a specific tree topology. Syst Biol. 1971; 20(4):406–16.View ArticleGoogle Scholar
 Stamatakis AP, Ludwig T, Meier H. RAxMLIII: A fast program for maximum likelihoodbased inference of large phylogenetic trees. Bioinformatics. 2005; 21(4):456–63.View ArticlePubMedGoogle Scholar
 Stamatakis A. RAxML version 8: A tool for phylogenetic analysis and postanalysis of large phylogenies. Bioinformatics. 2014; 30:1312–3.View ArticlePubMedPubMed CentralGoogle Scholar
 Yang Z, Vol. 284. Computational molecular evolution. New York: Oxford University Press; 2006.View ArticleGoogle Scholar
 Stewart WJ. Probability, Markov chains, queues, and simulation: the mathematical basis of performance modeling. Princeton, New Jersey: Princeton University Press; 2009.Google Scholar
 Kwiatkowska M, Norman G, Parker D. Stochastic model checking In: Bernardo M, Hillston J, editors. 7th International School on Formal Methods for Performance Evaluation. LNCS. Berlin: Springer: 2007. p. 220–70.Google Scholar
 Clarke EM, Grumberg O, Peled DA. Model checking. Cambridge, Massachusetts: The MIT Press; 2000.Google Scholar
 Montoya J, LópezGallardo E, DíezSánchez C, LópezPérez MJ, RuizPesini E. 20 years of human mtDNA pathologic point mutations: Carefully reading the pathogenicity criteria. Biochimica et Biophysica Acta. 2009; 1787(5):476–83.View ArticlePubMedGoogle Scholar
 Felsenstein J. Evolutionary trees from DNA sequences: A maximum likelihood approach. J Mol Evol. 1981; 17(6):368–76.View ArticlePubMedGoogle Scholar
 Lio P, Goldman N. Models of molecular evolution and phylogeny. Genome Res. 1998; 8(12):1233–44.PubMedGoogle Scholar
 Cho A. Constructing phylogenetic trees using maximum likelihood. PhD thesis, Scripps Senior These. 2012.Google Scholar
 Wägele JW. Foundations of phylogenetic systematics. Münich: Pfeil; 2005.Google Scholar
 Tuffley C, Steel M. Links between maximum likelihood and maximum parsimony under a simple model of site substitution. Bull Math Biol. 1997; 59(3):581–607.View ArticlePubMedGoogle Scholar
 Jansen DN, Katoen JP, Oldenkamp M, Stoelinga M, Zapreev I. How fast and fat is your probabilistic model checker? An experimental performance comparison In: Yorav K, editor. Proceedings 3rd International Haifa Verification Conference on Hardware and Software, Verification and Testing. LNCS. Berlin: Springer: 2008. p. 69–85.Google Scholar
 Marta K, Gethin N, David P. PRISM 4.0: Verification of probabilistic realtime systems In: Gopalakrishnan G, Qadeer S, editors. Proceedings 23rd International Conference on Computer Aided Verification. LNCS. Berlin: Springer: 2011. p. 585–91.Google Scholar
 Katoen JP, Khattri M, Zapreev IS. A markov reward model checker. In: Proceedings 2nd International Conference on the Quantitative Evaluation of Systems. IEEE: 2005. p. 243–244, doi:10.1109/QEST.2005.2.
 Mateescu R, Requeno JI. Onthefly model checking for extended actionbased probabilistic operators In: Bošnački D, Wijs A, editors. 23rd International SPIN Symposium on Model Checking of Software. Springer: 2016. vol. 9641 p. 189–207.Google Scholar
 Stamatakis AP, Ludwig T, Meier H. The AxML program family for maximum likelihoodbased phylogenetic tree inference. Concurr Comput Pract Experience. 2004; 16(9):975–88.View ArticleGoogle Scholar
 Darriba D, Taboada GL, Doallo R, Posada D. jModelTest 2: More models, new heuristics and parallel computing. Nat Methods. 2012; 9(8):772–2.View ArticlePubMedPubMed CentralGoogle Scholar
 Stajich JE, Block D, Boulez K, Brenner SE, Chervitz SA, Dagdigian C, Fuellen G, Gilbert JGR, Korf I, Lapp H, et al. The Bioperl toolkit: Perl modules for the life sciences. Genome Res. 2002; 12(10):1611–8.View ArticlePubMedPubMed CentralGoogle Scholar