Does the choice of nucleotide substitution models matter topologically?
 Michael Hoff†^{1},
 Stefan Orf†^{1},
 Benedikt Riehm†^{1},
 Diego Darriba^{2} and
 Alexandros Stamatakis^{1, 2}Email author
https://doi.org/10.1186/s128590160985x
© Hoffet al. 2016
Received: 26 February 2016
Accepted: 9 March 2016
Published: 24 March 2016
Abstract
Background
In the context of a master level programming practical at the computer science department of the Karlsruhe Institute of Technology, we developed and make available an opensource code for testing all 203 possible nucleotide substitution models in the Maximum Likelihood (ML) setting under the common Akaike, corrected Akaike, and Bayesian information criteria. We address the question if model selection matters topologically, that is, if conducting ML inferences under the optimal, instead of a standard General Time Reversible model, yields different tree topologies. We also assess, to which degree models selected and trees inferred under the three standard criteria (AIC, AICc, BIC) differ. Finally, we assess if the definition of the sample size (#sites versus #sites × #taxa) yields different models and, as a consequence, different tree topologies.
Results
We find that, all three factors (by order of impact: nucleotide model selection, information criterion used, sample size definition) can yield topologically substantially different final tree topologies (topological difference exceeding 10 %) for approximately 5 % of the tree inferences conducted on the 39 empirical datasets used in our study.
Conclusions
We find that, using the bestfit nucleotide substitution model may change the final ML tree topology compared to an inference under a default GTR model. The effect is less pronounced when comparing distinct information criteria. Nonetheless, in some cases we did obtain substantial topological differences.
Keywords
Phylogenetics Nucleotide substitution Model selection Information criterion BIC AICBackground
Statistical models of DNA evolution as used in Bayesian inference (BI) and Maximum Likelihood (ML) methods for phylogenetic reconstruction are typically required to be timereversible. That is, evolution (or the Markov process modeling it) is assumed to occur in the same way if followed forward or backward in time. Timereversibility has some intrinsic computational advantages, mainly that, the likelihood of a tree one intends to score is the same for any placement of the virtual root that is deployed to direct the Felsenstein pruning algorithm [1].
As implied by the above representation the rates r _{XY} in the upper and lower diagonal part of this matrix must be symmetrical. Therefore, there can be at most 6 distinct, independent rates. Because rates in such a matrix are relative rates, one rate (typically r _{GT}) is set to 1.0 by default such that there are at most 5 free parameters that can be estimated via ML optimization or sampled with MCMC methods for BI. A timereversible DNA substitution matrix can have between 1 up to 6 distinct rates. Note that, we further differentiate between substitution models with identical base frequencies (i.e., π _{ A }=…π _{ t }=0.25) or distinct base frequencies. Distinct base frequencies can be either be obtained by using empirical base frequencies or via direct sampling (BI) or ML inference of the base frequencies. Nonetheless, here we mainly focus on models as defined by substitution rates and not base frequencies. If all 6 rates and all base frequencies are identical we obtain the JukesCantor (JC [2]) model and if all 6 rates are independent, we obtain the General TimeReversible (GTR [3]) model.
Evidently, there is a large number of possible timereversible models situated between the JC and GTR model. We can essentially consider all possible combinations of dependencies between the 6 substitution rates. In the literature, only a subset of these has been explicitly described and named, for instance the HKY85 model [4]. We are aware of two papers and respective implementations that consider all possible timereversible nucleotide models and intend to select the bestfitting one among these 203 substitution matrices.
The first paper that addressed the problem used Bayesian statistics and MCMC methods to sample among all possible substitution matrices proportional to their posterior probability [5]. The method for integrating over all substitution models is available in MrBayes version 3.2 [6] via the command lset nst=mixed.
Another code that allows for testing all possible substitution matrices in a ML framework based on the AIC (Akaike Information Criterion), AICc (corrected AIC) BIC (Bayesian Information Criterion), and DT (Decision Theory) criteria is jModelTest 2 [7]. However, jModelTest relies on PHYML [8] to calculate likelihood scores for a given substitution model and tree. PHYML is relatively slow compared to RaxML (v. 8.1.7 [9]) and the Phylogenetic Likelihood Library (PLL [10]) that relies on the same phylogenetic likelihood kernel implementation as RaxML. For instance, optimizing the model parameters and branch lengths on a fixed tree with 354 ITS sequences [11] using a GTR model with Γdistributed rate heterogeneity among sites [12] and 4 discrete rate categories requires 11 s with RaxML and 110 s with PHYML.
Here, we present and make available a fast implementation of a MLbased model testing tool for all possible timereversible DNA substitution models. It was developed by three students within the context of a programming practical at the Master level in the computer science department of the Karlsruhe Institute of Technology. The educational aspects of this project are discussed in the online supplement (Additional file 1).
The effect of different selection criteria and substitution models on MLbased phylogenetic analyses has already been discussed before. Darriba et al. [7] evaluated the accuracy of distinct model selection criteria with respect to recovering the true generating model parameters (e.g., substitution rates, proportion of invariant sites, or the α shape parameter of the Γ distribution) under simulation, using 40,000 synthetic alignments under a wide range of model parameters.
Ripplinger and Sullivan [13] evaluated the influence of model choice on empirical tree topologies as determined by using the hierarchical Likelihood Ratio Test (hLRT [14]), the AICc and BIC criteria, as well as DT. In their tests, the selection criteria at hand, often returned different bestfit models. These different bestfit models then also induced distinct ML trees. Ripplinger and Sullivan exclusively focused on selecting named models in their study, that is, they did not select among all possible 203 substitution models. They did, however, also investigate the impact of model selection on bootstrap support values. Note that, this was not feasible in our study because of the limited amount of time and credits available to the students for completing the programming practical.
Nonetheless, our work fills the gap by the two aforementioned papers [7, 13] since we study the impact of selecting among all possible 203 substitution models on MLbased tree inferences on empirical datasets compared to tree inferences conducted under a default GTR +Γ model.
Initially, we present two algorithms for generating all possible substitution models. Thereafter, we discuss the sequential and parallel implementation using the PLL. Subsequently, we present a set of experiments on empirical datasets to answer the following question: Does model selection really matter with respect to its impact on the shape of the final tree topology? Posada and Buckley discussed the potential impact of the sample size on AICc and BIC criteria [15]. We also assess if different possible definitions of the sample size as used for calculating information criteria influence the model that will be selected. In addition, we conduct a small simulation study to verify that our tool works correctly. We conclude with speedup data for the parallel implementation of our tool.
Methods
Algorithms
The model test needs to calculate the ML score for every possible timereversible nucleotide substitution model. A full list of all possible models was already available [5]. Nonetheless, part of the programming task was to design algorithms for generating them.
In the following, we first provide some background information and notation and then describe our two algorithms for generating all timereversible nucleotide substitution models.
Notation, terminology & properties
In analogy to [5] we encode a substitution model as a string of six digits over Σ _{6}:={1,2,3,4,5,6}. For example the string 111333 describes a substitution model with the following constraints: r _{AC}=r _{AG}=r _{AT} and r _{CG}=r _{CT}=r _{GT}. Note that, the PLL [10] offers a similar format for specifying timereversible nucleotide substitution models. The only difference is that it uses the alphabet {0,1,2,3,4,5}.
In this string notation, the GTR model is denoted by 123456, that is, all rates are different and independent from each other. The HKY85 model is denoted by 121121. Here, we constrain r _{AC}=r _{AT}=r _{CG}=r _{GT} and r _{AG}=r _{CT}.
We call a model a kmodel, if it has exactly k distinct substitution rates. Hence, the parameter k characterizes the degree of freedom of a model. We further define Σ _{ k }:={1,2,…,k} as the alphabet of all digits with value ≤k.
While two distinct models over Σ _{6} can consist of different digits/characters, they can specify exactly the same constraints. For example, 121121 and 212212 are semantically identical since they have the same degree of freedom and identical rate dependencies. We call such models redundant.
To avoid generating redundant models we introduce a normalization property, which holds only for one, unique model in a set of semantically identical models. A model string s=s_{1}s_{2}s_{3}s_{4}s_{5}s_{6} (e.g., 123456 induces s_{1}=1,s_{2}=2,…) with digits s_{ i }∈Σ _{6} is normalized, if and only if, for each first occurrence of a value s_{ i } in s the following holds: \(\{\texttt {s}_{1},\texttt {s}_{2},\ldots,\texttt {s}_{i1}\} = \Sigma _{\texttt {s}_{i}  1}\phantom {\dot {i}\!}\). For example, the first occurrence of digit 3 in a model requires the prefix, excluding the current digit, to consist of exactly two digits: 1 and 2. The models 123111, 122132, 123456 fulfill this constraint and the models 311111, 113111, 124356 do not. The semantically equivalent, normalized models 122222, 112111, 123456 for the unnormalized ones in our example can be obtained by consistently replacing characters from left to right. As a consequence of the above definition, a normalized kmodel therefore only comprises characters in Σ _{ k }.
We further define the kprefix of a kmodel m, as the shortest prefix of m that contains all characters in Σ _{ k } at least once. This part of the string is denoted by underlining the respective characters. For example, the 4−model 1243 11 has the 4−prefix 1243, as any shorter prefix, say 124, does not yield all characters in Σ _{4}={1,2,3,4}.
In the following, we first present a bruteforce algorithm and then a more elegant constructive algorithm for generating all model strings.
Brute force approach
Our bruteforce implementation initially enumerates all possible strings of length 6 for the alphabet Σ _{6}={1,2,3,4,5,6}. The enumeration is done by counting from 111111 to 666666 in the base6 numerical system, except that the digits are in the range 1…6 instead of 0…5. In total, this yields 6^{6}=46656 strings out of which only 203 are nonredundant.
To ensure that the algorith only returns normalized model strings, we normalize every enumerated string and subsequently filter out redundant models.
The normalization remaps the digits of the model string such that the normalization property holds. In the model string, every constraint is represented by a unique digit. The transformation now chooses new digits for each constraint. These digits are assigned incrementally starting with 1. For this, the algorithm identifies the constraints of the model string by traversing it from left to right. To properly order the constraints, they are ordered by the index of their corresponding leftmost digit.
The first occurrence of each normalized string encountered in the enumeration of all strings is then added to a final model list. We discard normalized model strings that have already been included in this list. When all strings have been enumerated, normalized, and added to the final model list, the list contains all 203 nonredundant model strings.
Note that, a normalized string always begins with 1, the second digit can only be 1 or 2 etc. We use this observation to optimize the brute force enumeration algorithm. For this, we only place characters in the set {1,…,n} at the nth position of the string. This prevents the algorithm from enumerating a large number of nonnormalized strings. With this modification, only 1×2×3×4×5×6=720 strings are enumerated. This generates only 1.5 % of the strings the naive bruteforce approach enumerates and yields only 720−203=517 redundant models.
Inductive algorithm
Our inductive algorithm uses the set of all kmodels to construct all possible (k+1)−models. The algorithm starts with the single 1−model 111111 (the JC model) and constructs all 31 normalized 2−models (e.g., 111112, 111121, 111122, …). This process continues inductively by incrementing k until the sole possible 6−model 123456 (the GTR model) is generated.
In the following we will explain how a single new (k+1) −model can be derived by applying k↦k+1 on a kmodel by using character replacement. Extending a normalized model to k+1 degrees of freedom requires one character to be replaced by a new character k+1. To maintain the normalization property and to ensure the correctness of this operation, the replacement must not be applied to the kprefix. For example when applying 2↦3 the 2−prefix 12 of model 12 1111 must not be used for a character replacement operation, since the resulting models 321111 and 131111 would not be normalized. To avoid the construction of redundant models, we further restrict the character replacement step to only replace characters of value k. Otherwise, the sets of (k+1)−models that are derived from different ksource models will not be disjoint. To illustrate what can happen, the step 2↦3 is applied to the third character of two 2−models 12 1111 and 12 2111. This will generate the 3−model 123 111 twice. An example for a derivation respecting both constraints is 123 3211 \(\overset {3\mapsto 4}{\longrightarrow }\) 1234 211. Note that, the (k+1)−prefix always extends the kprefix to include the newly inserted character k+1.
The algorithm relies on the property that every conducted replacement operation yields a valid model and thus, a fraction of the overall result. As a consequence, every model derivation step has to be applied to a copy of the source model to avoid overwriting already generated models.
Given a single kmodel, the induction step k↦k+1 is conducted in two phases. First, all characters not contained in the kprefix that have value k are used to derive new models, one for each such character. This generates a set of (k+1)−models, where the (k+1)−prefixes now protect their newly inserted characters. For example the 1−model 1 11111 produces the 2−models 12 1111, 112 111, 1112 11, 11112 1 and 111112. Note that, only the first phase derives (k+1)−models from kmodels and thus produces models with a longer protected prefix.
In the second phase, we use this set of (k+1)−models to recursively derive further (k+1)−models. For each model in the set, we identify all occurrences of k not contained in the (k+1)−prefix. For each such “unprotected” occurrence of k we generate a new model by replacing the character k by k+1. Each new model is then processed recursively in the same way. Note that, only occurrences of k right of the last replaced character are considered. This process continues until we reach the end of the string.
For example the model 1112 11, which has been generated in the first phase of 1↦2 yields 1112 2 \(\bar {\texttt {1}}\) and 1112 12. Notice the changes in both new models and the trailing 1 in the first model marked by a line on top of it. The line indicates that the recursive processing of this model will start at this character and not consider characters left of it. The recursion then yields 1112 22, as the marked 1 can still be replaced.
Implementation
The goal of the programming practical was to develop a program that tests all possible timereversible nucleotide substitution models under Maximum Likelihood (ML) on a fixed, reasonable tree and subsequently selects the best model. Model selection is conducted using the following standard criteria: AIC, BIC, and AICc. In other words, our algorithm represents an MLbased reimplementation of the Bayesian approach deployed in [5]. Once the best model has been selected, the tool shall also conduct a MLbased phylogenetic tree search under the best model.
For likelihood calculations, tree searches, and handling I/O we used PLL [10].

Parse a multiple sequence alignment of DNA sequences

Compute a reasonable, nonrandom, reference tree using parsimony

Generate all 203 possible timereversible substitution matrices

Loop over these 203 matrices and optimize the ML score on the fixed tree for each model using PLL and Γdistributed rate heterogeneity among sites with 4 discrete rate categories

Use AIC, BIC, and AICc to determine the best among the 203 models

Conduct an ML tree search (as implemented in PLL) on the best model

Write the final ML tree to a Newick file
Note that, the actual implementation computes the best model for 5 instead of 3 information criteria. We compute BIC and AICc scores twice to allow for using distinct sample size definitions. In the following we briefly describe the sequential and parallel implementations of the algorithm.
Sequential implementation
Initially, we parse and read the alignment data (either in PHYLIP or FASTA format) into a PLL data structure using the appropriate I/O routines of the PLL. Besides the alignment file name, users can also specify if they want to use empirical base frequencies or conduct a ML estimate of the base frequencies. We then build a reasonable tree on which we are going to calculate the likelihoods for all models. This is done via the randomized stepwise addition parsimony tree algorithm as implemented in the PLL. We then loop over all (precomputed) possible models and set the substitution rate dependencies in the matrix using the appropriate PLL function according to the current model string. Subsequently, we optimize all model parameters (base frequencies, rates, branch lengths, α shape parameter of the Γ model of rate heterogeneity) on the fixed parsimony tree with PLL and calculate the values for the five information criteria. We keep track of the minimum value and the corresponding model for each criterion separately for further downstream analysis. Given the set of best models (at least one and at most three) according to the criteria, we then conduct a ML tree search for each criterion and bestscoring model and write the resulting ML tree to file in Newick format.
Parallel implementation
The runtime of the sequential implementation is dominated by likelihood calculations in the PLL that account for approximately 90–95 % of overall execution time. Thus, the “model optimization” and “tree search” steps can become performance bottlenecks, in particular for large datasets. To this end, we designed an MPIbased Master/Worker approach that evaluates all 203 distinct models in parallel. Note that, the likelihood evaluations for each of the 203 models are completely independent of each other.
The primary goal is to efficiently find the bestfitting model for each information criterion. As already mentioned, we need to optimize the model parameters for each of the 203 models. Thereafter, we can calculate the information criteria using the respective ML score and other relevant information like the degrees of freedom and the dataset properties, as mentioned before.
Since we only need to select the “best” model per information criterion, we can merge the criterion scores of different models by keeping only the minimum^{1} for each criterion. This means that, if we merge the results for two distinct model evaluations, we obtain a list of lowest criterionscore/modelindex pairs. To explain the parallelization we distinguish between “local” or “global” result pair lists. A local result only aggregates the information for a subset of models, whereas the global result aggregates the information over all models.
In the sequential case, the one and only thread starts with a local result list and updates it after each model evaluation. When the last model has been processed the local result corresponds to the global result.
In the parallel case each worker process maintains its own local result list. Worker processes receive work (models) from the master process, evaluate them, and update their local result lists. When all models have been evaluated, we use MPI to collect and automatically merge^{2} all local results lists into the global result list at the master.
In the following, we will briefly outline the MPI communication and synchronization scheme. After asynchronously sending an initial workload to each worker, the master process uses a blocking wait for any worker to finish its model evaluation task. Once a worker has finished, the master again uses an asynchronous send to propagate the next model to be evaluated to this worker. Via the asynchronous send, the master can immediately continue distributing tasks to other workers. The worker employs a blocking receive for a model, evaluates it, updates its own local result list, and informs the master about the task completion. When all models have been distributed and evaluated, the master initiates a parallel reduction operation to obtain the global result list and terminate all MPI processes.
Code quality assessment
In order to continuously monitor and improve the quality of our source code, we used several tools and methods throughout our project. We deployed both, static, as well as dynamic code analyses.
Static analysis
Static analyses help to identify static programming errors such as incorrect programming language syntax or static typing errors. We compiled our code with the gcc compiler using all available and reasonable warning flags^{3}. This allowed us to identify potential programming errors at an early stage. Due to its more pedantic nature (i.e., ability to detect more errors), we also used the clang compiler with respective flags^{4} periodically alongside of gcc to further reduce the amount of potential errors.
Dynamic analysis
These analyses cover runtime issues such as memory leaks. We used valgrind and its submodule memcheck for detecting memoryrelated errors in our program. By appending the Open MPI related valgrind suppression file to the tool, we excluded several MPIspecific memory management functions from the analyses which resulted in an easiertointerpret debugging output.
Experimental setup
To answer our main question, whether models matter or not, we applied our tool to a large number of representative empirical datasets. For each dataset we conducted 36 independent runs with 18 distinct random number seeds and 2 base frequency configurations (empirical and ML estimate) such as to generate 8 distinct randomized stepwise addition order parsimony starting trees. In each run, we determined a total of at most 5 bestscoring models using the following criteria: AIC, AICcS (sample size: #alignment sites), AICcM (samplesize: #alignment sites × #taxa), BICS (sample size: #alignment sites), BICM (samplesize: #alignment sites × #taxa). In addition, in each run we conducted one tree search under the GTR +Γ model. Thus, each out of the 36 perdataset runs generated at most 6 ML trees, 5 using the bestscoring models under the respective criteria and one under GTR +Γ. Note that, different information criteria (like the 3 variants of AIC) tend to choose the same model and therefore typically less than 6 distinct ML trees are generated overall per run. Thus, the total number of trees inferred for each dataset is ≤6·36=216. For subsequent analyses we then selected the trees with the respective highest ML score out of the 36 runs (18 random seeds) for each criterion separately and independently.
Empirical test datasets
Summary of empirical test datasets
#taxa  #sites  Source 

46  1104  Original paper 
64  1620  Original paper 
13  1255  Original paper 
106  378  Original paper 
68  818  Original paper 
15  379  Original paper 
13  273  Original paper 
23  2841  Original paper 
17  379  Original paper 
30  1456  Original paper 
23  9741  Original paper 
18  1689  Original paper 
17  432  Original paper 
31  1140  Original paper 
12  894  Original paper 
27  1949  Lakner et al. 
29  2520  Lakner et al. 
36  1812  Lakner et al. 
41  1137  Lakner et al. 
43  1660  Lakner et al. 
50  1133  Lakner et al. 
59  1824  Lakner et al. 
64  1008  Lakner et al. 
71  1082  Lakner et al. 
50  378  Lakner et al. 
67  955  Lakner et al. 
4  16119  MrBayes 
12  898  MrBayes 
9  720  MrBayes 
123  1606  MrBayes 
150  1269  RAxML 
218  2294  RAxML 
354  460  RAxML 
404  13158  RAxML 
500  1398  RAxML 
10  705  RAxML 
101  1858  RAxML 
714  1241  RAxML 
Hardware platform
With increasing number of taxa, the computations can become timeconsuming. Thus, we used a comparatively large multicore shared memory machine for our experiments (running Ubuntu 14.04). The server comprises four AMD Opteron 6174 processors, each equipped with 12 cores running at 2.2 GHz and has 256 GB RAM. While our parallel implementation can use all 48 cores simultaneously, the computations for all test datasets still required several days.
Results
In the following we describe the results of our experiments. Initially, we verify that our model generation algorithm is correct. Then, we provide speedup values for the parallel implementation. After these technical issues, we assess the magnitude of topological differences between trees inferred under GTR +Γ and trees inferred under the bestfit model according to the respective information criterion. Finally, we quantify differences between AICc, BIC, and AIC and also assess if the sample size definition has an impact on model selection.
Verification of model generation algorithms
We verified the correctness of the two model generation algorithms by systematically comparing the respective outputs to each other. Fortunately, they yield exactly identical sets of model strings. Note that, the two algorithms were developed and implemented completely independently. In addition, we also verified that our set of model strings was identical to those provided in [5].
Speedups
Topological analysis (RF distances)
While for the vast majority of datasets, the rRF distance between the tree inferred under GTR +Γ and the tree inferred under the respective optimal model is zero, there is a substantial number of outliers. For more than 5 % of the tree inferences the rRF distance is larger than 10 % which represents a notable topological difference.
Differences between AICc, BIC, AIC
In the following, we quantify the differences in tree topologies and also in inferred models between the different criteria (AIC, AICc, BIC). To simplify the presentation, we only use AICcS and BICS, since the number of sites is commonly used as sample size.
For comparing the inferred models, we used the model that yielded the bestscoring ML tree under AIC, AICcS, and BICS, respectively. We found that AIC and AICcS propose the same model for 1181 out of 1368 dataset configurations (38 datasets times 36 independent runs per dataset). Further, AIC and BICS select the same model in 683 out of 1368 cases. Finally, AICcS and BICS suggest the same model for 731 out of 1368 dataset configurations.
We observe that, in cases where the proposed models differ, this difference can induce notable rRF distances among the inferred trees.
Impact of sample size definition
In these analyses, we determined the impact of the sample size definition on the models selected by the AICc and BIC criteria. When comparing the model inferred for the respective bestscoring ML tree for AICcS versus AICcM, we found that they yielded different models for 187 out of 1368 datasets. Analogously, for BICS versus BICM different models were proposed in 261 out of 1368 cases.
As in previous experiments, there are a few dataset where the definition of the sample size yields trees that exhibit large topological distances exceeding 10 %.
Selected models
Significance of results
To assess the significance of our results we conducted 50 distinct ML searches under the respective bestfit model and 50 ML searches under the GTR+ Γ model for each dataset. We then calculated the pairwise rRF distances between all pairs of trees from the first (bestfit model) and second (GTR+ Γ) tree set. Thereafter, we compared the average rRF distance among these tree pairs with the random expectation for each empirical dataset. The random expectation is defined as the average rRF distance among all pairs of the 50 trees obtained from the searches under GTR+ Γ. We observed significantly higher average rRF distances than the random expectation between the trees inferred under the bestfit model and GTR+ Γ in 14 out of 41 datasets (p<0.01). Inversely, for 16 out of 41 datasets, we observed differences that were significantly lower than the random expectation (p<0.01). For the remaining 11 datasets, the inferred ML topology was identical, regardless of the model and the tree search replicate.
Model selection accuracy under simulation
In order to further evaluate the accuracy of our model selection process, we created 100 synthetic alignments based on the two empirical datasets (alignment with 27 taxa and 1949 sites from [16], and lice dataset from [5]) which exhibit the highest topological variance. To simulate, we initially determined the bestfit model, bestknown ML tree, and optimized the model parameters for the two datasets. We then deployed INDELible [18] to generate simulated alignments.
Accuracy of model selection for 100 simulations based on two ‘difficult’ empirical datasets
TTR  RF  BS  TMR  

Dataset 1  AIC  0.10  0.176  0.057  0.34 
AICcS  0.09  0.179  0.057  0.40  
AICcM  0.10  0.176  0.057  0.34  
BICS  0.09  0.176  0.055  0.51  
BICM  0.10  0.174  0.054  0.40  
GTR+ Γ  0.09  0.173  0.059  0.00  
Dataset 2  AIC  0.35  0.036  0.001  0.49 
AICcS  0.35  0.036  0.001  0.48  
AICcM  0.35  0.036  0.001  0.49  
BICS  0.35  0.036  0.001  0.14  
BICM  0.35  0.036  0.001  0.02  
GTR+ Γ  0.35  0.037  0.001  0.00 
The perfect topology was recovered more slightly more frequently with selection criteria using the sequence length times the number of taxa as sample size (0.10 vs. 0.09). Inferences conducted under model determined by BIC show lower BS distances than the other criteria or GTR+ Γ (0.54−0.55 vs. 0.57−0.59). However, these differences are not significant given the number of simulation replicates. We also observed that, when using only the sequence length as sample size, the true generating model was recovered more frequently (0.51 vs. 0.34−0.40).
For the second dataset (lice), we did not observe notable differences in any of our topological accuracy measures neither under the models selected by our criteria nor under GTR+ Γ.
All selected bestfit models as well as the GTR+ Γ model recovered the true topology with the same frequency (0.35). However, AIC performed much better than BIC in recovering the true generating model (0.48−0.49 vs. 0.02−0.14). For this dataset though, the recovery of the true model does not affect topological accuracy.
Discussion
We implemented, parallelized, and make available an efficient tool for determining the best nucleotide substitution model among all 203 possible models using the AIC, AICc, and BIC criteria. Apart from achieving our teaching goals, and beyond the computational speed of the tool, we addressed the question if nucleotide model selection matters topologically. To the best of our knowledge, this question has not been addressed using such a large collection of empirical datasets and taking into account all possible models before. Furthermore, we analyzed the topological differences between trees inferred under models proposed by the three standard criteria and assessed if these differences are significant. Finally, we also assessed to which extent the definition of the sample size in the criteria has an effect on the selected models and inferred tree topologies.
Conclusions
Overall, we find that, everything matters topologically. In particular, our experiments suggest that, selecting the bestscoring out of 203 nucleotide substitution model changes the final ML tree topology (compared to an inference under GTR +Γ) by over 10 for 5 % of the tree inferences we conducted. Thus, clearly, such a model test should be executed prior to starting tree inferences.
With respect to the observed topological differences induced by using distinct information criteria and sample size definitions, the effect is less pronounced since it affects a smaller fraction of datasets. Nonetheless, in some cases we did obtain substantial topological differences. However, whether one should use AIC, AICc, and BIC and how one should define the sample size is still subject to ongoing statistical debates. Thus, we are not in a position to make a clear statement here. What we can state though, is that, the statistical discussions are justified based on the differences we observed.
Code & data availability
The code, test datasets, and results are available under GNU GPL at https://github.com/teampltb.
Endnotes
^{1}A lower score indicates a better model
^{2}intercommunicator reduce operation
^{3}Wall Wextra Wredundantdecls Wswitchdefault Wimport Wnointtopointercast Wbadfunctioncast Wmissingdeclarations Wmissingprototypes Wnestedexterns Wstrictprototypes Wformatnonliteral Wundef
^{4}Weverything pedantic
Declarations
Acknowledgments
We wish to thank John Huelsenbeck for providing the test datasets from the original paper, Andre J. Aberer for providing additional test datasets, and Lucas Czech for implementing an initial prototype version for the task to verify if it was ‘doable’.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License(http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
Authors’ Affiliations
References
 Felsenstein J. Evolutionary trees from dna sequences: a maximum likelihood approach. J mol evol. 1981; 17(6):368–376.View ArticlePubMedGoogle Scholar
 Jukes TH, Cantor CR. Evolution of protein molecules. Mamm protein metab. 1969; 3:21–132.View ArticleGoogle Scholar
 Tavaré S. Some probabilistic and statistical problems in the analysis of dna sequences. Lect math life sci. 1986; 17:57–86.Google Scholar
 Hasegawa M, Kishino H, Yano Ta. Dating of the humanape splitting by a molecular clock of mitochondrial dna. J mole evol. 1985; 22(2):160–174.View ArticleGoogle Scholar
 Huelsenbeck JP, Larget B, Alfaro ME. Bayesian phylogenetic model selection using reversible jump markov chain monte carlo. Mol Biol Evol. 2004; 21(6):1123–1133.2012. doi:10.1093/molbev/msh123. http://mbe.oxfordjournals.org/content/21/6/1123.full.pdf+html.View ArticlePubMedGoogle Scholar
 Ronquist F, Teslenko M, van der Mark P, Ayres DL, Darling A, Höhna S, Larget B, Liu L, Suchard MA, Huelsenbeck JP. Mrbayes 3.2: efficient bayesian phylogenetic inference and model choice across a large model space. Syst biol; 61(3):539–542.Google Scholar
 Darriba D, Taboada GL, Doallo R, Posada D. jmodeltest 2: more models, new heuristics and parallel computing. Nature methods. 2012; 9(8):772–2.View ArticlePubMedPubMed CentralGoogle Scholar
 Guindon S, Dufayard JF, Lefort V, Anisimova M, Hordijk W, Gascuel O. New algorithms and methods to estimate maximumlikelihood phylogenies: assessing the performance of phyml 3.0. Syst biol. 2010; 59(3):307–321.View ArticlePubMedGoogle Scholar
 Stamatakis A. Raxml version 8: a tool for phylogenetic analysis and postanalysis of large phylogenies. Bioinforma. 2014; 30(9):1312–1313.View ArticleGoogle Scholar
 Flouri T, IzquierdoCarrasco F, Darriba D, Aberer AJ, Nguyen LT, Minh BQ, Stamatakis A. The phylogenetic likelihood library. Syst Biol. 2015; 64(2):356–362.View ArticlePubMedPubMed CentralGoogle Scholar
 Grimm GW, Renner SS, Stamatakis A, Hemleben V. A nuclear ribosomal dna phylogeny of acer inferred with maximum likelihood, splits graphs, and motif analysis of 606 sequences. Evol Bioinforma Online. 2006; 2:7.Google Scholar
 Yang Z. Maximum likelihood phylogenetic estimation from dna sequences with variable rates over sites: approximate methods. J Mole Evol. 1994; 39(3):306–314.View ArticleGoogle Scholar
 Ripplinger J, Sullivan J. Does choice in model selection affect maximum likelihood analysis?Syst Biol. 2008; 57(1):76–85.View ArticlePubMedGoogle Scholar
 Huelsenbeck JP, Crandall KA. Phylogeny estimation and hypothesis testing using maximum likelihood. Annu Rev Ecol Syst. 1997; 28:437–466.View ArticleGoogle Scholar
 Posada D, Buckley TR. Model selection and model averaging in phylogenetics: advantages of akaike information criterion and bayesian approaches over likelihood ratio tests. Syst biol. 2004; 53(5):793–808.View ArticlePubMedGoogle Scholar
 Lakner C, van der Mark P, Huelsenbeck JP, Larget B, Ronquist F. Efficiency of markov chain monte carlo tree proposals in bayesian phylogenetics. Syst Biol. 2008; 57(1):86–103. doi:10.1080/10635150801886156. http://sysbio.oxfordjournals.org/content/57/1/86.full.pdf+html.View ArticlePubMedGoogle Scholar
 Robinson D, Foulds LR. Comparison of phylogenetic trees. Math Biosci. 1981; 53(1):131–47.View ArticleGoogle Scholar
 Fletcher W, Yang Z. INDELible: A flexible simulator of biological sequence evolution. Mole Biol Evol. 2009; 26(8):1879–1888. doi:10.1093/molbev/msp098.View ArticleGoogle Scholar