Does the choice of nucleotide substitution models matter topologically?

Background: In the context of a master-level programming practical at the computer science department of the Karlsruhe Institute of Technology, we developed and make available open-source code for testing all 203 possible nucleotide substitution models in the Maximum Likelihood (ML) setting under the common Akaike, corrected Akaike, and Bayesian information criteria. We address the question of whether model selection matters topologically, that is, whether conducting ML inferences under the optimal model, instead of a standard General Time Reversible (GTR) model, yields different tree topologies. We also assess to what degree the models selected and the trees inferred under the three standard criteria (AIC, AICc, BIC) differ. Finally, we assess whether the definition of the sample size (#sites versus #sites × #taxa) yields different models and, as a consequence, different tree topologies.
Results: We find that all three factors (in order of impact: nucleotide model selection, information criterion used, sample size definition) can yield substantially different final tree topologies (topological difference exceeding 10 %) for approximately 5 % of the tree inferences conducted on the 39 empirical datasets used in our study.
Conclusions: We find that using the best-fit nucleotide substitution model may change the final ML tree topology compared to an inference under a default GTR model. The effect is less pronounced when comparing distinct information criteria. Nonetheless, in some cases we did obtain substantial topological differences.
Electronic supplementary material: The online version of this article (doi:10.1186/s12859-016-0985-x) contains supplementary material, which is available to authorized users.


Teaching Perspective, Goals and Course Outline
Courses at the Master level in our computer science department are organized in so-called modules over two semesters. In the first semester of the Bioinformatics module, we teach a lecture called "Introduction to Bioinformatics for Computer Scientists", since KIT does not offer a Bioinformatics degree. This lecture covers basic topics such as an introduction to molecular biology, pair-wise sequence alignment, BLAST, de novo and by-reference sequence assembly, multiple sequence alignment, phylogenetic inference, MCMC methods, and population genetics.
In the second semester of the module, students can choose whether to give a seminar presentation or to do the programming practical whose results we describe here. The goal of the practical is to carry out a self-contained project and to write and release software that is useful to the evolutionary biology community. Another focus is on learning to use tools that enhance software quality. Note that at a CS department, designing "classic" bioinformatics analysis pipelines using scripting languages is typically not considered "real programming" by the students. Hence, we need to define a project that requires coding in C/C++ or Java. One should also strive to avoid having the students extend existing software, since this is generally frustrating and hinders creativity.
* Correspondence: Alexandros.Stamatakis@h-its.org
We thus decided to essentially re-implement the paper on Bayesian model selection [1], but in an ML framework. This project allows students to apply a broad range of skills acquired in the Bioinformatics module and in other master-level modules at our department. Initially, students need to read and understand the original paper. Then, they can use their algorithmic knowledge and training to design an algorithm that correctly enumerates all possible time-reversible substitution models. Subsequently, they can use the Phylogenetic Likelihood Library (PLL) to carry out the model tests. Note that implementing the likelihood function efficiently and in a numerically stable way is a tedious task that requires comprehensive background knowledge and experience. This cannot be accomplished by students during a single semester. Hence, the PLL lends itself to such practicals, because students learn how to use a scientific high-performance library. In doing so, they can also apply their knowledge of phylogenetic likelihood models and of the AIC, AICc, and BIC criteria presented in the lectures. Because we ask students to parallelize the code via a simple master-worker approach, they can also deploy their parallel computing knowledge (from other modules) and use MPI (Message Passing Interface) in practice.
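The enumeration step mentioned above can be sketched as follows. This is a minimal illustration in Python, not the code written during the practical. It assumes that each time-reversible model corresponds to one way of partitioning the six substitution rates (AC, AG, AT, CG, CT, GT) into classes of equal rates, and it encodes each partition as a restricted growth string. The names `enumerate_models` and `free_rate_parameters` are hypothetical, and the parameter count assumes the common convention that one rate class serves as a fixed reference.

```python
from typing import Iterator, List, Tuple

# The six exchangeability rates of a time-reversible nucleotide model.
RATE_NAMES = ("AC", "AG", "AT", "CG", "CT", "GT")


def enumerate_models(n_rates: int = 6) -> Iterator[Tuple[int, ...]]:
    """Yield every partition of the rates as a restricted growth string.

    Position i holds the class label of rate i. The first label is 0 and
    each later label is at most one larger than the largest label so far,
    so every partition is produced exactly once.
    """

    def extend(prefix: List[int], max_label: int) -> Iterator[Tuple[int, ...]]:
        if len(prefix) == n_rates:
            yield tuple(prefix)
            return
        # Either reuse an existing class (0..max_label) or open a new one.
        for label in range(max_label + 2):
            yield from extend(prefix + [label], max(max_label, label))

    yield from extend([0], 0)


def free_rate_parameters(model: Tuple[int, ...]) -> int:
    """Number of free rates, assuming one rate class is the fixed reference."""
    return max(model)  # (number of classes) - 1


models = list(enumerate_models())
print(len(models))   # 203
print(models[0])     # (0, 0, 0, 0, 0, 0): all rates equal
print(models[-1])    # (0, 1, 2, 3, 4, 5): all rates distinct, as in GTR
```

The number of such partitions is the Bell number B(6) = 203, which matches the number of models tested. In a master-worker parallelization, the resulting list could simply be handed out to the workers one model at a time.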
Finally, we also want to encourage critical thinking. Thus, we ask whether such extensive model testing is actually required. In other words, we wanted to assess whether using the best model for an ML tree search (as implemented in the PLL) induces substantial differences in the final tree topologies compared to an inference that simply relies on GTR. In these tests the students are also given the opportunity to calculate the Robinson-Foulds (RF [2]) distances between trees, which were covered in the lectures. Another question we assess is how different ways of incorporating the sample size of the two-dimensional sample (the #taxa and #sites in the multiple sequence alignment) into the AIC and BIC criteria affect the model selection process.
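To make the sample size question concrete, the following sketch computes the three criteria from their standard textbook definitions and compares the two sample size definitions (#sites versus #sites × #taxa). It is an illustration only: the function names, the two candidate models, and their log-likelihoods and parameter counts are made up and are not results from this study. Note that plain AIC does not contain the sample size, so in this sketch only AICc and BIC react to its definition.

```python
import math


def sample_size(n_sites: int, n_taxa: int, per_taxon: bool) -> int:
    """Return #sites, or #sites x #taxa when per_taxon is True."""
    return n_sites * n_taxa if per_taxon else n_sites


def aic(lnl: float, k: int, n: int = 0) -> float:
    """AIC = 2k - 2 lnL (n is accepted for a uniform interface but ignored)."""
    return 2.0 * k - 2.0 * lnl


def aicc(lnl: float, k: int, n: int) -> float:
    """AICc = AIC + 2k(k + 1) / (n - k - 1)."""
    if n - k - 1 <= 0:
        raise ValueError("AICc is undefined unless n > k + 1")
    return aic(lnl, k) + 2.0 * k * (k + 1) / (n - k - 1)


def bic(lnl: float, k: int, n: int) -> float:
    """BIC = k ln(n) - 2 lnL."""
    return k * math.log(n) - 2.0 * lnl


def best_model(candidates: dict, criterion, n: int) -> str:
    """Return the name of the candidate with the lowest criterion score."""
    scores = {name: criterion(lnl, k, n) for name, (lnl, k) in candidates.items()}
    return min(scores, key=scores.get)


# Hypothetical candidates: name -> (log-likelihood, number of free parameters).
CANDIDATES = {"simple": (-12020.0, 40), "rich": (-12000.0, 45)}
N_SITES, N_TAXA = 1000, 20  # hypothetical alignment dimensions

for per_taxon in (False, True):
    n = sample_size(N_SITES, N_TAXA, per_taxon)
    print(n, best_model(CANDIDATES, aicc, n), best_model(CANDIDATES, bic, n))
```

With these made-up numbers, BIC prefers the richer model for n = 1000 and the simpler model for n = 20000, while AICc prefers the richer model in both cases. This shows how the sample size definition alone can change the selected model.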
In terms of project documentation, students are usually required to write a report. However, in the present case, we jointly took the decision to write a paper about the practical. This has the positive effect that students also learn how to write scientific papers.

Teaching Conclusions
In the following we outline our subjective perception of the teaching outcome of this practical from the student and teacher point of view.

Student View
From the student's perspective, the practical was well-organized and always took place in a very constructive, stress-free atmosphere. Prior to the practical, Alexis made sure that the task was feasible by asking one of his lab members to implement a proof-of-concept solution. This way, we were sure that the task at hand was doable. We also had a responsive advisor (Diego Darriba) for asking implementation questions regarding our program and the usage of the PLL.
The main challenge was to become familiar with the PLL. This scientific library has a plethora of features and covers a broad range of different application scenarios. It therefore took a while until the first model evaluation on a tree was successful. During this time we met Alexis every week to discuss our latest achievements and issues.
The generalization towards testing all models was then finished quickly and we could start with the parallelization of the code. Our main challenge for the parallelization was to decide which data to communicate and how to design the application in an understandable and reusable manner.
Towards the end of the practical we focused on testing the models. We wrote several scripts to automatically execute our program on each test dataset. As soon as the results were calculated, we built scripts to retrieve and visualize the data for this paper.

Teacher View
This was the first programming practical I carried out at KIT. The teaching experience was generally very positive because, rather than a completely constructed programming exercise, we worked on something interesting that I had been wondering about since listening to a talk on the topic given by John Huelsenbeck. Furthermore, the students were highly motivated and the group was small, which allowed for close interactions and good supervision. The students were also very enthusiastic about trying to write a paper instead of a boring report, despite the fact that this induced a considerable amount of extra work that extended well into the following semester.
Thus, we will continue running the practical based on this scheme. This semester's task is to implement a numerically stable and highly optimized (using SSE3 and AVX vector intrinsics as well as cache optimization techniques) version of the TKF91 [3] statistical alignment kernel from scratch.