P*R*O*P: a web application to perform phylogenetic analysis considering the effect of gaps

Background: Phylogenetic analysis strongly depends on evolutionary models. Most evolutionary models for estimating genetic differences and phylogenetic relationships do not treat gap sites in the alignment of sequences. Appropriately incorporating evolutionary information of sites containing insertions and deletions into genetic difference measures will improve the accuracy of phylogenetic estimates.
Results: We introduced a new measure for estimating genetic differences, and presented P*R*O*P, a web application for performing phylogenetic analysis based on genetic difference considering the effect of gaps. As an example of phylogenetic analysis using P*R*O*P, we used complete p53 amino acid sequences of 31 organisms and illustrated that the genetic differences with and without information on sites containing gaps result in trees with different topologies.
Conclusions: P*R*O*P is available at https://www.rs.tus.ac.jp/bioinformatics/prop and the user can perform phylogenetic analysis by uploading sequence data on the website. The most distinctive feature of P*R*O*P is that its genetic difference is estimated without eliminating gap sites from the aligned sequences, which helps users detect meaningful differences in an evolutionary process. The source code is available on GitHub: https://github.com/TUS-Satolab/PROP.

(K2P) model [2], called K2P + Gap, to incorporate the evolutionary information of sites containing insertions and deletions into the measure for estimating genetic difference between two nucleotide sequences.
Here, we incorporate this idea into the Jukes-Cantor (JC) method [3] for amino acid sequences, and present P*R*O*P (Phylogenetic Relationships based On Proper genetic differences), a web application for performing phylogenetic analysis based on genetic difference considering the effect of gaps in both nucleotide sequences and amino acid sequences. Unlike software packages for phylogenetic analysis such as MEGA [4] and PAUP* [5], P*R*O*P is a web application, so the user can perform phylogenetic analysis simply by preparing sequence data in FASTA format, without downloading or installing any software.

An extension of the JC model
The JC measure for estimating the genetic difference between two nucleotide sequences, in terms of the number of nucleotide substitutions per site, is estimated by

$$d = -a \ln\left(1 - \frac{P}{a}\right), \tag{1}$$

where P is the probability of homologous sites that are different between the two sequences and a = 3/4. With a = 19/20, Eq. (1) can be used for amino acid sequences [10].
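As a minimal sketch of Eq. (1), assuming the standard JC form d = -a ln(1 - P/a), the function below evaluates the measure for both settings of a. The function name `jc_distance` and the example value P = 0.10 are illustrative choices, not part of P*R*O*P.

```python
import math


def jc_distance(p, a=3 / 4):
    """Jukes-Cantor difference d = -a * ln(1 - p / a).

    p: observed fraction of differing homologous sites.
    a: 3/4 for nucleotides, 19/20 for amino acids.
    """
    if not 0.0 <= p < a:
        # The logarithm is undefined once p reaches the saturation level a.
        raise ValueError("p must satisfy 0 <= p < a")
    return -a * math.log(1.0 - p / a)


# Same observed difference, two alphabets (illustrative value only).
d_nuc = jc_distance(0.10)             # nucleotides, a = 3/4
d_aa = jc_distance(0.10, a=19 / 20)   # amino acids, a = 19/20
```

For the same P, the correction is smaller with a = 19/20 than with a = 3/4, because saturation is reached at a higher observed difference in the 20-letter alphabet.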
We extend the JC model to estimate genetic differences that take gap information into account for aligned amino acid sequences. The idea is the same as that for the K2P + Gap difference measure introduced in a previous paper [1]. All amino acid substitutions are assumed to occur at the same rate α per site per unit time (year). In addition, when each of the twenty amino acids changes to a gap at an equal rate, the rate of deletions per site per unit time is ε. Likewise, assuming that a gap changes to each of the twenty amino acids at an equal rate of ε/20 per site per unit time, the rate of insertions (i.e., change of a gap to any of the twenty amino acids) per site per unit time is ε. Therefore, the total rate of amino acid changes per site per unit time, k, is given by a mixture (Eq. (2)) in which w is the mixture weight, that is, the probability of amino acid occurrence between two aligned homologous sequences.
Under this model, our measure (JC + Gap) for estimating the genetic difference between two amino acid sequences, in terms of the number of amino acid changes per site that occurred during t years, is given by Eq. (3). As described above, w in this equation is the occurrence probability of amino acids in the two sequences compared. P and S are the probabilities of homologous sites showing different amino acids and showing identical amino acids, respectively. If no gaps exist in the two sequences compared (namely w = 1, so that P + S = 1), Eq. (3) reduces to Eq. (1).
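The quantities P and S above are site-pattern frequencies that can be tabulated directly from a pair of aligned sequences. The sketch below does only that tabulation; it does not implement the JC + Gap measure itself. The function name `site_proportions`, the gap symbol `-`, the choice to skip columns that are gapped in both sequences, and the toy sequences are all assumptions made for illustration.

```python
GAP = "-"  # assumed gap symbol in the alignment


def site_proportions(seq1, seq2):
    """Return (P, S, G) for two aligned sequences.

    P: fraction of sites where both sequences have residues that differ.
    S: fraction of sites where both sequences have identical residues.
    G: fraction of sites where exactly one sequence has a gap.
    """
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    differ = same = gapped = 0
    for x, y in zip(seq1, seq2):
        if x == GAP and y == GAP:
            # Assumption: a column gapped in both sequences is ignored.
            continue
        if x == GAP or y == GAP:
            gapped += 1
        elif x == y:
            same += 1
        else:
            differ += 1
    n = differ + same + gapped
    if n == 0:
        raise ValueError("no comparable sites")
    return differ / n, same / n, gapped / n


# Toy aligned pair (invented for illustration).
P, S, G = site_proportions("MEEPQSD-LS", "MEESQ-DPLS")
```

In this toy pair, P + S is less than 1 because two of the ten sites contain a gap in one sequence; with gap-free sequences G is zero and P + S = 1, which matches the limiting case stated in the text.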

Simulation analysis
To evaluate the performance of the difference measure in our model (JC + Gap), we used computer simulation to compare the accuracy of phylogenetic reconstruction between the JC + Gap difference measure and the JC difference measure. We used 60 model conditions (five numbers of taxa, four sequence lengths, and three change rates), following a previous paper [1]. The probability of amino acid substitutions was fixed at 0.01 per site per branch, and the probabilities of insertion and deletion changes were set to 0.001, 0.002, or 0.005 per site per branch. One hundred replications were performed for each model condition. The sequence data corresponding to the leaf nodes on each perfect binary tree were given as input to the phylogenetic reconstruction. For each data set, the JC genetic differences with complete deletion of gaps, the JC genetic differences with pairwise deletion of gaps, and the JC + Gap genetic differences were estimated, and phylogenetic trees were reconstructed from each using the NJ method (see [1] for more details).
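The two conventional gap treatments compared above can be sketched as follows, assuming their usual definitions: complete deletion discards every alignment column in which any sequence has a gap, whereas pairwise deletion discards only the columns gapped in either of the two sequences being compared. The function names, the gap symbol `-`, and the three toy sequences are hypothetical, and the NJ step is omitted.

```python
import math

GAP = "-"  # assumed gap symbol in the alignment


def jc_aa(p, a=19 / 20):
    """JC difference for amino acids: d = -a * ln(1 - p / a)."""
    return -a * math.log(1.0 - p / a)


def p_distance(seq1, seq2, keep):
    """Fraction of differing residues over the column indices in keep."""
    if not keep:
        raise ValueError("no comparable sites")
    return sum(seq1[i] != seq2[i] for i in keep) / len(keep)


def jc_matrix(alignment, mode):
    """Pairwise JC differences under 'complete' or 'pairwise' deletion."""
    names = list(alignment)
    length = len(alignment[names[0]])
    # Complete deletion: columns that are gap-free in every sequence.
    gap_free = [i for i in range(length)
                if all(seq[i] != GAP for seq in alignment.values())]
    dist = {}
    for idx, first in enumerate(names):
        for second in names[idx + 1:]:
            s1, s2 = alignment[first], alignment[second]
            if mode == "complete":
                keep = gap_free
            elif mode == "pairwise":
                # Pairwise deletion: columns gap-free in this pair only.
                keep = [i for i in range(length)
                        if s1[i] != GAP and s2[i] != GAP]
            else:
                raise ValueError("mode must be 'complete' or 'pairwise'")
            dist[(first, second)] = jc_aa(p_distance(s1, s2, keep))
    return dist


# Toy alignment (invented for illustration).
toy = {"A": "MEEPQSDLSI", "B": "MEEPQ-DLSI", "C": "MKEPQSD-SV"}
complete = jc_matrix(toy, "complete")
pairwise = jc_matrix(toy, "pairwise")
```

In the toy alignment, the A versus C comparison uses eight columns under complete deletion but nine under pairwise deletion, so the two treatments give different estimates for the same pair. Neither treatment uses the gapped columns themselves, which is the information the JC + Gap measure is designed to retain.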

Accuracy of phylogenetic reconstruction
The average percentage of correctly reconstructed topologies in data sets for all 60 model conditions was 46.1% when calculated with the JC difference measure (complete deletion), 64.2% when calculated with the JC difference measure (pairwise deletion) and 73.3% when calculated with the JC + Gap difference measure (Fig. 1). In case the probabilities of insertion and deletion changes were 0.001, the average accuracy for the JC difference measure (complete deletion), the JC difference measure (pairwise deletion) and the JC +

Phylogenetic analysis
We introduced a new measure for estimating genetic differences, and presented P*R*O*P, a web application for performing phylogenetic analysis based on genetic difference considering the effect of gaps. Here, we use the amino acid sequences of cellular tumor antigen p53 as an example to illustrate the effect of different treatments of gaps in phylogenetic analysis using P*R*O*P. Complete p53 amino acid sequences of 31 organisms were retrieved from the UniProtKB/Swiss-Prot database (https://www.uniprot.org/uniprot/). These 31 sequences, with lengths ranging from 352 to 396 amino acids, were aligned with MAFFT, and the genetic differences were calculated using the JC measure under each of the three treatments of gaps (+Gap/Pairwise Deletion/Complete Deletion). For each of the three cases, the phylogenetic tree was generated with the NJ method, and the resulting Newick tree file was plotted and edited in FigTree (Version 1.4.4), developed by Andrew Rambaut. The p53 sequences were grouped according to their class (Actinopterygii, Amphibia, Aves, and Mammalia) in all three trees; however, for the class Mammalia, the tree based on the JC + Gap difference measure had a different topology from the other two trees (Fig. 2). In the tree based on the JC + Gap difference measure, the two subtrees rooted at the sibling nodes of the same internal node correspond to the two clades Euarchontoglires and Laurasiatheria, respectively.
Many studies support the view that the superorder Euarchontoglires and the superorder Laurasiatheria are sister taxa [11][12][13][14][15]. The result with the JC + Gap difference measure in our analysis is consistent with these studies.

Fig. 2 Phylogenetic trees of the p53 amino acid sequences based on the treatment of gaps "+Gap", "Pairwise Deletion" and "Complete Deletion". Each tree was generated using P*R*O*P and was midpoint rooted using FigTree. Organism species are colored as follows: Actinopterygii (9 species), black; Amphibia (1 species), purple; Aves (1 species), green; Mammalia Euarchontoglires (13 species), red; Mammalia Laurasiatheria (7 species), blue

Conclusions
P*R*O*P is a web application for performing phylogenetic analysis based on genetic difference considering the effect of gaps. The user can perform phylogenetic analysis by uploading sequence data in FASTA format. The most distinctive feature of P*R*O*P is that its genetic difference is estimated without eliminating gap sites from the aligned sequences, which helps users detect meaningful differences in an evolutionary process and obtain a more accurate classification. The front-end is implemented in JavaScript using the Angular framework. The back-end is implemented in Python and is deployed on the Amazon Elastic Compute Cloud (Amazon EC2). P*R*O*P is available at https://www.rs.tus.ac.jp/bioinformatics/prop. We will continue to update P*R*O*P by adding further information, improving the implementation, and incorporating new measures for estimating genetic differences. The user can always access the latest version of P*R*O*P.