FastMG: a simple, fast, and accurate maximum likelihood procedure to estimate amino acid replacement rate matrices from large data sets
 Cuong Cao Dang^{1},
 Vinh Sy Le^{1},
 Olivier Gascuel^{2},
 Bart Hazes^{3} and
 Quang Si Le^{4}Email author
https://doi.org/10.1186/1471210515341
© Dang et al.; licensee BioMed Central Ltd. 2014
Received: 13 May 2014
Accepted: 29 September 2014
Published: 24 October 2014
Abstract
Background
Amino acid replacement rate matrices are a crucial component of many protein analysis systems such as sequence similarity search, sequence alignment, and phylogenetic inference. Ideally, the rate matrix reflects the mutational behavior of the actual data under study; however, estimating amino acid replacement rate matrices requires large protein alignments and is computationally expensive and complex. As a compromise, suboptimal precalculated generic matrices are typically used for proteinbased phylogeny. Sequence availability has now grown to a point where problemspecific rate matrices can often be calculated if the computational cost can be controlled.
Results
The most time consuming step in estimating rate matrices by maximum likelihood is building maximum likelihood phylogenetic trees from protein alignments. We propose a new procedure, called FastMG, to overcome this obstacle. The key innovation is the alignmentsplitting algorithm that splits alignments with many sequences into nonoverlapping subalignments prior to estimating amino acid replacement rates. Experiments with different large data sets showed that the FastMG procedure was an order of magnitude faster than without splitting. Importantly, there was no apparent loss in matrix quality if an appropriate splitting procedure is used.
Conclusions
FastMG is a simple, fast and accurate procedure to estimate amino acid replacement rate matrices from large data sets. It enables researchers to study the evolutionary relationships for specific groups of proteins or taxa with optimized, dataspecific amino acid replacement rate matrices. The programs, data sets, and the new mammalian mitochondrial protein rate matrix are available at http://fastmg.codeplex.com.
Keywords
Amino acid replacement rate matrices Maximum likelihood methods Phylogenetic trees Protein alignments Large data setsBackground
Amino acid replacement rate matrices represent the estimates of instantaneous substitution rates between amino acids. The rates simultaneously capture aspects of DNAlevel mutation, the genetic code and proteinlevel selection strength, which varies based on similarity in chemical and physical properties. For example, we usually observe a high substitution rate between lysine and arginine (both positively charged and polar) and a low substitution rate between lysine and cysteine (neutral and nonpolar). Ideally, the replacement rate matrix parameters are optimized against the data under study, but in practice information content in typical sequence alignments is insufficient to do so. Instead, a small number of generic matrices are made available to researchers.
Amino acid replacement rate matrices are essential for many protein analyses, including estimating pairwise distances between protein sequences, or reconstructing protein phylogenetic trees using maximum likelihood or Bayesian frameworks [1, 2] and references therein. Amino acid replacement rate matrices can also be converted into score matrices for protein sequence alignment. Roles and applications of amino acid replacement rate matrices were summarized by Thorne [3].
A number of methods have been proposed to estimate the matrices from protein alignments since the time of Dayhoff [4]. These methods belong to either counting or maximum likelihood approaches. The counting methods are fast, but they are limited to only pairwise protein alignments and closely related amino acid sequences [4, 5]. The maximum likelihood methods have been designed to fully utilize the information contained in multiple protein alignments and the corresponding phylogenetic trees which must be estimated from the data [6–8]. This assumes that the trees are correct, which can be ameliorated by a Bayesian analysis over a set of plausible trees but this would increase an already large computational burden.
With the rapid rise in whole genome sequencing it is now increasingly common to have access to both large numbers of taxa and long concatenated sequence alignments. This creates the opportunity to estimate dataspecific amino acid replacement matrices but also requires efficient computational methods because estimating amino acid replacement rate matrices from protein alignments by maximum likelihood methods is a complex and timeconsuming process [7, 9, 10] and references therein.
Recently, a fully automated maximum likelihood estimation procedure was proposed and used to estimate matrices from different data sets [8, 10, 11]. It consists of two main steps: building maximum likelihood phylogenetic trees and estimating parameters based on the information contained in multiple protein alignments and the corresponding phylogenetic trees. Building maximum likelihood trees is itself a difficult problem because the number of possible trees increases exponentially with the number of sequences in the alignment [1, 12]. A number of maximum likelihood tree search heuristics have been developed to reduce the computational burden [13–16]; however, building maximum likelihood trees is still the most time consuming step in the estimation process. For example, in this study it took 98% (319 out of 324 hours) when estimating the amino acid replacement rate matrix from 100 large alignments in the HSSP (homologyderived structures of proteins) database [17].
In this paper, we propose a new maximum likelihood estimation procedure, FastMG, to work with large data sets. The key idea is to split large alignments into multiple nonoverlapping subalignments with fewer sequences (each subalignment contains at most k sequences) in order to substantially reduce the computational burden of building maximum likelihood trees. The matrices are then estimated from the joint maximum likelihood analysis of the smaller subalignments instead of from the large original alignments.
A preliminary study showed that the splitting strategy significantly decreased the running time of the estimation procedure [9]. Here we demonstrate that a naïve random splitting method compromises the quality of the results. In contrast, our “treebased splitting method” selects subtrees that retain enough information to estimate accurate amino acid replacement rates. Experiments with different large data sets showed that the FastMG procedure yields similar quality matrices in much less time than the standard estimation procedure.
Results and discussion
We assessed the performance of the FastMG procedure on three large data sets: HSSP data set [17], Pfam data set [18], and our concatenated protein alignment of mitochondrial proteins from 299 mammalian species with 3062 amino acid sites. The FastMG procedure was examined with the random splitting algorithm, the treebased splitting algorithm, and different k values targeting subalignment sizes of 8, 16, 24, 32, and 64 sequences. All matrices were estimated on a personal computer (Intel 2.66 GHz, 8 GB RAM). The PhyML software version 3.0 [14] was used to build maximum likelihood trees from alignments with options: 4 gamma categories model, no invariant sites, no bootstrap, SPR tree improvement, and JTT model. The XRATE software version 0.2 [19] was used to estimate model parameters using information in the alignments and corresponding phylogenetic trees. Let us denote:

FastMG^{R}: The FastMG estimation procedure with the random splitting algorithm.

FastMG^{T}: The FastMG estimation procedure with the treebased splitting algorithm.

: Replacement rate matrix estimated from data set M using the FastMG^{R} procedure and threshold k (e.g., ${\mathrm{HSSP}}_{8}^{\mathrm{R}}$ is the matrix estimated by the FastMG^{R} procedure with k = 8).${\mathrm{M}}_{k}^{\mathrm{R}}$

: Replacement rate matrix estimated from data set M using the FastMG^{T} procedure and threshold k (e.g., ${\mathrm{HSSP}}_{8}^{\mathrm{T}}$ is the matrix estimated by the FastMG^{T} procedure with k = 8).${\mathrm{M}}_{k}^{\mathrm{T}}$

M^{s}: Replacement rate matrix estimated from data set M using the standard maximum likelihood estimation procedure (e.g., HSSP^{S} is the matrix estimated by the HSSP data set using the standard estimation procedure).
We compared ${\mathrm{M}}_{\mathrm{k}}^{\mathrm{R}},\phantom{\rule{0.5em}{0ex}}{\mathrm{M}}_{\mathrm{k}}^{\mathrm{T}},\phantom{\rule{0.25em}{0ex}}\mathrm{and}\phantom{\rule{0.25em}{0ex}}{\mathrm{M}}^{\mathrm{s}}$ matrices in terms of both quality and running time. To avoid model bias due to overfitting, each data set consisted of two alignment sets: the training alignment set and the testing alignment set. The matrices were estimated from alignments in the training set and subsequently used to build maximum likelihood trees for alignments in the testing set. Likelihood scores for test set alignments were used to compare the quality of different matrices as used in other studies [7, 8, 11]. Moreover, we used the KishinoHasegawa test [20] to assess the statistical significance of the difference between two matrices as used in previous studies [8, 11].
HSSP data set
We selected 400 alignments from the HSSP database to assess the performance of estimation procedures. The 100 alignments with the largest number of sequences were used as the training alignment set for estimating matrices while the other 300 alignments were used as the testing alignment set. The 100 training alignments contained between 140 and 285 sequences (the mean and max pairwise distances between sequences were 0.481 and 1.286 respectively), for a total of 18325 sequences. The 300 testing alignments contained between 10 and 100 sequences (the mean and max pairwise distances between sequences were 0.635 and 1.846, respectively), for a total of 12854 sequences.
The Pearson correlations between the HSSP ^{ S } matrix and other matrices estimated by the FastMG procedure
Matrices  Frequencies  Exchangeability matrix 

${\mathrm{HSSP}}_{8}^{\mathrm{R}}/{\mathrm{HSSP}}_{8}^{\mathrm{T}}$  0.992/0.986  0.989/0.991 
${\mathrm{HSSP}}_{16}^{\mathrm{R}}/{\mathrm{HSSP}}_{16}^{\mathrm{T}}$  0.996/0.993  0.992/0.996 
${\mathrm{HSSP}}_{24}^{\mathrm{R}}/{\mathrm{HSSP}}_{24}^{\mathrm{T}}$  0.998/0.995  0.994/0.997 
${\mathrm{HSSP}}_{32}^{\mathrm{R}}/{\mathrm{HSSP}}_{32}^{\mathrm{T}}$  0.998/0.997  0.995/0.998 
${\mathrm{HSSP}}_{64}^{\mathrm{R}}/{\mathrm{HSSP}}_{64}^{\mathrm{T}}$  0.999/0.999  0.997/0.999 
Pairwise comparisons between HSSP ^{ S } and other matrices estimated from the FastMG procedure
M1  M2  LogLK (M1M2)  M1 > M2 (#TP)  M1 < M2 (#TP)  #M1 > M2 (#TP)  #M1 < M2 (#TP) 

HSSP^{S}  ${\mathrm{HSSP}}_{8}^{\mathrm{R}}$  0.02  201 (130)  99 (65)  66 (27)  12 (4) 
HSSP^{S}  ${\mathrm{HSSP}}_{16}^{\mathrm{R}}$  0.01  208 (126)  92 (64)  69 (26)  10 (4) 
HSSP^{S}  ${\mathrm{HSSP}}_{24}^{\mathrm{R}}$  0.01  203 (114)  97 (72)  72 (26)  9 (5) 
HSSP^{S}  ${\mathrm{HSSP}}_{32}^{\mathrm{R}}$  0.01  206 (119)  94 (59)  69 (24)  11 (4) 
HSSP^{S}  ${\mathrm{HSSP}}_{64}^{\mathrm{R}}$  0.01  200 (101)  100 (63)  73 (16)  11 (4) 
HSSP^{S}  ${\mathrm{HSSP}}_{8}^{\mathrm{T}}$  0.01  191 (124)  109 (75)  43 (19)  33 (15) 
HSSP^{S}  ${\mathrm{HSSP}}_{16}^{\mathrm{T}}$  0.00  152 (95)  148 (81)  26 (5)  34 (9) 
HSSP^{S}  ${\mathrm{HSSP}}_{24}^{\mathrm{T}}$  0.00  142 (88)  158 (89)  19 (4)  36 (11) 
HSSP^{S}  ${\mathrm{HSSP}}_{32}^{\mathrm{T}}$  0.00  132 (78)  168 (87)  18 (3)  32 (6) 
HSSP^{S}  ${\mathrm{HSSP}}_{64}^{\mathrm{T}}$  0.00  131 (72)  169 (80)  15 (4)  40 (4) 
The running time of the standard estimation procedure and the FastMG procedure with different splitting algorithms and k values for the HSSP data set
Matrices  Building trees (hours)  Estimating parameters (hours)  Total time (hours) 

${\mathrm{HSSP}}_{8}^{\mathrm{R}}/{\mathrm{HSSP}}_{8}^{\mathrm{T}}$  10.7/7.4  6.3/3.6  17.0/11.0 
${\mathrm{HSSP}}_{16}^{\mathrm{R}}/{\mathrm{HSSP}}_{16}^{\mathrm{T}}$  22.9/19.0  6.0/3.9  28.9/22.9 
${\mathrm{HSSP}}_{24}^{\mathrm{R}}/{\mathrm{HSSP}}_{24}^{\mathrm{T}}$  32.6/29.9  5.7/3.9  38.3/33.8 
${\mathrm{HSSP}}_{32}^{\mathrm{R}}/{\mathrm{HSSP}}_{32}^{\mathrm{T}}$  42.3/40.0  5.5/3.9  47.8/43.9 
${\mathrm{HSSP}}_{64}^{\mathrm{R}}/{\mathrm{HSSP}}_{64}^{\mathrm{T}}$  73.7/71.7  5.2/4.1  78.9/75.8 
HSSP^{S}  319.5  4.2  323.7 
Pfam data set
We also examined different estimation procedures on alignments from the Pfam database. The training alignment set contained the 100 largest alignments from the Pfam database with in total 7640 sequences and a range of 46 to 202 sequences per alignment (the mean and max pairwise distances between sequences were 1.341 and 62.428, respectively). The testing alignment set consisted of 480 other alignments with in total 5434 sequences and a range of 5 to 41 sequences per alignment (the mean and max pairwise distances between sequences were 1.174 and 63.096, respectively). Note that the Pfam alignments tended to contain fewer sequences than the HSSP alignments.
The Pearson correlations between the Pfam ^{ S } matrix and other matrices estimated from the FastMG procedure
Matrices  Frequencies  Exchangeability matrix 

${\mathrm{Pfam}}_{8}^{\mathrm{R}}/{\mathrm{Pfam}}_{8}^{\mathrm{T}}$  0.994/0.990  0.993/0.993 
${\mathrm{Pfam}}_{16}^{\mathrm{R}}/{\mathrm{Pfam}}_{16}^{\mathrm{T}}$  0.997/0.996  0.995/0.997 
${\mathrm{Pfam}}_{24}^{\mathrm{R}}/{\mathrm{Pfam}}_{24}^{\mathrm{T}}$  0.998/0.999  0.995/0.999 
${\mathrm{Pfam}}_{32}^{\mathrm{R}}/{\mathrm{Pfam}}_{32}^{\mathrm{T}}$  0.999/0.999  0.998/0.999 
${\mathrm{Pfam}}_{64}^{\mathrm{R}}/{\mathrm{Pfam}}_{64}^{\mathrm{T}}$  1.000/1.000  0.999/1.000 
Pairwise comparisons between the Pfam ^{ S } matrix and other matrices estimated by the FastMG procedure
M1  M2  LogLK (M1M2)  M1 > M2 (#TP)  M1 < M2 (#TP)  #M1 > M2 (#TP)  #M1 < M2 (#TP) 

Pfam^{S}  ${\mathrm{Pfam}}_{8}^{\mathrm{R}}$  0.01  299 (67)  181 (41)  119 (8)  55 (8) 
Pfam^{S}  ${\mathrm{Pfam}}_{16}^{\mathrm{R}}$  0.01  294 (54)  186 (35)  132 (6)  55 (3) 
Pfam^{S}  ${\mathrm{Pfam}}_{24}^{\mathrm{R}}$  0.01  309 (57)  171 (38)  142 (10)  51 (3) 
Pfam^{S}  ${\mathrm{Pfam}}_{32}^{\mathrm{R}}$  0.00  275 (40)  205 (35)  116 (6)  65 (2) 
Pfam^{S}  ${\mathrm{Pfam}}_{64}^{\mathrm{R}}$  0.00  279 (38)  201 (28)  117 (4)  64 (3) 
Pfam^{S}  ${\mathrm{Pfam}}_{8}^{\mathrm{T}}$  0.00  218 (54)  262 (64)  51 (3)  80 (8) 
Pfam^{S}  ${\mathrm{Pfam}}_{16}^{\mathrm{T}}$  0.00  190 (33)  290 (51)  41 (2)  104 (6) 
Pfam^{S}  ${\mathrm{Pfam}}_{24}^{\mathrm{T}}$  0.00  212 (33)  268 (39)  50 (2)  82 (1) 
Pfam^{S}  ${\mathrm{Pfam}}_{32}^{\mathrm{T}}$  0.00  233 (36)  247 (36)  58 (1)  72 (0) 
Pfam^{S}  ${\mathrm{Pfam}}_{64}^{\mathrm{T}}$  0.00  166 (21)  314 (31)  27 (0)  91 (2) 
The running time of the standard estimation procedure and the FastMG procedure with different splitting algorithms and k values for the Pfam data set
Matrices  Building trees (hours)  Estimating parameters (hours)  Total time (hours) 

${\mathrm{Pfam}}_{8}^{\mathrm{R}}/{\mathrm{Pfam}}_{8}^{\mathrm{T}}$  2.8/2.4  5.5/3.1  8.3/5.5 
${\mathrm{Pfam}}_{16}^{\mathrm{R}}/{\mathrm{Pfam}}_{16}^{\mathrm{T}}$  6.6/6.2  4.7/3.0  11.3/9.3 
${\mathrm{Pfam}}_{24}^{\mathrm{R}}/{\mathrm{Pfam}}_{24}^{\mathrm{T}}$  9.9/9.4  4.2/2.9  14.1/12.3 
${\mathrm{Pfam}}_{32}^{\mathrm{R}}/{\mathrm{Pfam}}_{32}^{\mathrm{T}}$  12.4/12.6  4.0/2.9  16.4/15.5 
${\mathrm{Pfam}}_{64}^{\mathrm{R}}/{\mathrm{Pfam}}_{64}^{\mathrm{T}}$  20.8/22.4  3.4/2.8  24.2/25.2 
Pfam^{S}  35.8  2.9  38.7 
Mammalian protein alignment
Our mammalian protein alignment (Mam) consists of the concatenated 12 mitochondrial proteins (all except ND6) of 299 mammalian species and 3602 amino acid sites. The mean and max pairwise distances between sequences were 0.256 and 0.441, respectively. Because it is a single alignment, we used 10fold cross validation to examine the performance of different estimation procedures [21]. In particular, the 3602 amino acid sites were randomly distributed across 10 nonoverlapping partitions P_{1}, …, P_{10} each consisting of 360 or 361 sites. The validation was repeated 10 times. Let V_{ t } (t = 1…10) denote the t^{ th } validation (i.e. the part P_{ t } was used as the testing data and the other 9 parts were used as the training data).
The log likelihood per site comparisons between the Mam ^{ S } matrix and matrices estimated by the FastMG procedure
$\mathbf{Ma}{\mathbf{m}}_{\mathbf{8}}^{\mathbf{R}}/\mathbf{Ma}{\mathbf{m}}_{\mathbf{8}}^{\mathbf{T}}$  $\mathbf{Ma}{\mathbf{m}}_{\mathbf{16}}^{\mathbf{R}}/\mathbf{Ma}{\mathbf{m}}_{\mathbf{16}}^{\mathbf{T}}$  $\mathbf{Ma}{\mathbf{m}}_{\mathbf{24}}^{\mathbf{R}}/\mathbf{Ma}{\mathbf{m}}_{\mathbf{24}}^{\mathbf{T}}$  $\mathbf{Ma}{\mathbf{m}}_{\mathbf{32}}^{\mathbf{R}}/\mathbf{Ma}{\mathbf{m}}_{\mathbf{32}}^{\mathbf{T}}$  $\mathbf{Ma}{\mathbf{m}}_{\mathbf{64}}^{\mathbf{R}}/\mathbf{Ma}{\mathbf{m}}_{\mathbf{64}}^{\mathbf{T}}$  

LogLK per site  0.72/0.04  0.58/0.06  0.49/0.05  0.42/0.05  0.26/0.02 
# Significantly better  10/0  10/0  10/0  10/0  10/0 
# Significantly worse  0/1  0/4  0/4  0/5  0/3 
The log likelihood per site comparisons between the original MtMam matrix and matrices estimated by the FastMG procedure
$\mathbf{Ma}{\mathbf{m}}_{\mathbf{8}}^{\mathbf{R}}/\mathbf{Ma}{\mathbf{m}}_{\mathbf{8}}^{\mathbf{T}}$  $\mathbf{Ma}{\mathbf{m}}_{\mathbf{16}}^{\mathbf{R}}/\mathbf{Ma}{\mathbf{m}}_{\mathbf{16}}^{\mathbf{T}}$  $\mathbf{Ma}{\mathbf{m}}_{\mathbf{24}}^{\mathbf{R}}/\mathbf{Ma}{\mathbf{m}}_{\mathbf{24}}^{\mathbf{T}}$  $\mathbf{Ma}{\mathbf{m}}_{\mathbf{32}}^{\mathbf{R}}/\mathbf{Ma}{\mathbf{m}}_{\mathbf{32}}^{\mathbf{T}}$  $\mathbf{Ma}{\mathbf{m}}_{\mathbf{64}}^{\mathbf{R}}/\mathbf{Ma}{\mathbf{m}}_{\mathbf{64}}^{\mathbf{T}}$  

LogLK per site  0.37/0.39  0.23/0.40  0.14/0.40  0.07/0.40  0.09/0.37 
# Significantly better  0/0  0/0  0/0  0/0  0/0 
# Significantly worse  0/10  0/10  1/10  4/10  7/10 
The running time (hours) of different estimation procedures
Mam^{S}  $\mathbf{Ma}{\mathbf{m}}_{\mathbf{8}}^{\mathbf{R}}/\mathbf{Ma}{\mathbf{m}}_{\mathbf{8}}^{\mathbf{T}}$  $\mathbf{Ma}{\mathbf{m}}_{\mathbf{16}}^{\mathbf{R}}/\mathbf{Ma}{\mathbf{m}}_{\mathbf{16}}^{\mathbf{T}}$  $\mathbf{Ma}{\mathbf{m}}_{\mathbf{24}}^{\mathbf{R}}/\mathbf{Ma}{\mathbf{m}}_{\mathbf{24}}^{\mathbf{T}}$  $\mathbf{Ma}{\mathbf{m}}_{\mathbf{32}}^{\mathbf{R}}/\mathbf{Ma}{\mathbf{m}}_{\mathbf{32}}^{\mathbf{T}}$  $\mathbf{Ma}{\mathbf{m}}_{\mathbf{64}}^{\mathbf{R}}/\mathbf{Ma}{\mathbf{m}}_{\mathbf{64}}^{\mathbf{T}}$  

Avg. time  22.2  0.5/0.4  1.5/0.9  2.2/1.4  2.7/1.9  4.3/3.6 
Speed up  42/61.4  14.8/24.5  10.2/16  8.2/11.6  5.1/6.1 
Trends and special considerations
At the start of our studies we anticipated that alignment splitting would result in only minimal deterioration of matrix quality. Instead we observed, on balance, a small improvement in the performance of FastMG^{T} matrices compared to those obtained without splitting. Although the effect is small it is consistent and suggests that a systematic effect is at play. Although further studies are needed to better understand these effects, it is plausible that the specific elimination of deeper and often less well defined branches contributes to the effect. In addition, treebased splitting results in alignments of more closely related sequences with lower risk of sequence alignment errors.
There were a number of alignments where the matrices inferred from the standard and FastMG^{T} procedures resulted in different tree topologies. This occurred in about 5075% of cases for HSSP alignments and 1025% of cases for the Pfam alignment (Tables 2 and 5, columns 4 and 5). This different rate of occurrences is expected because the HSSP alignments have more sequences and are therefore more prone to topology differences. We considered the possibility that achieving a significantly better likelihood score was due to finding a different tree topology. However, in cases where the likelihood score is significantly better only a minority of alignments has a different tree topology.
Our three test cases all include a considerable amount of sequence divergence and all show that treesplitting is superior to random splitting. A preliminary study on closely related influenza sequences showed that treesplitting is still optimal [9]. However, in extreme cases the number of observed substitutions upon treesplitting may become too small to be informative. In such a case random splitting may be preferred.
Another consideration is that our current FastMG procedure uses neighborjoining to create the tree needed for the treebased splitting algorithm. This scales well for alignments with hundreds or even thousands of sequences, but becomes inefficient for extremely large alignments. The performance of the FastMG procedure for such huge alignments will likely benefit from faster alignment splitting methods [23] and faster tree building methods (e.g. FastTree [24]) and this deserves further study if such cases become more common.
Conclusions
Amino acid replacement matrices are essential for many statistical methods to analyze protein sequences. Maximum likelihood methods typically generate the best replacement matrices because they can fully utilize information in the multiple protein alignments. However, for this application maximum likelihood analysis is complex and computationally expensive. Here we propose a modified maximum likelihood procedure to estimate amino acid replacement rate matrices from large data sets. The key component of the modified estimation procedure is the splitting algorithm that divides large alignments into multiple subalignments with fewer sequences that are subsequently used to estimate matrices. The extensive experiments showed that the FastMG^{T} procedure performed well with large data sets and reduces the running time as function of the number of sequences from approximately quadratic to linear, as we expect based on the time complexity of tree inference (see Methods section). FastMG^{T} with k ≥ 16 was about an order of magnitude faster than the standard estimation procedure while it did not reduce the quality of estimated matrices.
Experiments strongly suggest that k = 16 is a good choice for the FastMG^{T} estimation procedure. Even the analysis of the 100 largest alignments of the HSSP database took less than a day with FastMG^{T} and k = 16 on a typical desktop computer. Thus, our method now enables researchers to avoid generic precalculated matrices and instead estimate optimal amino acid replacement matrices for their particular needs from large data sets on their personal computers.
Methods
Model
As usual, the amino acid substitution process is assumed to be independent among amino acid sites. Although the data might not be compositionally homogeneous across the sequences in the alignment (e.g., different amino acid compositions among the clades), we assume that the standard model for the amino acid substitution process over the tree is a Markov process with timehomogeneous, timecontinuous, and timereversible properties [6–8] and references therein. The model is represented by a 20 × 20 instantaneous substitution rate matrix Q = {q_{ xy }} where q_{ xy }(x ≠ y) represents the number of substitutions from amino acid x to amino acid y per time unit. The diagonal elements q_{ xx } are assigned such that the row sums are all zero. Since the model is time reversible, the matrix Q can be decomposed into a symmetric exchangeability rate matrix R = {r_{ xy }} and an amino acid equilibrium frequency vector Π = {π_{ x }} such that q_{ xy } = r_{ xy }π_{ y } and ${q}_{xx}={\displaystyle \sum _{x\ne y}}{q}_{xy}\text{.}$
Model estimation procedure
where L(T_{ i }, ρ_{ i }, Q; D_{ i }) is the likelihood of protein alignment D_{ i } given phylogenetic tree T_{ i }; the rate variation model ρ_{ i }; and substitution model Q[7, 8].
A better model Q can be estimated from alignments of D using an iterative approach as detailed in the 4step standard estimation procedure as follows [8]:
Standard estimation procedure

Step 0: Input a set of multiple alignments D and a starting matrix Q (typically input only exchangeability rate matrix R, the frequency vector Π is estimated from the data D).

Step 1: Build phylogenetic tree T_{ i } and rate across site model ρ_{ i } for each alignment D_{ i } using maximum likelihood tree construction programs such as PhyML [14].

Step 2: Estimate a new exchangeability matrix Q’ using the approach described by Le and Gascuel [8] and the XRate software [19].

Step 3: Compare Q’ and Q, if they are nearly identical, return Q’ as the optimal model. Otherwise, assign Q ← Q’ and goto Step 1.
Previous studies have showed that this estimation procedure usually stops after three iterations. The procedure is called “standard maximum likelihood estimation procedure”.
Building maximum likelihood trees in Step 1 is a computationally expensive problem [12]. Although a number of heuristics have been proposed for searching maximum likelihood trees [13–16], Step 1 is still the bottleneck of the estimation process. It is roughly estimated [25] that the computing time of PhyML (and of other similar maximum likelihood approaches) is in O(n^{2}s), where n is the number of taxa and s the number of sites. Experiments with a number of different data sets have confirmed this approximation. Thus, it is expected that splitting the original alignments into two equallysized subalignments divides the total computing time by a factor two. This property explains why the computing time to build trees displayed in Tables 3, 6 and 9 is approximately linear as a function of k (size of subalignments).
Alignmentsplitting algorithms
Consider a multiple alignment D_{ i } of n sequences (d_{ i }^{1}, …, d_{ i }^{ n }), splitting alignment D_{ i } is a process to divide alignment D_{ i } into nonoverlapping smaller subalignments D_{ i }^{1}, …, D_{ i }^{ m } such that each sequence d_{ i }^{j=1…n} ∊ D_{ i } belongs to exactly one subalignment D_{ i }^{ t }(t = 1…m). For example, alignment D_{ i } of 8 sequences (d_{ i }^{1}, …, d_{ i }^{8}) can be split into three subalignments D_{ i }^{1} = (d_{ i }^{1}, d_{ i }^{2}, d_{ i }^{8}), D_{ i }^{2} = (d_{ i }^{3}, d_{ i }^{4}, d_{ i }^{5}), D_{ i }^{3} = (d_{ i }^{6}, d_{ i }^{7}). Let k be the maximum number of sequences in a subalignment; we seek a method of alignment splitting that allows us to minimize k, to maximize computational efficiency, while retaining adequate amounts of information in subalignments for estimating the amino acid replacement rates.
Random splitting algorithm
The random splitting algorithm is very simple; however, its main drawback is that sequences of the same subalignment might be very distantly related. This could compromise estimation of amino acid replacement rates.
Treebased splitting algorithm
Note that the distance matrix between sequences used in the Neighborjoining algorithm is estimated by the maximum likelihood method using the LG matrix [8]. Technically, we used BIONJ [27] (an improved version of the Neighborjoining algorithm) to split large alignments.
Fast maximum likelihood estimation procedure (FastMG)
The FastMG procedure consists of two phases: first, the large original alignments are split into nonoverlapping subalignments by one of the alignment splitting algorithms; then the matrix is estimated by joint maximum likelihood analysis of the smaller subalignments instead of the large original alignments. The FastMG procedure is described by the following 5steps
Fast maximum likelihood estimation procedure (FastMG)

Step 0: Input a set of multiple alignments D; a starting matrix Q (typically input only exchangeability rate matrix R, the frequency vector Π is estimated from the data D); and a threshold k.

Step 1: For each alignment D_{ i } ∊ D, split D_{ i } into subalignment D_{ i }^{1}….D_{ i }^{ m } using either the random splitting algorithm or the treebased splitting algorithm.

Step 2: Build phylogenetic tree T_{ i }^{ j }and rate across site model ρ_{ i }^{ j } for each subalignment D_{ i }^{ j } using maximum likelihood tree construction programs such as PhyML [14].

Step 3: Estimate a new matrix Q’ from subalignments using the approach described by Le and Gascuel [8] and the XRate software [19].

Step 4: Compare Q’ and Q, if they are nearly identical, return Q’ as the optimal model. Otherwise, assign Q ← Q’ and go to Step 2.
Declarations
Acknowledgements
This work is financially supported by Vietnam National Foundation for Science and Technology (102.012013.04). BH and OG were supported by LABEX NUMEV to work on this study.
Authors’ Affiliations
References
 Felsenstein J: Inferring Phylogenies. 2004, Sunderland, MA, USA: Sinauer AssociatesGoogle Scholar
 Yang Z: Computational Molecular Evolution. 2006, Oxford, UK: Oxford University PressView ArticleGoogle Scholar
 Thorne JL: Models of protein sequence evolution and their applications. Curr Opin Genet Dev. 2000, 10: 602605. 10.1016/S0959437X(00)001428.View ArticlePubMedGoogle Scholar
 Dayhoff M, Schwartz R, Orcutt B: A model of evolutionary change in proteins. Atlas Protein Seq Struct. 1978, 5: 345351.Google Scholar
 Jones DT, Taylor WR, Thornton JM: The rapid generation of mutation data matrices from protein sequences. Comput Appl Biosci CABIOS. 1992, 8: 275282.Google Scholar
 Adachi J, Hasegawa M: Model of amino acid substitution in proteins encoded by mitochondrial DNA. J Mol Evol. 1996, 42: 459468. 10.1007/BF02498640.View ArticlePubMedGoogle Scholar
 Whelan S, Goldman N: A general empirical model of protein evolution derived from multiple protein families using a maximumlikelihood approach. Mol Biol Evol. 2001, 18: 691699. 10.1093/oxfordjournals.molbev.a003851.View ArticlePubMedGoogle Scholar
 Le QS, Gascuel O: An improved general amino acid replacement matrix. Mol Biol Evol. 2008, 25: 13071320. 10.1093/molbev/msn067.View ArticlePubMedGoogle Scholar
 Le DV, Dang CC, Le QS, Le VS: A Fast and Efficient Method for Estimating Amino Acid Substitution Models. Proceedings of The Third International Conference on Knowledge and Systems Engineering. Edited by: Ho TB, McKay RI, Nguyen XH, Bui TD. 2011, New York, NY, USA: IEEE Publishing, 8591.Google Scholar
 Dang CC, Lefort V, Le VS, Le QS, Gascuel O: ReplacementMatrix: a web server for maximumlikelihood estimation of amino acid replacement rate matrices. Bioinformatics. 2011, 27: 27582760. 10.1093/bioinformatics/btr435.View ArticlePubMedGoogle Scholar
 Dang CC, Le QS, Gascuel O, Le VS: FLU, an amino acid substitution model for influenza proteins. BMC Evol Biol. 2010, 10: 99110. 10.1186/147121481099.View ArticlePubMed CentralPubMedGoogle Scholar
 Chor B, Tuller T: Maximum likelihood of evolutionary trees: hardness and approximation. Bioinformatics. 2005, 21: 97106. 10.1093/bioinformatics/bti1027.View ArticleGoogle Scholar
 Guindon S, Gascuel O: A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood. Syst Biol. 2003, 52: 696704. 10.1080/10635150390235520.View ArticlePubMedGoogle Scholar
 Guindon S, Dufayard JF, Lefort V, Anisimova M, Hordijk W, Gascuel O: New algorithms and methods to estimate maximumlikelihood phylogenies: assessing the performance of PhyML 3.0. Syst Biol. 2010, 59: 307321. 10.1093/sysbio/syq010.View ArticlePubMedGoogle Scholar
 Le VS, von Haeseler A: IQPNNI: moving fast through tree space and stopping in time. Mol Biol Evol. 2004, 21: 15651571. 10.1093/molbev/msh176.View ArticleGoogle Scholar
 Stamatakis A, Ludwig T, Meier H: RAxMLIII: a fast program for maximum likelihoodbased inference of large phylogenetic trees. Bioinformatics. 2005, 21: 456463. 10.1093/bioinformatics/bti191.View ArticlePubMedGoogle Scholar
 Schneider R, de Daruvar A, Sander C: The HSSP database of protein structuresequence alignments. Nucleic Acids Res. 1997, 25: 226230. 10.1093/nar/25.1.226.View ArticlePubMed CentralPubMedGoogle Scholar
 Bateman A, Birney E, Cerruti L, Durbin R, Etwiller L, Eddy SR, GriffithsJones S, Howe KL, Marshall M, Sonnhammer ELL: The Pfam protein families database. Nucleic Acids Res. 2002, 30: 276280. 10.1093/nar/30.1.276.View ArticlePubMed CentralPubMedGoogle Scholar
 Klosterman PS, Uzilov AV, Bendaña YR, Bradley RK, Chao S, Kosiol C, Goldman N, Holmes I: XRate: a fast prototyping, training and annotation tool for phylogrammars. BMC Bioinformatics. 2006, 7: 428453. 10.1186/147121057428.View ArticlePubMed CentralPubMedGoogle Scholar
 Kishino H, Hasegawa M: Evaluation of the maximum likelihood estimate of the evolutionary tree topologies from DNA sequence data, and the branching order in hominoidea. J Mol Evol. 1989, 29: 170179. 10.1007/BF02100115.View ArticlePubMedGoogle Scholar
 Kohavi R: A Study of Crossvalidation and Bootstrap for Accuracy Estimation and Model Selection. Proceedings of the 14th International Joint Conferences on Artificial Intelligence. 1995, Montreal: Morgan Kaufmann Publishers Inc, 11371143.Google Scholar
 Yang Z, Nielsen R, Hasegawa M: Models of amino acid substitution and applications to mitochondrial protein evolution. Mol Biol Evol. 1998, 15: 16001611. 10.1093/oxfordjournals.molbev.a025888.View ArticlePubMedGoogle Scholar
 Blackshields G, Larkin M, Wallace IM, Wilm A, Higgins DG: Fast embedding methods for clustering tens of thousands of sequences. Comput Biol Chem. 2008, 32 (4): 282286. 10.1016/j.compbiolchem.2008.03.005.View ArticlePubMedGoogle Scholar
 Price MN, Dehal PS, Arkin AP: FastTree 2 – approximately maximumlikelihood trees for large alignments. PLoS ONE. 2010, 5: e949010.1371/journal.pone.0009490.View ArticlePubMed CentralPubMedGoogle Scholar
 Dereeper A, Guignon V, Blanc G, Audic S, Buffet S, Chevenet F, Dufayard JF, Guindon S, Lefort V, Lescot M, Claverie JM, Gascuel O: Phylogeny.fr: robust phylogenetic analysis for the nonspecialist. Nucleic Acids Res. 2008, 36 (suppl 2): W465W469.View ArticlePubMed CentralPubMedGoogle Scholar
 Saitou N, Nei M: The neighborjoining method: a new method for reconstructing phylogenetic trees. Mol Biol Evol. 1987, 4: 406425.PubMedGoogle Scholar
 Gascuel O: BIONJ: an improved version of the NJ algorithm based on a simple model of sequence data. Mol Biol Evol. 1997, 14: 685695. 10.1093/oxfordjournals.molbev.a025808.View ArticlePubMedGoogle Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.