Systematic exploration of guide-tree topology effects for small protein alignments
Fabian Sievers^{1}, Graham M Hughes^{1} and Desmond G Higgins^{1}
https://doi.org/10.1186/1471-2105-15-338
© Sievers et al.; licensee BioMed Central Ltd. 2014
Received: 30 May 2014
Accepted: 25 September 2014
Published: 4 October 2014
Abstract
Background
Guide-trees are used as part of an essential heuristic to enable the calculation of multiple sequence alignments. They have been the focus of much method development, but there has been little effort at determining systematically which guide-trees, if any, give the best alignments. Some guide-tree construction schemes are based on pairwise distances amongst unaligned sequences. Others try to emulate an underlying evolutionary tree and involve various iteration methods.
Results
We explore all possible guide-trees for a set of protein alignments of up to eight sequences. We find that pairwise distance based default guide-trees sometimes outperform evolutionary guide-trees, as measured by structure derived reference alignments. However, default guide-trees fall well short of the optimum attainable scores. On average, chained guide-trees perform better than balanced ones but are not better than default guide-trees for small alignments.
Conclusions
Alignment methods that use consistency or hidden Markov models to make alignments are less susceptible to suboptimal guide-trees than simpler methods that basically use conventional sequence alignment between profiles. The latter appear to be affected positively by evolutionary based guide-trees for difficult alignments and negatively for easy alignments. One phylogeny-aware alignment program can strongly discriminate between good and bad guide-trees. The results for randomly chained guide-trees improve with the number of sequences.
Keywords
Multiple sequence alignment, Guide-tree topology, Alignment accuracy, Benchmarking
Background
Multiple Sequence Alignments (MSAs) are an integral part of many bioinformatics analyses. From an evolutionary perspective, MSAs can be considered as attempts at arranging sequences in such a way that homologous residues occupy the same columns. This will necessitate the introduction of gaps into some or all of the sequences. A good alignment is found by searching for one that attains a favourable score by penalising the introduction of gaps and substitution of residues. Given a scoring function it is possible to determine the optimum alignment of two sequences using dynamic programming [1]. However, as time and memory requirements for determining the optimum alignment grow exponentially with the number of sequences, an exact solution is not feasible for more than a few sequences. This is why the ‘progressive alignment’ heuristic was developed [2]. In progressive alignment, initially only pairs of sequences are aligned, producing so-called profiles, which are then in turn aligned in a pairwise fashion with other sequences or ever growing profiles. The order in which sequences and profiles are aligned is determined by a so-called ‘guide-tree’ [3]. A shortcoming of progressive alignment is that an arrangement of residues that was determined early in the MSA cannot be changed at a later stage [2]. Consequently the guide-tree has an effect on the quality of the final alignment, and there are different strategies for constructing such a guide-tree. A common strategy [4–6] is to determine pairwise distances amongst the sequences and to construct from these distances either a UPGMA [7] or neighbour-joining [8] tree. In another strategy [9] the guide-tree is made to resemble the evolutionary tree that is imputed to have given rise to the sequences. Recently we have demonstrated [10] that randomly labeled chained guide-trees produce good alignments for very large numbers of protein sequences. This approach is similar to early MSA strategies [11, 12] where sequences were simply added to a growing MSA, one by one. It is also similar to how alignments are generated for the Pfam [13] alignment database.
The purpose of the current study is to systematically assess the impact of the guide-tree on the quality of alignments of small numbers of protein sequences, where the accuracy can be measured using protein structure derived reference alignments. By confining ourselves to small numbers of sequences, we can systematically generate and test every possible guide-tree topology.
Methods
We will construct guide-trees for different protein families with a small number of sequences and measure their respective alignment qualities, using several commonly used MSA programs on benchmark data derived from protein structure based alignments. The quality of the alignments is evaluated in terms of total column (TC) and sum-of-pairs (SP) score.
Aligners
We consider MSA programs that are in common use, allow input of an external guide-tree and are fast. We used the following programs with the respective command-line arguments:
--anysymbol --retree 1 --maxiterate 0 --unweight

MAFFT L-INS-i v7.029b [5]
--anysymbol --retree 1 --maxiterate 0 --unweight
These aligners are based on related algorithmic approaches. Clustal Omega converts sequences and intermediate profiles into hidden Markov models (HMMs) and aligns these HMMs using HHalign [15]. By default it makes very fast guide-trees using the mBed algorithm [16] and does not use iteration, although both guide-tree and alignment can be iterated on request. MUSCLE uses a standard profile-to-profile alignment method but is very highly optimised and makes extensive use of iteration to gradually improve the alignment and guide-tree. MAFFT L-INS-i uses consistency as introduced in [17] and as such is only suitable for relatively small numbers of sequences. It also makes use of iteration. MAFFT FFT-NS-i uses Fast Fourier Transforms for very fast pairwise alignments and is the standard MAFFT program for fast high-throughput alignment of medium to large numbers of sequences. PAGAN uses a phylogeny-aware graph alignment algorithm and relies explicitly on having a phylogenetic tree as guide-tree. These have to be generated outside of PAGAN.
The main purpose of this study is to correlate alignment scores with particular tree topologies for the basic profile-profile alignment engine. Therefore we disable iteration, as this will modify the alignment order. By iteration we mean a process that attempts to improve the objective score by repeatedly adjusting an initial MSA. This process can entail a modification of the guide-tree or a realignment of individual sequences onto a preliminary alignment. Consequently the scores for MAFFT (FFT-NS-i/L-INS-i) and MUSCLE in this study are lower than in their respective default modes, where iteration is enabled. We do give these default scores in Additional file 1: Supplement S1.
To score an alignment we use the Total Column (TC) and Sum-of-Pairs (SP) scores, as implemented by qscore [6].
Benchmark data
Benchmark data were extracted from the HOMSTRAD database [18]. We selected single-domain protein families with at least 5 sequences and multi-domain families with 8 or more sequences. If more than 8 sequences were available for a particular family then we reduced the number down to 8, picking the sequences randomly. If more than 12 sequences were available then we created an extra test family, possibly with resampling. If the number of protein structures was between 5 and 7, then we supplemented the sequences with homologous Pfam sequences [13]. In this case scoring of the alignment can only be done on the embedded HOMSTRAD reference alignment. We assembled 153 protein families; 75 were augmented with additional Pfam sequences and 15 were resamples of larger HOMSTRAD families. Alignment lengths vary between 35 and 936, average sequence lengths vary between 28.5 and 780.8 and average pairwise identities range between 14.76% and 77.55%. 15 families comprised multiple (up to 3) domains. A summary of reference family statistics can be found in Additional file 1: Supplement S2.
Guide-trees
Distance based default guide-trees
Progressive alignment is a 2-stage process. The first stage is the guide-tree construction; the second stage comprises the alignment of individual sequences and successively larger profiles, as specified by the guide-tree. In this study we will treat the second stage, that is the profile-profile aligner, as a ‘black box’ and focus on the first stage. In order to construct the guide-tree, many multiple sequence aligners construct a matrix of pairwise distances. These distances can be k-tuple distances of unaligned sequences or full alignment distances. For small numbers of sequences, N, it is feasible to construct a full N × N distance matrix; if the number of sequences is large (usually N > 10,000), then time and memory may be conserved by calculating distances of the N sequences to only a small number of seeds, n ≪ N [16]. The distance matrix can then be converted into a guide-tree, using Neighbour-Joining or UPGMA algorithms. The version of PAGAN that we used in this analysis does not construct a default guide-tree. As mentioned earlier, we turn off all iterations, which would interfere with our guide-tree selection.
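The first stage described above can be illustrated with a minimal Python sketch of a full N × N matrix of k-tuple distances. The distance used here (one minus the fraction of shared k-tuples, relative to the shorter sequence) is an assumption chosen for simplicity and is not claimed to be the measure used by any of the aligners in this study; the function names `ktuple_distance` and `distance_matrix` are hypothetical.

```python
from collections import Counter


def ktuple_distance(a: str, b: str, k: int = 2) -> float:
    """Illustrative k-tuple distance between two unaligned sequences.

    Assumption: distance = 1 - (shared k-tuples / k-tuples in the shorter
    sequence). Real aligners use their own, more refined definitions.
    """
    tuples_a = Counter(a[i:i + k] for i in range(len(a) - k + 1))
    tuples_b = Counter(b[i:i + k] for i in range(len(b) - k + 1))
    # Counter intersection keeps the minimum multiplicity of each k-tuple.
    shared = sum((tuples_a & tuples_b).values())
    shortest = min(len(a), len(b)) - k + 1
    if shortest <= 0:
        return 1.0
    return 1.0 - shared / shortest


def distance_matrix(seqs: list[str], k: int = 2) -> list[list[float]]:
    """Full N x N matrix; feasible for the small N considered here."""
    return [[ktuple_distance(a, b, k) for b in seqs] for a in seqs]


if __name__ == "__main__":
    for row in distance_matrix(["ACDEFGH", "ACDEFGK", "WWYYWWY"]):
        print(["%.2f" % d for d in row])
```

Such a matrix would then be handed to a clustering method (for example UPGMA or Neighbour-Joining) to obtain the guide-tree.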
Guide-trees based on estimated phylogeny
We do not know the true phylogeny of the test sets we align but we do have high quality reference alignments. These were used to estimate the phylogeny using a range of methods. The best-fit empirical model of amino acid sequence evolution for each reference alignment was determined using ProtTest 3 [19]. Each model was determined using the Akaike Information Criterion (AIC) [20], corrected Akaike Information Criterion (AICc) [21], Bayesian Information Criterion (BIC) [22] and Decision Theory Criterion (DT) [23]. The most likely tree for each alignment was inferred using the maximum likelihood approach employed by RAxML [24]. In addition to the best-fit model of sequence evolution, the Generalised Time Reversible (GTR) model and the GTR model where a fraction of amino acids is considered invariable (‘+I’) were used for each alignment. In all cases, GTR or GTR+I trees produced higher log likelihood scores than the best-fit model predicted using ProtTest 3, and so were considered to be the most likely tree available. For 78 families we had species information for all 8 sequences. Out of these we were able to root 43 trees by hand. Of these, 13 trees were midpoint rooted and another 9 trees had a Robinson-Foulds distance of 2 from the midpoint rooted tree. We also tried all 14 possible rootings for all 153 families and ranked the quality of the alignment, using the midpoint rooting. For Clustal Omega midpoint rooting was the best in 61/153 cases, for MUSCLE 45/153, for default MAFFT 55/153 and for MAFFT L-INS-i 54/153. We therefore used hand-rooted trees where they were available and thought it reasonable to use midpoint rooted trees where no better tree could be obtained. Trees were (midpoint) rooted using PHYLIP’s retree command [25]. The list of estimated phylogenetic trees, henceforth called ML trees, can be found in Additional file 1: Supplement S3, where we also show that 133/153 trees are within 1σ of the imbalance expected under an equal rates Markov model [26].
Systematic guide-tree construction
The number of possible guide-trees grows with the number of sequences N. For a rooted tree the number of labeled guide-trees is L_{N} = (2N-3)!! [27]. In the present study we will analyse L_{4} = 15, L_{5} = 105, L_{6} = 945, L_{7} = 10,395 and in particular L_{8} = 135,135. No closed formula is known for the number of unlabeled guide-tree topologies U_{N}, but in this study we use U_{4} = 2, U_{5} = 3, U_{6} = 6, U_{7} = 11 and U_{8} = 23 [28]. In general, there are N! ways to distribute N sequence labels amongst the leaves of a guide-tree; however, sequence alignment should be a symmetric process, so that every degree of symmetry decreases the number of topologically distinct labeled guide-trees by a factor of 2. For example, a perfectly balanced tree with 4 sequences, ((1,2),(3,4)), has three degrees of symmetry, that is (1,2) ↔ (2,1), (3,4) ↔ (4,3) and ((1,2),(3,4)) ↔ ((3,4),(1,2)), so that there are B_{4} = 4!/2^{3} = 3 distinct balanced trees with 4 leaves. These are ((1,2),(3,4)), ((1,3),(2,4)) and ((1,4),(2,3)). A perfectly chained tree of 4 sequences, (((1,2),3),4), has only one degree of symmetry, that is (1,2) ↔ (2,1), so that there are C_{4} = 4!/2^{1} = 12 distinct chained trees of 4 sequences, given in Additional file 1: Supplement S4. There are no other topologically distinct unlabeled trees for 4 sequences. This example is consistent, as B_{4} + C_{4} = 3 + 12 = 15 = L_{4}, which is the expected number of all labeled trees with 4 leaves.
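The counts quoted above can be reproduced with a short Python sketch. It evaluates the double factorial (2N-3)!! for labeled trees, uses the standard recurrence for the Wedderburn-Etherington numbers for unlabeled topologies, and applies the N!/2^s symmetry argument. The function names are hypothetical and chosen for illustration only.

```python
from functools import lru_cache
from math import factorial


def labeled_rooted_trees(n: int) -> int:
    """Number of labeled rooted binary trees with n leaves: (2n-3)!!."""
    count = 1
    for odd in range(3, 2 * n - 2, 2):  # 3, 5, ..., 2n-3
        count *= odd
    return count


@lru_cache(maxsize=None)
def unlabeled_rooted_trees(n: int) -> int:
    """Wedderburn-Etherington number: rooted binary tree shapes with n leaves."""
    if n <= 1:
        return n
    total = 0
    # Root splits the leaves into two subtrees of unequal sizes i < n - i.
    for i in range(1, (n - 1) // 2 + 1):
        total += unlabeled_rooted_trees(i) * unlabeled_rooted_trees(n - i)
    # Equal split: unordered pairs of shapes, including identical pairs.
    if n % 2 == 0:
        half = unlabeled_rooted_trees(n // 2)
        total += half * (half + 1) // 2
    return total


def labelings(n: int, symmetries: int) -> int:
    """Distinct labelings of one shape with the given degrees of symmetry."""
    return factorial(n) // 2 ** symmetries


if __name__ == "__main__":
    for n in (4, 5, 6, 7, 8):
        print(n, labeled_rooted_trees(n), unlabeled_rooted_trees(n))
    # Balanced (3 symmetries) plus chained (1 symmetry) trees for N = 4.
    print(labelings(4, 3) + labelings(4, 1))
```

The last line sums the labelings of the two shapes for 4 leaves and should agree with L_{4}.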
We are particularly interested in the case of 8 sequences, as these trees can be perfectly balanced. This means that at every internal node there is an equal number of sequences subtended by both branches. Only trees with N leaves, where N is a power of 2, can be perfectly balanced. The next such tree has 16 sequences. For 16 sequences there are U_{16} = 10,905 unlabeled trees and L_{16} ≈ 6.2 × 10^{15} labeled trees. A complete exploration for 16 sequences is outside the scope of this study. However, in Additional file 1: Supplement S5 we present results for 101 topologically distinct trees of 16 sequences, each labeled in 10,000 different ways. Perfectly imbalanced trees are trees where at every internal node (at least) one of the two branches subtends exactly one sequence. These trees are sometimes referred to as pectinate (comb-like) or linear; we call them chained. For N > 4 sequences there are, aside from perfectly balanced and perfectly chained trees, trees of an intermediate degree of balance. Several measures to quantify this degree are in use, for example, Sackin’s index [29], the index described by Colless [30], the inverse-maximum index, as described by Sokal [31], and Shannon entropy. However, apart from the perfectly balanced and chained trees, none of these indices give exactly the same ranking of trees (for 8 or more sequences), so that the ordering of trees according to their degree of balance is somewhat fuzzy and depends on the specific aspect of the property measured by the respective index. In Additional file 1: Supplement S6 we show all unlabeled guide-trees for 4, 5, 6, 7 and 8 sequences and quote their respective measures of im/balance.
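Two of the balance measures mentioned above are simple to compute. The following Python sketch assumes the common textbook definitions (Sackin's index as the sum of leaf depths, the unnormalised Colless index as the sum over internal nodes of the absolute difference in leaf counts of the two branches) and represents trees as nested 2-tuples. The normalisation used in the supplement may differ.

```python
def leaf_count(tree) -> int:
    """Number of leaves below a node; leaves are any non-tuple objects."""
    if not isinstance(tree, tuple):
        return 1
    return sum(leaf_count(child) for child in tree)


def sackin(tree, depth: int = 0) -> int:
    """Sackin's index: sum of the depths of all leaves."""
    if not isinstance(tree, tuple):
        return depth
    return sum(sackin(child, depth + 1) for child in tree)


def colless(tree) -> int:
    """Unnormalised Colless index for a strictly binary tree."""
    if not isinstance(tree, tuple):
        return 0
    left, right = tree
    imbalance = abs(leaf_count(left) - leaf_count(right))
    return imbalance + colless(left) + colless(right)


if __name__ == "__main__":
    balanced = (((1, 2), (3, 4)), ((5, 6), (7, 8)))
    chained = (((((((1, 2), 3), 4), 5), 6), 7), 8)
    for name, tree in (("balanced", balanced), ("chained", chained)):
        print(name, "Sackin:", sackin(tree), "Colless:", colless(tree))
```

Both indices take their smallest value for the perfectly balanced tree and their largest for the perfectly chained tree; they can disagree on the ordering of intermediate topologies.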
Different clustering schemes
Apart from the aligners’ default and the ML trees we tried various other clustering schemes, as outlined in [32]. We consider Single Linkage (SL), merging the clusters for which the minimum distance between their elements is smallest; Complete Linkage (CL), merging the clusters for which the maximum distance between their elements is smallest; Mean Linkage (MeanL), merging the clusters for which the Euclidean distance between their centroids or means is smallest; and Ward’s Criterion, merging the clusters for which the increase in variance for the resulting group is smallest. In addition, we considered UPGMA and Neighbour-Joining trees, as produced by Clustal W2 [33].
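As an illustration of how the linkage criterion shapes the tree, the following Python sketch implements naive agglomerative clustering for single and complete linkage only (mean linkage and Ward's criterion are omitted). It is not the implementation used in the study; the function `cluster` and the toy distances are assumptions made for this example.

```python
from itertools import combinations


def cluster(dist, linkage=min):
    """Naive agglomerative clustering on a square distance matrix.

    linkage=min gives single linkage, linkage=max gives complete linkage.
    Returns the tree as nested 2-tuples of sequence indices.
    """
    # Each cluster is a (subtree, member indices) pair.
    clusters = [(i, [i]) for i in range(len(dist))]
    while len(clusters) > 1:
        best = None
        for a, b in combinations(range(len(clusters)), 2):
            d = linkage(dist[i][j]
                        for i in clusters[a][1] for j in clusters[b][1])
            if best is None or d < best[0]:
                best = (d, a, b)
        _, a, b = best
        merged = ((clusters[a][0], clusters[b][0]),
                  clusters[a][1] + clusters[b][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return clusters[0][0]


if __name__ == "__main__":
    # Toy example: four points on a line with growing gaps.
    positions = [0, 10, 22, 36]
    dist = [[abs(p - q) for q in positions] for p in positions]
    print("single:  ", cluster(dist, min))
    print("complete:", cluster(dist, max))
```

For these toy distances single linkage yields a chained topology while complete linkage yields a balanced one, which shows that the choice of clustering scheme alone can change the guide-tree shape.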
Populating chained guide-trees
It has been shown [10] that randomly populated chained guide-trees on average produce good alignments. However, any particular randomly populated chained guide-tree might in fact produce a bad alignment. One would like to select an ordering with the best possible outcome. In order to determine such an ordering scheme we will arrange the sequences according to their length, hydrophobic moment (HM), isoelectric point (IP) and sequence similarity. For the HM and IP we consider absolute values and values normalised by the sequence length. HM and IP are calculated according to [34]. For all criteria we sort in ascending and descending order. Sequences cannot always be uniquely sorted according to just one sort key, as there may be ties. We therefore also explore secondary sort keys.
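A chained guide-tree populated by one sort key can be written down directly in Newick format. The Python sketch below sorts by sequence length and breaks ties by sequence name; the tie-breaking rule, the function names and the convention that the first two sequences in the sorted order are aligned first are assumptions for illustration, not the study's driver scripts.

```python
def chained_newick(names: list[str]) -> str:
    """Build a perfectly chained (comb-like) rooted tree in Newick format.

    The first two names form the innermost pair; each further name is
    joined to the growing chain one at a time.
    """
    tree = names[0]
    for name in names[1:]:
        tree = f"({tree},{name})"
    return tree + ";"


def chain_by_length(seqs: dict[str, str], descending: bool = False) -> str:
    """Order sequences by length (ties broken by name) and chain them."""
    order = sorted(seqs, key=lambda name: (len(seqs[name]), name),
                   reverse=descending)
    return chained_newick(order)


if __name__ == "__main__":
    seqs = {"a": "MKV", "b": "MKVLA", "c": "MK", "d": "MKVL"}
    print(chain_by_length(seqs))                   # ascending length
    print(chain_by_length(seqs, descending=True))  # descending length
```

Other sort keys, such as hydrophobic moment or isoelectric point, could be substituted by changing the `key` function.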
Benchmarking
For each protein family, we allowed each aligner to construct a default guide-tree. Using this default guide-tree the aligners construct an initial alignment, without iteration. We call this alignment the default-tree alignment as it uses the default guide-tree, despite not using default command-line flags. The version of PAGAN that we used does not construct a default guide-tree. In a next step we use the ML tree as the guide-tree, again without iteration. We then used all possible guide-trees to align the sequences. The alignments were scored using qscore. We collected the TC score, which is defined as the fraction of correctly aligned core columns of residues of all core columns in the alignment. Core columns were determined where the structural superposition for every reference sequence agreed. We accepted helix, sheet and coil states, since the Euclidean distance between each pair of alpha carbons within the column was within a threshold of 0.3 nm [35]. We rejected the JOY criterion [18], where only 70% of sequences have to agree, as this produced too many columns which by visual inspection could not be deemed reliably aligned. This reduces the range of alignment lengths from [35:936] to [6:526]; the largest reduction was down to 12.2% (74 → 9), while the largest retention was 88.6% (323 → 286). The behaviour of four example families can be found in Additional file 1: Supplement S7. The tree/s yielding the highest TC score is/are then easily identified by sorting the 135,135 TC scores for the different trees. We count how many trees produce the same, highest TC score.
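The TC score as defined above can be sketched in a few lines of Python. This is not the qscore implementation: the sketch assumes alignments are given as dictionaries of equal-length gapped strings with "-" as the gap character, and, unless a list of core column indices is supplied, it treats every gap-free reference column as a core column, which is a simplification of the structural criterion described in the text.

```python
def columns(alignment: dict[str, str], names: list[str]) -> list[tuple]:
    """Describe each column by the residue index of every sequence (None = gap)."""
    counters = {name: 0 for name in names}
    cols = []
    for pos in range(len(alignment[names[0]])):
        col = []
        for name in names:
            if alignment[name][pos] == "-":
                col.append(None)
            else:
                col.append(counters[name])
                counters[name] += 1
        cols.append(tuple(col))
    return cols


def tc_score(reference: dict[str, str], test: dict[str, str], core=None) -> float:
    """Fraction of core reference columns reproduced exactly in the test alignment."""
    names = sorted(reference)
    ref_cols = columns(reference, names)
    if core is None:
        # Assumption: gap-free reference columns stand in for core columns.
        core = [i for i, col in enumerate(ref_cols) if None not in col]
    test_cols = set(columns(test, names))
    hits = sum(1 for i in core if ref_cols[i] in test_cols)
    return hits / len(core) if core else 0.0


if __name__ == "__main__":
    reference = {"s1": "ACD-E", "s2": "ACDFE"}
    test = {"s1": "ACDE-", "s2": "ACDFE"}
    print(tc_score(reference, test))
```

In the toy example three of the four gap-free reference columns are recovered by the test alignment.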
Results
Estimated phylogenetic trees
Best possible trees
In the bottom right-hand panel of Figure 3 we show the proportion of times the ML guide-tree has a certain distance from the guide-tree giving the best possible TC score, as measured by the Robinson-Foulds (RF) metric for rooted trees [36]. Trees that are isomorphic and label preserving have an RF distance of 0; the maximum RF distance for two rooted trees with 8 sequences is 12. If for any family more than one tree produced the same highest score we registered the tree with the lowest RF distance. The bottom right-hand panel of Figure 3 shows that in less than 12% (5.2%–11.8%) of the families the best possible tree is isomorphic and label preserving (RF = 0) with respect to the ML tree. On the other hand, for roughly a sixth (14.4%–19.6%) of families the ML tree is as far away from the best tree as possible (RF = 12) under the RF metric. The average RF distances for the different aligners are 7.49 (Clustal Omega), 7.01 (L-INS-i), 7.02 (MAFFT FFT-NS-i), 7.56 (MUSCLE) and 8.14 (PAGAN), all more than half the maximum distance.
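The RF distance for rooted trees can be computed by comparing the sets of leaves (clusters) below each internal node. The Python sketch below assumes trees are given as nested tuples over the same leaf labels and counts the clusters present in exactly one of the two trees; trivial clusters (single leaves and the full leaf set) are ignored. For 8 leaves each rooted binary tree has 6 non-trivial clusters, which gives the maximum of 12 quoted above.

```python
def clusters(tree) -> set:
    """Non-trivial clusters: leaf sets below every internal node except the root."""
    found = set()

    def leaves(node) -> frozenset:
        if not isinstance(node, tuple):
            return frozenset([node])
        below = frozenset().union(*(leaves(child) for child in node))
        found.add(below)
        return below

    everything = leaves(tree)
    found.discard(everything)  # the root cluster is shared by all trees
    return found


def rf_distance(tree_a, tree_b) -> int:
    """Rooted Robinson-Foulds distance: size of the symmetric difference."""
    return len(clusters(tree_a) ^ clusters(tree_b))


if __name__ == "__main__":
    balanced = (((1, 2), (3, 4)), ((5, 6), (7, 8)))
    chained = (((((((1, 2), 3), 4), 5), 6), 7), 8)
    shuffled = (((((((1, 3), 5), 7), 2), 4), 6), 8)
    print(rf_distance(balanced, balanced))
    print(rf_distance(balanced, chained))
    print(rf_distance(balanced, shuffled))
```

Swapping the two branches at any internal node leaves the clusters unchanged, so trees that are isomorphic and label preserving have a distance of 0.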
Different clustering schemes
TC scores for different aligners and clustering schemes
Guide-tree   Clustal Omega   MUSCLE   MAFFT   L-INS-i   PAGAN
Default      0.757           0.722    0.718   0.754     NA
RAxML        0.743           0.731    0.718   0.740     0.532
UPGMA        0.753           0.738    0.734   0.749     0.535
NJ           0.744           0.725    0.711   0.735     0.527
Single       0.759           0.744    0.736   0.754     0.535
Complete     0.742           0.732    0.721   0.735     0.526
Mean         0.750           0.737    0.730   0.742     0.541
Ward         0.711           0.700    0.686   0.709     0.498
Populating chained trees
TC scores for chained guide-trees populated according to certain criteria
Aligner:   Clustal Omega   MUSCLE   MAFFT   L-INS-i   PAGAN
len/a      0.731           0.711    0.697   0.723     0.453
len/d      0.700           0.693    0.671   0.684     0.447
HM/a       0.705           0.704    0.680   0.696     0.451
HM/d       0.717           0.697    0.693   0.710     0.454
HML/a      0.709           0.718    0.695   0.706     0.458
HML/d      0.706           0.692    0.678   0.695     0.453
IP/a       0.731           0.701    0.692   0.717     0.466
IP/d       0.698           0.699    0.671   0.693     0.459
IPL/a      0.717           0.701    0.691   0.707     0.449
IPL/d      0.705           0.704    0.683   0.705     0.454
hi/a       0.685           0.685    0.649   0.674     0.423
hi/d       0.745           0.731    0.724   0.742     0.483
lo/a       0.654           0.641    0.598   0.614     0.397
lo/d       0.730           0.722    0.717   0.729     0.461
def        0.757           0.722    0.718   0.754     NA
RAxML      0.743           0.731    0.718   0.740     0.532
Tree branch lengths
Optimum guide-tree discrimination
Average TC score for different topologies
We first observe a distinct increase in the TC scores with increasing imbalance. This means that a randomly labeled chained tree is on average better than a randomly labeled balanced tree. This is true for all aligners considered in this study. On the other hand, it is not true to say that all balanced trees produce bad alignments, as the top whisker of the balanced box overlaps well with the top whisker of the chained box. Secondly, we observe that the default score is always above the median score for any of the guide-trees. This means that the default guide-tree for 8 sequences is on average better than a randomly populated guide-tree, whether it be chained or balanced. This is particularly true for L-INS-i and Clustal Omega; however, for MAFFT FFT-NS-i and especially MUSCLE the default guide-tree is on average only marginally better than a randomly labeled chained guide-tree. This is also true for 16 or fewer sequences, as shown in Additional file 1: Supplement S9. In Additional file 1: Supplement S10 we show how often certain tree topologies produced the best and the worst results.
Discussion
We found that for Clustal Omega and MAFFT L-INS-i TC scores for the default tree and the ML tree are tightly correlated. This may be in part due to the underlying profile-profile alignment strategy. Clustal Omega uses HMMs and L-INS-i uses consistency; both appear to confer a certain degree of ‘robustness’ with respect to the choice of guide-tree. For MAFFT FFT-NS-i and particularly for MUSCLE we found that phylogeny based guide-trees produce a small improvement over default trees for difficult alignments and a deterioration for easy alignments. Here the underlying alignment engine is more susceptible to a suboptimal guide-tree, and the quality of the alignment depends more on the choice of a good guide-tree. In their respective default modes MUSCLE and MAFFT FFT-NS-i compensate for this by iteration. On average we found that ML guide-trees are not better than default distance based guide-trees when performing a progressive alignment. This has long been suspected [37]. The argument there is that sequences with the highest identity can be aligned most accurately. However, if phylogenetic rates vary considerably among lineages, then the evolutionary neighbour may not be the nearest neighbour with respect to identity. We see evidence for this conjecture by comparing TC scores for both strategies as well as analysing the Robinson-Foulds distances.
While the differences in TC scores are small between ML and default guide-trees, there is a vast potential when compared with results for the best possible trees. It would be worthwhile to try to devise better guide-tree construction schemes, especially since contributions from the guide-tree to the alignment accuracy appear to decouple from contributions from the profile-profile alignment stage, while the overall accuracy is bound to decrease for larger numbers of sequences [38].
A structure based evaluation is only one possible angle on benchmarking, as it does not primarily test gap placement due to insertion/deletion events [39]. We could confirm that PAGAN is by far the most phylogeny-aware aligner amongst the ones considered in this study, despite being evaluated on a non-phylogeny based benchmark strategy. The other aligners displayed a similar degree of awareness in discriminating between good and bad guide-trees (evaluated on a protein structure based reference alignment), with MUSCLE being slightly more sensitive than the other three.
When grouping alignment scores according to guide-tree topology we found that chained guide-trees, on average, produce better results than balanced ones. This seems to run counter to the established wisdom of trying to balance guide-trees but can be understood when realising that chained trees have fewer sequence pairs that cross the root, so that the mean pairwise distance is less than for a balanced tree [40]. For the small numbers of sequences we analysed, we could not confirm that a randomly labeled chained guide-tree is better than the default guide-tree. However, as the number of sequences is increased from 4 to 8 and then to 16 this difference appears to decrease, and we suspect that beyond a certain number of sequences, randomly labeled chained guide-trees will be better than distance based default guide-trees; see Additional file 1: Supplement S12. This is consistent with the findings in [10], where it was observed that for the small numbers of sequences in BAliBASE 3.0 [41] randomly labeled chained trees were sometimes as good as default trees, while for more than 1,000 sequences randomly labeled chained trees were clearly better. This suggests that the greatest (and easiest) improvements of guide-tree construction may come from finding an optimum non-random labeling strategy for chained trees.
Conclusions
Alignment methods that use consistency or hidden Markov models to make alignments are less susceptible to suboptimal guide-trees than simpler methods that basically use conventional sequence alignment between profiles. The latter appear to be affected positively by evolutionary based guide-trees for difficult alignments and negatively for easy alignments. One phylogeny-aware alignment program can strongly discriminate between good and bad guide-trees. The results for randomly chained guide-trees improve with the number of sequences.
Availability of supporting data
Benchmark sequences, tree topologies, utility programs and driver scripts are available at http://www.bioinf.ucd.ie/download/BMC2014treeExploration.tar.gz.
Declarations
Acknowledgements
Funding was provided by Science Foundation Ireland to DGH through PI grant 11/PI/1034.
Authors’ Affiliations
References
1. Needleman SB, Wunsch CD: A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol. 1970, 48 (3): 443-453. doi:10.1016/0022-2836(70)90057-4
2. Feng DF, Doolittle RF: Progressive sequence alignment as a prerequisite to correct phylogenetic trees. J Mol Evol. 1987, 25 (4): 351-360. doi:10.1007/BF02603120
3. Higgins DG, Bleasby AJ, Fuchs R: CLUSTAL V: improved software for multiple sequence alignment. Comput Appl Biosci. 1992, 8 (2): 189-191. doi:10.1093/bioinformatics/8.2.189
4. Sievers F, Wilm A, Dineen D, Gibson TJ, Karplus K, Li W, Lopez R, McWilliam H, Remmert M, Söding J, Thompson JD, Higgins DG: Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega. Mol Syst Biol. 2011, 7 (539). doi:10.1038/msb.2011.75
5. Katoh K, Misawa K, Kuma K, Miyata T: MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Res. 2002, 30: 3059-3066. doi:10.1093/nar/gkf436
6. Edgar RC: MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 2004, 32 (5): 1792-1797.
7. Sneath PHA, Sokal RR: Numerical Taxonomy. The Principles and Practice of Numerical Classification. 1973, San Francisco: Freeman
8. Saitou N, Nei M: The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol Biol Evol. 1987, 4 (4): 406-425.
9. Liu K, Raghavan S, Nelesen S, Linder CR, Warnow T: Rapid and accurate large-scale coestimation of sequence alignments and phylogenetic trees. Science. 2009, 324 (5934): 1561-1564. doi:10.1126/science.1171243
10. Boyce K, Sievers F, Higgins DG: Simple chained guide trees give high quality protein multiple sequence alignments. PNAS. 2014, 111 (29): 10556-10561. doi:10.1073/pnas.1405628111
11. Barton GJ, Sternberg MJE: A strategy for the rapid multiple alignment of protein sequences: confidence levels from tertiary structure comparisons. J Mol Biol. 1987, 198 (2): 327-337. doi:10.1016/0022-2836(87)90316-0
12. Taylor WR: A flexible method to align large numbers of biological sequences. J Mol Evol. 1988, 28: 161-169.
13. Punta M, Coggill PC, Eberhardt RY, Mistry J, Tate J, Boursnell C, Pang N, Forslund K, Ceric G, Clements J, Heger A, Holm L, Sonnhammer ELL, Eddy SR, Bateman A, Finn RD: The Pfam protein families database. Nucleic Acids Res. 2012, 40: 290-301. doi:10.1093/nar/gkr717
14. Löytynoja A, Vilella AJ, Goldman N: Accurate extension of multiple sequence alignments using a phylogeny-aware graph algorithm. Bioinformatics. 2012, 28 (13): 1684-1691. doi:10.1093/bioinformatics/bts198
15. Söding J: Protein homology detection by HMM-HMM comparison. Bioinformatics. 2004, 21 (7): 951-960. doi:10.1093/bioinformatics/bti125
16. Blackshields G, Sievers F, Shi W, Wilm A, Higgins DG: Sequence embedding for fast construction of guide trees for multiple sequence alignment. Algorithm Mol Biol. 2010, 5: 21. doi:10.1186/1748-7188-5-21
17. Notredame C, Higgins DG, Heringa J: T-Coffee: a novel method for fast and accurate multiple sequence alignment. J Mol Biol. 2000, 302 (1): 205-217. doi:10.1006/jmbi.2000.4042
18. Mizuguchi K, Deane CM, Blundell TL, Overington JP: HOMSTRAD: a database of protein structure alignments for homologous families. Protein Sci. 1998, 7: 2469-2471. doi:10.1002/pro.5560071126
19. Darriba D, Taboada GL, Doallo R, Posada D: ProtTest 3: fast selection of best-fit models of protein evolution. Bioinformatics. 2011, 27: 1164-1165. doi:10.1093/bioinformatics/btr088
20. Akaike H: Information theory and an extension of the maximum likelihood principle. Proceedings of the 2nd International Symposium on Information Theory. 1973, Budapest: Akademia Kiado, 267-281.
21. Sugiura N: Further analysis of the data by Akaike’s information criterion and the finite correction. Comm Stat A-Theory Meth. 1978, 7: 13-26. doi:10.1080/03610927808827599
22. Schwarz G: Estimating the dimension of a model. Ann Stat. 1978, 6: 461-464. doi:10.1214/aos/1176344136
23. Minin V, Abdo Z, Joyce P, Sullivan J: Performance-based selection of likelihood models for phylogeny estimation. Syst Biol. 2003, 52: 674-683. doi:10.1080/10635150390235494
24. Stamatakis A: RAxML-VI-HPC: maximum likelihood-based phylogenetic analyses with thousands of taxa and mixed models. Bioinformatics. 2006, 22: 2688-2690. doi:10.1093/bioinformatics/btl446
25. Felsenstein J: PHYLIP - phylogeny inference package (version 3.2). Cladistics. 1989, 5: 164-166.
26. Rogers JS: Central moments and probability distribution of Colless’s coefficient of tree imbalance. Evolution. 1994, 48 (6): 2026-2036. doi:10.2307/2410524
27. OEIS: Double factorial of odd numbers. [http://www.oeis.org/A001147]
28. OEIS: Wedderburn-Etherington numbers (binary rooted trees). [http://www.oeis.org/A001190]
29. Sackin MJ: ‘Good’ and ‘bad’ phenograms. Syst Zool. 1972, 21: 225-226. doi:10.2307/2412292
30. Colless DH: Phylogenetics: the theory and practice of phylogenetic systematics. Syst Zool. 1982, 31: 156-169. doi:10.2307/2413034
31. Shao KT, Sokal RR: Tree balance. Syst Zool. 1990, 39 (3): 266-276. doi:10.2307/2992186
32. Pavlopoulos GA, Soldatos TG, Barbosa-Silva A, Schneider R: A reference guide for tree analysis and visualization. BioData Min. 2010, 3 (1). doi:10.1186/1756-0381-3-1
33. Larkin MA, Blackshields G, Brown NP, Chenna R, McGettigan PA, McWilliam H, Valentin F, Wallace IM, Wilm A, Lopez R, Thompson JD, Gibson TJ, Higgins DG: Clustal W and Clustal X version 2.0. Bioinformatics. 2007, 23 (21): 2947-2948. doi:10.1093/bioinformatics/btm404
34. Biro JC: Amino acid size, charge, hydropathy indices and matrices for protein structure analysis. Theor Biol Med Model. 2006, 3 (15). doi:10.1186/1742-4682-3-15
35. Blackshields G, Wallace IM, Larkin M, Higgins DG: Analysis and comparison of benchmarks for multiple sequence alignment. In Silico Biol. 2006, 6 (0030).
36. Robinson DR, Foulds LR: Comparison of phylogenetic trees. Math Biosci. 1981, 53: 131-147. doi:10.1016/0025-5564(81)90043-2
37. Edgar RC: Phylogenetic trees are not good guide trees!. [http://www.drive5.com/muscle/manual/guidevsphylo.html]
38. Sievers F, Dineen D, Wilm A, Higgins DG: Making automated multiple alignments of very large numbers of protein sequences. Bioinformatics. 2013, 29 (8): 989-995. doi:10.1093/bioinformatics/btt093
39. Löytynoja A, Goldman N: An algorithm for progressive multiple alignment of sequences with insertions. PNAS. 2005, 102: 10557-10562. doi:10.1073/pnas.0409137102
40. Ogden TH, Rosenberg MS: Multiple sequence alignment accuracy and phylogenetic inference. Syst Biol. 2006, 55 (2): 314-328. doi:10.1080/10635150500541730
41. Thompson JD, Koehl P, Ripp R, Poch O: BAliBASE 3.0: latest developments of the multiple sequence alignment benchmark. Proteins. 2005, 61 (1): 127-136. doi:10.1002/prot.20527
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.