MaxAlign: maximizing usable data in an alignment
© Gouveia-Oliveira et al. 2007
Received: 30 October 2006
Accepted: 28 August 2007
Published: 28 August 2007
Skip to main content
© Gouveia-Oliveira et al. 2007
Received: 30 October 2006
Accepted: 28 August 2007
Published: 28 August 2007
The presence of gaps in an alignment of nucleotide or protein sequences is often an inconvenience for bioinformatical studies. In phylogenetic and other analyses, for instance, gapped columns are often discarded entirely from the alignment.
MaxAlign is a program that optimizes the alignment prior to such analyses. Specifically, it maximizes the number of nucleotide (or amino acid) symbols that are present in gap-free columns – the alignment area – by selecting the optimal subset of sequences to exclude from the alignment.
MaxAlign can be used prior to phylogenetic and bioinformatical analyses as well as in other situations where this form of alignment improvement is useful. In this work we test MaxAlign's performance in these tasks and compare the accuracy of phylogenetic estimates including and excluding gapped columns from the analysis, with and without processing with MaxAlign. In this paper we also introduce a new simple measure of tree similarity, Normalized Symmetric Similarity (NSS) that we consider useful for comparing tree topologies.
We demonstrate how MaxAlign is helpful in detecting misaligned or defective sequences without requiring manual inspection. We also show that it is not advisable to exclude gapped columns from phylogenetic analyses unless MaxAlign is used first. Finally, we find that the sequences removed by MaxAlign from an alignment tend to be those that would otherwise be associated with low phylogenetic accuracy, and that the presence of gaps in any given sequence does not seem to disturb the phylogenetic estimates of other sequences.
The MaxAlign web-server is freely available online at http://www.cbs.dtu.dk/services/MaxAlign where supplementary information can also be found. The program is also freely available as a Perl stand-alone package.
A multiple alignment of nucleotide or protein sequences often forms the basis for phylogenetic analysis. In a perfect alignment, gaps correspond to insertion or deletion events, and as such should contain phylogenetic information on a par with substitutions. While some work has been done to make use of this type of data [1–3] there are still many unsolved issues. Additionally, gaps can also stem from misalignment, as well as from sequencing or data-management problems, in which case they obviously provide no useful information. Consequently, several bioinformatical and phylogenetic analyses are often based on alignments where gapped columns (i.e., columns containing at least one gap) have been discarded. For instance, removal of gapped columns is an option in the frequently used programs Paup , Paml  and Crann . However, as the number of sequences in an alignment grows, the probability of having a gap in any given site also grows, and with it the risk of removing that site from the analysis. An alternative approach, that is sometimes used when applying maximum likelihood and other model-based methods, is to treat the gaps as unknown nucleotides (or amino acids) and sum over all the possible combinations, but this is not consensual and can become prohibitively costly for larger data sets. For some bioinformatical analyses, moreover, this alternative is not possible.
One way around this problem is to remove particularly gap-rich sequences, thereby ending up with a dataset containing more ungapped columns. This solution is of course not meaningful if the main goal of the analysis is to infer the topology of the phylogenetic tree connecting all the included taxa and one has a sufficiently long sequence. However, there are many other scenarios where the approach can be useful. For instance, it is often the case in molecular evolutionary analysis today that the focus is not on the phylogeny but on the analysis of the sequences themselves, and on properties of each position, such as the rate of evolution or the action of natural selection. In such cases keeping the sites in the analysis becomes important. The use of an automated, rapid alignment clean-up method is also clearly relevant in the case of large-scale or batch-type analyses, where phylogenies are produced from many potentially large data sets, or in the case of bioinformatical analyses not tolerant to the presence of gaps.
The goal of this tool is to maximize the alignment area, defined as the number of characters that are present in gap free columns. Alignment area is thus equal to the number of sequences included in the alignment times the number of columns that have no gaps. This maximization of the alignment area is done by selectively removing sequences from the alignment. Finding the right sequences to remove, however, is not a straightforward problem (see Methods for details on the MaxAlign algorithm).
To scrutinize the performance of MaxAlign, we have analyzed very large sets of protein alignments from two different sources: (1) alignments extracted from the Pfam database, and (2) synthetic alignments created by simulating evolution on a set of three different trees using the program simprot . The Pfam data were chosen to be representative of typical, fairly diverged biological alignments, and consisted of all the 5242 alignments from Pfam-A, release 21.0, that had between 30 and 500 sequences. The simulated data were constructed so it resembled the Pfam-data in terms of sequence divergence, and in the number and length of gaps (see below). Specifically, we analyzed the following aspects of MaxAlign behaviour: (1) performance of the heuristic version of the MaxAlign algorithm (optimality and computation time), (2) Improvement of alignment area by MaxAlign, (3) Ability of MaxAlign to remove "contaminating" (non-homologous) sequences from an alignment, (4) Impact of MaxAlign on phylogenetic accuracy (meaning how closely the reconstructed tree resembles the true tree), and (5) Impact of MaxAlign on the computation time of subsequent maximum likelihood phylogenetic analysis.
MaxAlign can use one of two different algorithms: a heuristic algorithm (which is faster, but which is not guaranteed to find the best solution), and an optimal algorithm (which is an adaptation of branch-and-bound; see Methods). We tested both algorithms on the full Pfam dataset. The average runtime for the heuristic was 0.6 seconds per alignment (user time) demonstrating that use of MaxAlign is not a bottleneck for data analysis. Among the 5242 alignments, the heuristic algorithm found the optimal solution in 78% of the alignments. In only 4% of the alignments, the heuristic algorithm did not find the optimal solution. The remaining alignments either could not be improved from the start (4%), or the branch and bound algorithm could not reach a solution in 2 hours (15% of the alignments). Thus, the heuristic algorithm found the best solution in 95% of the cases in which a solution was found. Moreover, analysis of the solutions in the few cases where the heuristic was not optimal showed that the alignment area of the heuristic solution was, on average, 99.0% of the optimal solution (median: 99.6%). On the whole, we therefore recommend using the heuristic algorithm of MaxAlign.
Changes in alignment dimensions caused by MaxAlign
Number of gap-free columns
Number of sequences
It is not rare that data sets are contaminated with non-homologous sequences. These will typically result in the generation of many gaps when the sequences are aligned. To investigate how well MaxAlign performs in removal of such contaminating sequences we therefore constructed a set of noisy datasets: for each of the 5242 original Pfam alignments we added random, non-homologous sequences (taken from other Pfam data sets) such that the final content of noise was approximately 10%. These datasets were then re-aligned using the program mafft with default parameters. We then ran MaxAlign (heuristic algorithm) on the resulting alignments and analyzed the performance in terms of removal of contaminating sequences. We found that, taken over all 5242 alignments, MaxAlign performed very well, removing on average 87% of the non-homologous sequences per alignment (median removal success: 96%). In 46% of all cases, MaxAlign was able to remove all non-homologous sequences from the alignment.
Effects of MaxAlign and removal of gapped columns on phylogenetic accuracy
Removal of gaps
Set of taxa used for comparison
When measuring phylogenetic accuracy on the subset of taxa that remain after MaxAlign processing, it can be seen that the highest phylogenetic accuracy was achieved by using all the sequences available without discarding gapped columns (figure 1, all rows but 2nd and last). However, it can also be seen that applying MaxAlign decreases the accuracy only very slightly (figure 1, compare "Original" and "MaxAlign"). Moreover, if one decides to discard columns with gaps, then using MaxAlign is clearly the best option (figure 1, compare "Removed Gaps" with "MaxAlign Removed Gaps").
If the accuracy of trees (based on non-MaxAligned data) is measured on the full set of taxa, then a different trend is apparent: especially in trees 1 and 2, it can be seen that processing the alignment with MaxAlign increases the accuracy of the phylogeny (figure 1, accuracy on full set of taxa is shown in bottom two rows; compare "Original Whole Tree" to "MaxAlign"). The reason for this phenomenon is that MaxAlign predominantly removes sequences with many gaps. These will necessarily also be the ones having the fewest amino acids, and therefore the ones associated with the highest phylogenetic uncertainty. It is important to note that in phylogenetic analyses that measure support for each branch, such as Bayesian analysis or Bootstrap, the position of these taxa would show up with low support values. It should also be noted that the presence of gaps in some sequences does not seem to disturb phylogenetic inference on the remaining sequences (figure 1, compare "Original" to "Original Whole Tree"). It should be noted that we have not investigated the additional impact of alignment error on this, since we directly use the simulated (and therefore perfect) alignments themselves.
Runtimes of the phylogenetic analysis
Removal of gaps
It can be seen that either using MaxAlign or removing columns with gaps nearly halves the computer time needed to find the phylogenetic tree. Removing columns with gaps does so by preventing the summation over unknown characters as well as reducing the number of sites to be included in the analysis. MaxAlign diminishes the number of summations required and also reduces the size of the dataset, by including fewer sequences.
The computer time required for these phylogenetic analyses was particularly short, as we used a very fast program (phyml), and the alignments contained few sequences. In situations using more time-consuming analyses and software, as well as larger alignments, these relative differences in computing time will have an increased impact.
In the present work, we have shown how MaxAlign is helpful in "cleaning up" big alignments, and demonstrated how it may be used in connection with phylogenetic analysis. In the analysis presented above, we found that the use of MaxAlign more than halved the running time of the analysis even when gapped columns were not excluded. When gapped columns were discarded from the alignment, the output was of much higher quality if MaxAlign was first used to pre-process the data, as the number of columns included in the analysis increased immensely. We have also shown that in the conditions tested, the accuracy of the phylogenetic estimate of the tree topology increases if one includes the gapped columns. The sequences removed by MaxAlign were predominantly those associated with the highest degree of uncertainty regarding their placement in the final trees.
The use of MaxAlign prior to bioinformatical and phylogenetic analyses optimizes the number of nucleotides or amino acids present in ungapped alignment columns (the "alignment area"), thereby increasing the amount of useful data and/or the speed of the analysis. Moreover, MaxAlign is also very helpful in detecting misaligned or defective sequences without requiring manual inspection. Given its very short running times and ease of use, it can without difficulty be used as a step in both automated and manual phylogenetic and bioinformatical analyses.
MaxAlign has several additional features. For example, it is possible to select a subset of sequences that should be kept in the final alignment, even if their removal would yield an improvement of the alignment area. This is done by simply adding a plus sign ("+") in front of the sequence name in the fasta file. In this way, the user can preserve key sequences in his/her analysis. Also, users interested in codon-based analyses can run MaxAlign on protein sequences and obtain their output alignments both in protein and nucleotide format (as long as they also provide the corresponding nucleotide sequences).
The server is freely accessible at  where a stand-alone Perl version can also be downloaded free of charge [see Additional file 1]. The input to MaxAlign is an alignment in fasta format (two in the case of codon-based analyses). The main output is an alignment (in fasta format) consisting of the set of sequences that maximizes the alignment area. The program also lists the sequences that were included and excluded from the original alignment, and furthermore provides a report on the resulting improvement.
To test the relevance of MaxAlign in real biological problems, the Pfam-A Release 21.0 of full alignments  was used. All the 5242 alignments comprising between 30 to 500 sequences were submitted to MaxAlign. The mean number of sequences in this set of alignments was 140 (median:90). The alignment area and number of ungapped columns was recorded, before and after MaxAlign. In this analysis, as in all of the following unless otherwise stated, the heuristic algorithm was used.
To test the ability of MaxAlign in removing non-homologous sequences from an alignment, we added noise sequences to each of the 5242 above-mentioned Pfam alignments in a ratio of 1:10. Thus, the alignments ended up containing between 33 to 550 sequences. These alignments were submitted to MaxAlign and the fraction of noise sequences removed was recorded. For each alignment, non-homologous sequences were chosen randomly from the rest of Pfam.
Data on the simulated datasets
Average sequence identity
Original number of sequences
Average number of sequences after MaxAlign
Average number of indels per sequence
Average length of indels
Each simulated alignment gave rise to 4 derived datasets: the alignment after processing with MaxAlign, but keeping the gapped columns, the alignment after processing with MaxAlign and removal of gapped columns, the original alignment with the gapped columns removed and the original alignment without any processing.
These datasets were then used as the basis for reconstructing the tree topology. This was done by using phyml  with the WAG substitution model and having the evolutionary rate among sites modelled by a gamma distribution discretized into 4 categories, with the shape parameter being estimated from the data. The tree topologies found were then compared among the different datasets, using Normalized Symmetric Tree Similarity (see below). As MaxAlign removes sequences from the alignment, the resulting trees have a smaller set of leaves. We therefore compared the correct tree against the four tree topologies resulting from the phylogenetic analyses using only the subset of taxa that remained after MaxAlign processing. For the two trees based on data that had not been processed by MaxAlign, we furthermore compared to the true tree using the full set of taxa. Statistical significance of differences between similarities was assessed using a double-sided t-test at 5%.
All Pfam alignments having between 30–100 sequences (a total of 2416) were included in this analysis, which consisted of measuring the average duration of phylogenetic analyses, in a way very similar to the one described above, apart from using real alignments instead of simulated alignments (same dataset generation, same phylogenetic analysis). We did not analyze computation time on the original datasets (those not processed by MaxAlign) with removal of gapped columns, as the majority of these alignments had zero columns available for the analysis.
To compare the performance of the two algorithms available in MaxAlign, we applied them to the full set of 5242 Pfam alignments described above and evaluated the solutions in terms of time spent and changes in alignment area.
The degree ofdissimilarity between two phylogenetic trees can be quantified using a number of different measures of tree distance. The most widely used measures are probably the symmetric distance, the quartets distance , and the branch score distance . The two former measures focus on tree topology, while the latter also uses branch length information. Here, we propose a simple measure of (topological) tree similarity that is based on the symmetric distance of Robinson and Foulds, and that can be used to compare tree similarities measured on different sets of taxa.
Any branch in a phylogenetic tree can be thought of as defining a bipartition where the leaves are divided into those present on one side of the branch and those present on the other side of the branch. The symmetric distance of Robinson and Foulds is simply the number of bipartitions (branches) in tree 1 that are not present in tree 2 plus the number of bipartitions in tree 2 that are not present in tree 1. This is an intuitively understandable measure of treedissimilarity that is fairly rapid to compute.
We now propose a normalized version of this distance measure: simply divide the symmetric distance between two trees by the total number of bipartitions present in the two trees. (The total number of bipartitions in the two trees is used for normalization since it is the maximum possible symmetric distance). This "Normalized Symmetric Distance" is zero when trees are identical and 1.0 when they have absolutely no bipartitions in common. From the Normalized Symmetric Distance (NSD), the corresponding "Normalized Symmetric Similarity" (NSS) can be computed simply as NSS = 1.0 – NSD. This similarity measure is 1.0 for identical trees and zero for trees without any common bipartitions.
RGO thankfully acknowledges grant SFRH/BD/12448/2003 from the Foundation for Science and Technology from the Portuguese Ministry of Science.
The authors would like to thank the Danish Center for Scientific Computing and Kristoffer Rapacki for helping on putting up the web server.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.