MaxAlign: maximizing usable data in an alignment

Gouveia-Oliveira, Rodrigo; Sackett, Peter W; Pedersen, Anders G

doi:10.1186/1471-2105-8-312

Methodology article
Open access
Published: 28 August 2007

MaxAlign: maximizing usable data in an alignment

Rodrigo Gouveia-Oliveira¹,
Peter W Sackett¹ &
Anders G Pedersen¹

BMC Bioinformatics volume 8, Article number: 312 (2007) Cite this article

7151 Accesses
82 Citations
Metrics details

Abstract

Background

The presence of gaps in an alignment of nucleotide or protein sequences is often an inconvenience for bioinformatical studies. In phylogenetic and other analyses, for instance, gapped columns are often discarded entirely from the alignment.

Results

MaxAlign is a program that optimizes the alignment prior to such analyses. Specifically, it maximizes the number of nucleotide (or amino acid) symbols that are present in gap-free columns – the alignment area – by selecting the optimal subset of sequences to exclude from the alignment.

MaxAlign can be used prior to phylogenetic and bioinformatical analyses as well as in other situations where this form of alignment improvement is useful. In this work we test MaxAlign's performance in these tasks and compare the accuracy of phylogenetic estimates including and excluding gapped columns from the analysis, with and without processing with MaxAlign. In this paper we also introduce a new simple measure of tree similarity, Normalized Symmetric Similarity (NSS) that we consider useful for comparing tree topologies.

Conclusion

We demonstrate how MaxAlign is helpful in detecting misaligned or defective sequences without requiring manual inspection. We also show that it is not advisable to exclude gapped columns from phylogenetic analyses unless MaxAlign is used first. Finally, we find that the sequences removed by MaxAlign from an alignment tend to be those that would otherwise be associated with low phylogenetic accuracy, and that the presence of gaps in any given sequence does not seem to disturb the phylogenetic estimates of other sequences.

The MaxAlign web-server is freely available online at http://www.cbs.dtu.dk/services/MaxAlign where supplementary information can also be found. The program is also freely available as a Perl stand-alone package.

Background

A multiple alignment of nucleotide or protein sequences often forms the basis for phylogenetic analysis. In a perfect alignment, gaps correspond to insertion or deletion events, and as such should contain phylogenetic information on a par with substitutions. While some work has been done to make use of this type of data [1–3] there are still many unsolved issues. Additionally, gaps can also stem from misalignment, as well as from sequencing or data-management problems, in which case they obviously provide no useful information. Consequently, several bioinformatical and phylogenetic analyses are often based on alignments where gapped columns (i.e., columns containing at least one gap) have been discarded. For instance, removal of gapped columns is an option in the frequently used programs Paup [4], Paml [5] and Crann [6]. However, as the number of sequences in an alignment grows, the probability of having a gap in any given site also grows, and with it the risk of removing that site from the analysis. An alternative approach, that is sometimes used when applying maximum likelihood and other model-based methods, is to treat the gaps as unknown nucleotides (or amino acids) and sum over all the possible combinations, but this is not consensual and can become prohibitively costly for larger data sets. For some bioinformatical analyses, moreover, this alternative is not possible.

One way around this problem is to remove particularly gap-rich sequences, thereby ending up with a dataset containing more ungapped columns. This solution is of course not meaningful if the main goal of the analysis is to infer the topology of the phylogenetic tree connecting all the included taxa and one has a sufficiently long sequence. However, there are many other scenarios where the approach can be useful. For instance, it is often the case in molecular evolutionary analysis today that the focus is not on the phylogeny but on the analysis of the sequences themselves, and on properties of each position, such as the rate of evolution or the action of natural selection. In such cases keeping the sites in the analysis becomes important. The use of an automated, rapid alignment clean-up method is also clearly relevant in the case of large-scale or batch-type analyses, where phylogenies are produced from many potentially large data sets, or in the case of bioinformatical analyses not tolerant to the presence of gaps.

Results and discussion

Overview

The goal of this tool is to maximize the alignment area, defined as the number of characters that are present in gap free columns. Alignment area is thus equal to the number of sequences included in the alignment times the number of columns that have no gaps. This maximization of the alignment area is done by selectively removing sequences from the alignment. Finding the right sequences to remove, however, is not a straightforward problem (see Methods for details on the MaxAlign algorithm).

To scrutinize the performance of MaxAlign, we have analyzed very large sets of protein alignments from two different sources: (1) alignments extracted from the Pfam database[7], and (2) synthetic alignments created by simulating evolution on a set of three different trees using the program simprot [8]. The Pfam data were chosen to be representative of typical, fairly diverged biological alignments, and consisted of all the 5242 alignments from Pfam-A, release 21.0, that had between 30 and 500 sequences. The simulated data were constructed so it resembled the Pfam-data in terms of sequence divergence, and in the number and length of gaps (see below). Specifically, we analyzed the following aspects of MaxAlign behaviour: (1) performance of the heuristic version of the MaxAlign algorithm (optimality and computation time), (2) Improvement of alignment area by MaxAlign, (3) Ability of MaxAlign to remove "contaminating" (non-homologous) sequences from an alignment, (4) Impact of MaxAlign on phylogenetic accuracy (meaning how closely the reconstructed tree resembles the true tree), and (5) Impact of MaxAlign on the computation time of subsequent maximum likelihood phylogenetic analysis.

Performance of MaxAlign heuristic algorithm

MaxAlign can use one of two different algorithms: a heuristic algorithm (which is faster, but which is not guaranteed to find the best solution), and an optimal algorithm (which is an adaptation of branch-and-bound; see Methods). We tested both algorithms on the full Pfam dataset. The average runtime for the heuristic was 0.6 seconds per alignment (user time) demonstrating that use of MaxAlign is not a bottleneck for data analysis. Among the 5242 alignments, the heuristic algorithm found the optimal solution in 78% of the alignments. In only 4% of the alignments, the heuristic algorithm did not find the optimal solution. The remaining alignments either could not be improved from the start (4%), or the branch and bound algorithm could not reach a solution in 2 hours (15% of the alignments). Thus, the heuristic algorithm found the best solution in 95% of the cases in which a solution was found. Moreover, analysis of the solutions in the few cases where the heuristic was not optimal showed that the alignment area of the heuristic solution was, on average, 99.0% of the optimal solution (median: 99.6%). On the whole, we therefore recommend using the heuristic algorithm of MaxAlign.

Improving the alignment area

The alignment area of the Pfam alignments was quite variable, with a majority (52.2%) having no gap free columns (and consequently an alignment area of 0). The average number of gap free columns in the unprocessed alignments was only 19.4, and the average alignment area 1494 (see table 1). The optimization of alignment area performed by MaxAlign resulted in a 10-fold increase to an average alignment area of 16000, corresponding to an 8-fold mean increase of 143.2 gap free columns per alignment, at the average cost of only 19% of the sequences in the alignment. In a small minority (4%) of the 5242 cases it was not possible to improve the alignment area by removing sequences, and MaxAlign therefore had no effect. Thus, the majority of Pfam alignments were unusable for analyses requiring the elimination of gapped columns and became usable datasets by processing with MaxAlign.

Table 1 Changes in alignment dimensions caused by MaxAlign

Full size table

Removing non-homologous sequences

It is not rare that data sets are contaminated with non-homologous sequences. These will typically result in the generation of many gaps when the sequences are aligned. To investigate how well MaxAlign performs in removal of such contaminating sequences we therefore constructed a set of noisy datasets: for each of the 5242 original Pfam alignments we added random, non-homologous sequences (taken from other Pfam data sets) such that the final content of noise was approximately 10%. These datasets were then re-aligned using the program mafft with default parameters[9]. We then ran MaxAlign (heuristic algorithm) on the resulting alignments and analyzed the performance in terms of removal of contaminating sequences. We found that, taken over all 5242 alignments, MaxAlign performed very well, removing on average 87% of the non-homologous sequences per alignment (median removal success: 96%). In 46% of all cases, MaxAlign was able to remove all non-homologous sequences from the alignment.

Impact on phylogenetic analysis

Accuracy

It is an unsolved issue in phylogenetic analysis how to best deal with gaps. We touched upon this topic by investigating how MaxAlign and removal of gapped columns interfere with phylogenetic tree inference. To do so, we first simulated 1000 alignments along each of 3 trees of different shapes. From each of these synthetic alignments, we then constructed three derived data sets by (a) removing gapped columns, (b) applying MaxAlign, and (c) applying MaxAlign and removing gapped columns. For each of the resulting 12,000 synthetic alignments (3,000 original plus 9,000 derived ones), we then performed maximum likelihood phylogenetic analysis using the program phyml. The accuracy of phylogenetic reconstruction was then analyzed by calculating the normalized symmetric tree similarity (see Methods) between the true tree (used to simulate the alignments) on one hand, and the four trees estimated by phylogenetic analysis on the other. These tree similarities were computed based on the subset of taxa shared by the original and maxaligned datasets. For the trees inferred from the original (simulated) alignment and from the original alignment excluding gaps, we also computed the similarity to the true tree based on the full set of taxa. The outcome of this comparison can be found both in figure 1 and table 2.

Table 2 Effects of MaxAlign and removal of gapped columns on phylogenetic accuracy

Full size table

When measuring phylogenetic accuracy on the subset of taxa that remain after MaxAlign processing, it can be seen that the highest phylogenetic accuracy was achieved by using all the sequences available without discarding gapped columns (figure 1, all rows but 2^nd and last). However, it can also be seen that applying MaxAlign decreases the accuracy only very slightly (figure 1, compare "Original" and "MaxAlign"). Moreover, if one decides to discard columns with gaps, then using MaxAlign is clearly the best option (figure 1, compare "Removed Gaps" with "MaxAlign Removed Gaps").

If the accuracy of trees (based on non-MaxAligned data) is measured on the full set of taxa, then a different trend is apparent: especially in trees 1 and 2, it can be seen that processing the alignment with MaxAlign increases the accuracy of the phylogeny (figure 1, accuracy on full set of taxa is shown in bottom two rows; compare "Original Whole Tree" to "MaxAlign"). The reason for this phenomenon is that MaxAlign predominantly removes sequences with many gaps. These will necessarily also be the ones having the fewest amino acids, and therefore the ones associated with the highest phylogenetic uncertainty. It is important to note that in phylogenetic analyses that measure support for each branch, such as Bayesian analysis or Bootstrap, the position of these taxa would show up with low support values. It should also be noted that the presence of gaps in some sequences does not seem to disturb phylogenetic inference on the remaining sequences (figure 1, compare "Original" to "Original Whole Tree"). It should be noted that we have not investigated the additional impact of alignment error on this, since we directly use the simulated (and therefore perfect) alignments themselves.

Computation time

One reason for discarding gapped columns is that summing over all possible missing values is a time consuming step in a maximum likelihood calculation. We have evaluated how much time is saved by using MaxAlign with and without discarding columns with gaps. The results are presented in table 3.

Table 3 Runtimes of the phylogenetic analysis

Full size table

It can be seen that either using MaxAlign or removing columns with gaps nearly halves the computer time needed to find the phylogenetic tree. Removing columns with gaps does so by preventing the summation over unknown characters as well as reducing the number of sites to be included in the analysis. MaxAlign diminishes the number of summations required and also reduces the size of the dataset, by including fewer sequences.

The computer time required for these phylogenetic analyses was particularly short, as we used a very fast program (phyml), and the alignments contained few sequences. In situations using more time-consuming analyses and software, as well as larger alignments, these relative differences in computing time will have an increased impact.

Conclusion

In the present work, we have shown how MaxAlign is helpful in "cleaning up" big alignments, and demonstrated how it may be used in connection with phylogenetic analysis. In the analysis presented above, we found that the use of MaxAlign more than halved the running time of the analysis even when gapped columns were not excluded. When gapped columns were discarded from the alignment, the output was of much higher quality if MaxAlign was first used to pre-process the data, as the number of columns included in the analysis increased immensely. We have also shown that in the conditions tested, the accuracy of the phylogenetic estimate of the tree topology increases if one includes the gapped columns. The sequences removed by MaxAlign were predominantly those associated with the highest degree of uncertainty regarding their placement in the final trees.

The use of MaxAlign prior to bioinformatical and phylogenetic analyses optimizes the number of nucleotides or amino acids present in ungapped alignment columns (the "alignment area"), thereby increasing the amount of useful data and/or the speed of the analysis. Moreover, MaxAlign is also very helpful in detecting misaligned or defective sequences without requiring manual inspection. Given its very short running times and ease of use, it can without difficulty be used as a step in both automated and manual phylogenetic and bioinformatical analyses.

Methods

MaxAlign

MaxAlign maximizes the alignment area of an alignment by removing selected sequences. It has two algorithms to do so: one heuristic and one branch-and-bound. Both are based on the concept of gap pattern. In the example alignment shown in figure 2, one can see that the first three columns have the same gap pattern, namely the presence of gaps in the first two lines (corresponding to sequences A and B). All columns sharing that gap pattern would yield ungapped columns if sequences A and B were removed. For every gap pattern there is therefore a set of sequences requiring removal to produce ungapped columns. The heuristic algorithm works by first determining the sequence set for each gap pattern and then iteratively removing the set of sequences that most increases the alignment area per sequence removed, until it reaches the solution with the highest absolute alignment area (shown in figure 2b). The heuristic algorithm is always used in MaxAlign. The branch-and-bound algorithm, which is guaranteed to find the best possible solution at the expense of a longer running time, can optionally be used to check if the heuristic algorithm has found the optimal solution.

MaxAlign has several additional features. For example, it is possible to select a subset of sequences that should be kept in the final alignment, even if their removal would yield an improvement of the alignment area. This is done by simply adding a plus sign ("+") in front of the sequence name in the fasta file. In this way, the user can preserve key sequences in his/her analysis. Also, users interested in codon-based analyses can run MaxAlign on protein sequences and obtain their output alignments both in protein and nucleotide format (as long as they also provide the corresponding nucleotide sequences).

The server is freely accessible at [10] where a stand-alone Perl version can also be downloaded free of charge [see Additional file 1]. The input to MaxAlign is an alignment in fasta format (two in the case of codon-based analyses). The main output is an alignment (in fasta format) consisting of the set of sequences that maximizes the alignment area. The program also lists the sequences that were included and excluded from the original alignment, and furthermore provides a report on the resulting improvement.

Improving Alignment area

To test the relevance of MaxAlign in real biological problems, the Pfam-A Release 21.0 of full alignments [7] was used. All the 5242 alignments comprising between 30 to 500 sequences were submitted to MaxAlign. The mean number of sequences in this set of alignments was 140 (median:90). The alignment area and number of ungapped columns was recorded, before and after MaxAlign. In this analysis, as in all of the following unless otherwise stated, the heuristic algorithm was used.

Removal of non-homologous sequences

To test the ability of MaxAlign in removing non-homologous sequences from an alignment, we added noise sequences to each of the 5242 above-mentioned Pfam alignments in a ratio of 1:10. Thus, the alignments ended up containing between 33 to 550 sequences. These alignments were submitted to MaxAlign and the fraction of noise sequences removed was recorded. For each alignment, non-homologous sequences were chosen randomly from the rest of Pfam.