SuiteMSA: visual tools for multiple sequence alignment comparison and molecular sequence simulation
© Anderson et al; licensee BioMed Central Ltd. 2011
Received: 7 February 2011
Accepted: 21 May 2011
Published: 21 May 2011
Multiple sequence alignment (MSA) plays a central role in nearly all bioinformatics and molecular evolutionary applications. MSA reconstruction is thus one of the most heavily scrutinized bioinformatics fields. Evaluating the quality of MSA reconstruction is often hindered by the lack of good reference MSAs. Sequence evolution simulation can provide such reference MSAs. Furthermore, none of the MSA viewing/editing programs currently available allows the user to make direct comparisons between two or more MSAs. Considering the importance of MSA quality in a wide range of research, MSA assessment should be easier to perform.
We have developed SuiteMSA, a Java-based application that provides unique MSA viewers. Users can directly compare multiple MSAs and evaluate where the MSAs agree (are consistent) or disagree (are inconsistent). Several alignment statistics are provided to assist such comparisons. SuiteMSA also includes a graphical phylogeny editor/viewer as well as a graphical user interface for a sequence evolution simulator that can be used to construct reference MSAs.
SuiteMSA gives researchers easy access to a sequence evolution simulator, reference alignments generated by the simulator, and a series of tools to evaluate the performance of MSA reconstruction programs. It will help improve the quality of MSAs, which are often the most important first step of bioinformatics and other biological research.
Multiple sequence alignment (MSA) plays a central role in nearly all bioinformatics and molecular evolutionary applications. Be it to discover sequence structure and motifs or to infer the evolutionary history among sequences (phylogeny), the first step is to compare the sequences by building MSAs. Building an MSA means inferring homologous positions between the input sequences and placing gaps in the sequences so that these homologous positions are aligned. These gaps represent evolutionary events of their own. Gaps (also called indels) are caused by either insertions or deletions of characters (nucleotides or amino acids) on a particular lineage of sequences during evolution. In this sense, building an MSA amounts to reconstructing the evolutionary history of the sequences involved.
Due to its significant impact on many bioinformatics and molecular evolutionary analyses, MSA reconstruction is one of the most heavily scrutinized bioinformatics fields. Numerous MSA reconstruction methods have been developed. Assessment of MSAs, however, is usually reserved for power users. Regular users often simply run one MSA method and proceed directly to the next analysis without examining the alignment output. Considering the importance of MSAs, quality assessment of MSA methods should be easier and more intuitive for all researchers who are interested in sequence analysis. A number of available programs generate, display, and/or let users analyze MSAs, such as SeaView, ClustalX2, Se-Al, Jalview, webPRANK, and MEGA. However, none of these programs allows the user to make direct comparisons between two or more MSAs. SinicView can visualize multiple MSAs. It is, however, targeted at genome-scale nucleotide alignments, and position-by-position comparison among MSAs is not possible. As Morrison [9, 10] also pointed out, visual inspection of multiple MSAs would greatly help improve the quality of MSAs and consequently the reconstruction of phylogenies.
Effective evaluation of MSA methods requires reference alignments. These are the MSAs that are considered to represent the evolutionary history of the sequences most accurately. The majority of currently available benchmark MSA datasets are based on structural alignments of real sequences (e.g., PREFAB, OXBench, HOMSTRAD, BAliBASE, SABmark; also see Edgar for some issues with these benchmark datasets), where the actual evolutionary history is unknown. Researchers, especially those very familiar with their sequences, often adjust MSAs manually. This introduces several issues. There is no "standard" way to adjust or improve an alignment. Manual adjustment is very time consuming, and alignments often cannot be fully resolved. Hillis offered a solution to these issues, pointing to sequence evolution simulation as an alternative method to obtain reference MSAs and analyze MSA algorithms. Sequence evolution simulation methods generate a set of related nucleotide or amino acid sequences with a known evolutionary history, i.e., they provide a fully resolved MSA. The datasets generated by simulation, with various evolutionary parameter settings, are also useful for evaluating the robustness, consistency, and efficiency of phylogenetic reconstruction based on different MSA methods. The disadvantage of using simulated sequences, however, is that the events during the simulated evolution are limited by the evolutionary models available in the simulators. One must thus choose a simulator that can mimic the evolutionary history of the gene or protein sequences of interest.
Many molecular evolution simulation programs are currently available: e.g., INDELible, Rose, DAWG, MySSP, SIMPROT, EvolveAGene3, and indel-Seq-Gen version 2.1 (iSGv2.1). Rose has been used to generate the IRMBASE 2 and DIRMBASE benchmark alignment datasets. All of these programs require several input files and run on the command line. One exception is MySSP, which is run from a simple graphical user interface (GUI). Of the available simulation programs, iSGv2.1 is the most versatile and complex. It allows subsequences or sites to evolve under less stringent assumptions, i.e., it relaxes the assumption of independent and identically distributed sequence sites, which is prevalent in the field of molecular evolution simulation. iSGv2.1 can thus generate more realistic protein and gene families. Such complex and more realistic simulation, however, requires detailed input files and numerous command-line options.
We introduce SuiteMSA, a suite of graphical tools for MSA comparison that also encapsulates the sequence evolution simulation program iSGv2.1. SuiteMSA offers tools that allow for the direct comparison of multiple MSAs. These tools help researchers visually pinpoint the areas where alternative MSAs are inconsistent with a reference MSA, which can be an MSA obtained from a benchmark MSA database, a manually curated MSA, or a true MSA based on simulated sequences. Statistics to aid the quantitative comparison of MSAs are provided. SuiteMSA also allows users of any level to simulate biological sequence evolution. With intuitive option panels, users can quickly set up an evolutionary model for simulation. After the simulation, SuiteMSA displays indel events and maps them to the true MSA and to the simulation guide tree. This immediate feedback is useful for inspecting the simulated datasets, allowing the user to choose the set of simulation parameters that best produces datasets with the desired features. Providing both sequence simulation and MSA assessment also helps users understand how various MSA methods behave differently when biological sequences have different evolutionary properties.
A case study: the lipocalin protein superfamily
The MSA Comparator allows the user to perform a fine-grained comparison between two alignments. Figure 2B compares the reference MSA of the lipocalin superfamily proteins (shown also in Figure 2A using the single alignment viewer, MSA Viewer) with the alignment reconstructed by ClustalW2 v2.1. Alignment positions under the selection and range bars are color-coded for consistency with respect to the reference MSA. Characters in consistently aligned columns are colored blue, and those in inconsistently aligned columns are colored red. In Figure 2B, for example, the highly conserved area surrounding position 29 of the reference alignment is consistent between the two MSAs and colored blue, whereas after position 40 the MSAs are inconsistent and so colored red. The column-wise Sum-of-Pairs Score (SPS) is also displayed as a bar chart in Figure 2B, with maximum-height bars shown for consistent columns (positions 26 - 39). The SPS shows the degree of consistency per column between the two alignments. For a detailed description of these measures, see the SuiteMSA user's manual.
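To make the idea of a per-column consistency score concrete, the following is a minimal Python sketch of one plausible column-wise sum-of-pairs formulation: for each reference column, the fraction of residue pairs in that column that are also placed in a common column of the test alignment. The function names, the gap symbol "-", and this exact formulation are assumptions made for illustration; the SuiteMSA user's manual defines the measure actually used by the program.

```python
from itertools import combinations


def residue_columns(row, gap="-"):
    """Return, for each residue of an aligned row (gaps skipped), its column index."""
    return [col for col, ch in enumerate(row) if ch != gap]


def columnwise_sps(reference, test, gap="-"):
    """Per reference column: fraction of residue pairs also aligned together in `test`.

    Both alignments are lists of equal-length strings holding the same
    sequences in the same order. Columns with fewer than two residues
    have no pairs and are reported as None.
    """
    n_seqs = len(reference)
    # test_cols[s][k] = test column that holds the k-th residue of sequence s
    test_cols = [residue_columns(row, gap) for row in test]
    seen = [0] * n_seqs  # residues of each sequence consumed so far
    scores = []
    for col in range(len(reference[0])):
        placed = []
        for s in range(n_seqs):
            if reference[s][col] != gap:
                placed.append(test_cols[s][seen[s]])
                seen[s] += 1
        pairs = list(combinations(placed, 2))
        if not pairs:
            scores.append(None)
        else:
            scores.append(sum(a == b for a, b in pairs) / len(pairs))
    return scores


# Toy example (hypothetical sequences, not the lipocalin data)
reference = ["AC-GT", "ACGGT", "A--GT"]
test = ["ACG-T", "ACGGT", "A-G-T"]
print(columnwise_sps(reference, test))
```

Under this formulation, a score of 1.0 corresponds to a fully consistent column (a maximum-height bar), and lower values indicate that residues of the reference column were split across several columns of the test alignment.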
The Pixel Plot allows for a quick comparison between multiple MSAs. As shown in Figure 3, each character in the MSA is represented as a solid colored pixel and each gap as a blank pixel. In Figure 3, the reference alignment of the lipocalin superfamily proteins (at the top) is compared with the three MSAs reconstructed by ClustalW2 v2.1, MAFFT v6.843, and MUSCLE v3.8.31. The selected characters for the reference alignment (MSA 1) under the blue selection bar and the corresponding characters for the reconstructed alignments (MSAs 2-4) under the green range bars are colored in magenta. This is the same area as selected in Figure 2B.
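The character-versus-gap encoding behind such a plot can be sketched in a few lines of Python. This is only an illustration of the idea, not SuiteMSA's rendering code: the function names are hypothetical, the gap symbol "-" is assumed, and a text grid stands in for the colored pixels (with "." used for blank pixels so that they remain visible).

```python
def pixel_grid(msa, gap="-"):
    """Return a grid of booleans: True for a character, False for a gap."""
    return [[ch != gap for ch in row] for row in msa]


def render(msa, gap="-", filled="#", blank="."):
    """Render an alignment as text, one symbol per alignment cell."""
    return "\n".join(
        "".join(filled if is_char else blank for is_char in row)
        for row in pixel_grid(msa, gap)
    )


# Two toy alignments of the same sequences, shown one above the other
msa_1 = ["AC-GT", "ACGGT", "A--GT"]
msa_2 = ["ACG-T", "ACGGT", "A-G-T"]
print(render(msa_1))
print()
print(render(msa_2))
```

Stacking the rendered grids makes differences in gap placement visible at a glance, which is the purpose of comparing MSAs 1-4 in a single plot.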
Simulating members of the lipocalin protein superfamily represents a challenge for many simulators because (i) due to the short length of the lipocalin proteins, each of the 19 subsequences (eight beta-strands, one alpha-helix, and ten coil regions) has a strict length constraint and (ii) all members of the family must contain the conserved motif (PROSITE PS00213 [33, 34]) near the first beta-strand. In this section, we set up options for iSGv2.1 for the lipocalin family simulation. The parameters were chosen by the following procedure:
The phylogeny reconstructed by Sánchez et al. was used as the simulation guide tree.
The alignment presented in Sánchez et al. was used as the root MSA.
We analyzed Sánchez et al.'s alignment with the PROTTEST Web server [35–38], supplying the guide tree topology. The model that best fit the data was the WAG substitution matrix with the Gamma distribution (alpha = 3.88). The amino acid frequencies as well as the branch lengths for the phylogeny were also estimated by PROTTEST.
We estimated the indel parameters from the reference alignment and guide tree using the lambda.pl program from the DAWG package. The program returned a geometric length distribution with an average length of 6.97 and an indel probability of 0.0516702 per substitution. We assumed the maximum length of an indel to be 20 amino acids.
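As an illustration of the indel length model in the last step, the sketch below samples indel lengths from a geometric distribution capped at 20 residues. It assumes the geometric distribution is defined on lengths 1, 2, 3, ... with success probability p = 1/mean, and that the cap is applied by renormalizing the truncated distribution; how lambda.pl and iSGv2.1 parameterize and truncate the distribution is not stated here, so both choices are assumptions, and the function names are hypothetical.

```python
import random


def truncated_geometric_pmf(mean_length, max_length):
    """Probabilities of lengths 1..max_length for a renormalized geometric law.

    Assumes P(L = k) is proportional to (1 - p)**(k - 1) * p with p = 1 / mean_length.
    """
    p = 1.0 / mean_length
    weights = [(1.0 - p) ** (k - 1) * p for k in range(1, max_length + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def sample_indel_lengths(n, mean_length=6.97, max_length=20, seed=0):
    """Draw n indel lengths from the truncated distribution (seeded for repeatability)."""
    rng = random.Random(seed)
    pmf = truncated_geometric_pmf(mean_length, max_length)
    return rng.choices(range(1, max_length + 1), weights=pmf, k=n)


lengths = sample_indel_lengths(10000)
print(min(lengths), max(lengths), sum(lengths) / len(lengths))
```

Note that truncating at 20 residues lowers the realized mean length below the untruncated value of 6.97, which is worth keeping in mind when comparing simulated indels with the estimated parameters.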
Prior to running a simulation, SuiteMSA checks for potential parameter conflicts. The actual command line used to run iSGv2.1 with all necessary options is shown at the top of the iSG Simulator window, as illustrated in Figure 4. The simulation log file saves the parameters used along with any messages from iSGv2.1. This log can also be useful for retrieving the saved iSGv2.1 command line for later use.
Graphical interface for MSA methods
Results and Discussion
As described above, the performance of MSA methods can be examined against a reference MSA. A reference MSA can be obtained from a benchmark MSA database or by manually adjusting an MSA based on one's own experience and knowledge of the sequences of interest. Alternatively, a sequence simulator can generate a "true" MSA under a given evolutionary model. In the previous section, we used the lipocalin superfamily proteins as a case study and showed how members of such a complex protein family can be simulated. Simulated protein sequences were aligned using different MSA methods. In this section, as a further example, we briefly discuss how these reconstructed MSAs compare with the "true" MSA obtained from the simulation as well as with the manually adjusted alignment produced by Sánchez et al.
In Figure 2B, Sánchez et al.'s alignment of lipocalin proteins is used as the reference alignment (at the top) and compared with the alignment reconstructed by ClustalW2. In this alignment, the area containing the PROSITE motif (positions 21 through 34 in the reference alignment) is mostly colored blue, showing a high degree of consistency. The first five positions of the motif are not consistent (shown in red) due to the gaps inserted in the ClustalW2 MSA, although the motif region as a whole maintains high SPS values. In contrast, the characters in positions 40 through 44 of the reference alignment are scattered over 17 columns in the ClustalW2 MSA and colored red, with 0 or very low SPS values. We expanded the comparison to include MSAs reconstructed by MUSCLE and MAFFT using the Pixel Plot. As shown in Figure 3, the MSAs from the three methods (MSAs 2-4) are consistent with the reference MSA (MSA 1) at the left edge of the conserved motif region, indicated by the nearly straight edge marked in magenta in all alignments. However, there is a high degree of inconsistency in the downstream section between the reference alignment and the reconstructed MSAs, and even among the three reconstructed MSAs.
SuiteMSA provides unique MSA viewers, which allow researchers to quickly identify inconsistencies among MSAs reconstructed by different techniques and thus assist in evaluating the performance of MSA methods. SuiteMSA also allows users to perform sequence simulation. This further assists comparative analysis of MSAs based on the "true" reference alignment, where insertion and deletion events can be mapped individually onto both the guide tree and the true MSA. SuiteMSA's intuitive and user-friendly GUI shortens the learning curve for the powerful simulation program iSGv2.1, letting a wide range of researchers set up complex simulation studies quickly and accurately. With the MSA Viewer, MSA Comparator, Pixel Plot, a graphical sequence simulator, the Phylogeny Viewer with graphical editing options, and the Alignment Viewer with indel-event tracking, SuiteMSA contributes a wide variety of unique features to the fields of multiple sequence alignment, sequence evolution simulation, and more general bioinformatics research.
Availability and requirements
Project name: SuiteMSA
Project home page: http://bioinfolab.unl.edu/~canderson/SuiteMSA/
Operating system(s): Mac OS X 10.5 or higher, Linux, and Unix
Programming language: Java 1.6
Other requirements: iSGv2.1 must be installed per instructions for sequence simulation. ClustalW2 and MUSCLE need to be installed if the user wishes to use the GUIs provided with SuiteMSA.
Any restrictions to use by non-academics: none
Development of SuiteMSA has been partially supported by NSF AToL grant 0732863 to ENM.
- Kemena C, Notredame C: Upcoming challenges for multiple sequence alignment methods in the high-throughput era. Bioinformatics 2009, 25: 2455–2465. 10.1093/bioinformatics/btp452
- Gouy M, Guindon S, Gascuel O: SeaView Version 4: a multiplatform graphical user interface for sequence alignment and phylogenetic tree building. Mol Biol Evol 2010, 27: 221–224. 10.1093/molbev/msp259
- Larkin MA, Blackshields G, Brown NP, Chenna R, McGettigan PA, McWilliam H, Valentin F, Wallace IM, Wilm A, Lopez R, Thompson JD, Gibson TJ, Higgins DG: Clustal W and Clustal X version 2.0. Bioinformatics 2007, 23: 2947–2948. 10.1093/bioinformatics/btm404
- Waterhouse AM, Procter JB, Martin DMA, Clamp M, Barton GJ: Jalview Version 2 - a multiple sequence alignment editor and analysis workbench. Bioinformatics 2009, 25: 1189–1191. 10.1093/bioinformatics/btp033
- Löytynoja A, Goldman N: webPRANK: a phylogeny-aware multiple sequence aligner with interactive alignment browser. BMC Bioinformatics 2010, 11: 579. 10.1186/1471-2105-11-579
- Tamura K, Dudley J, Nei M, Kumar S: MEGA4: molecular evolutionary genetics analysis (MEGA) software version 4.0. Mol Biol Evol 2007, 24: 1596–1599. 10.1093/molbev/msm092
- Shih AC, Lee DT, Lin L, Peng CL, Chen SH, Wu YW, Wong CY, Chou MY, Shiao TC, Hsieh MF: SinicView: a visualization environment for comparisons of multiple nucleotide sequence alignment tools. BMC Bioinformatics 2006, 7: 103. 10.1186/1471-2105-7-103
- Morrison DA: Why would phylogeneticists ignore computerized sequence alignment? Syst Biol 2009, 58: 150–158. 10.1093/sysbio/syp009
- Morrison DA: A framework for phylogenetic sequence alignment. Plant Syst Evol 2009, 282: 127–149. 10.1007/s00606-008-0072-5
- Edgar RC: MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res 2004, 32: 1792–1797. 10.1093/nar/gkh340
- Raghava GP, Searle SM, Audley PC, Barber JD, Barton GJ: OXBench: a benchmark for evaluation of protein multiple sequence alignment accuracy. BMC Bioinformatics 2003, 4: 47. 10.1186/1471-2105-4-47
- Stebbings LA, Mizuguchi K: HOMSTRAD: recent developments of the Homologous Protein Structure Alignment Database. Nucleic Acids Res 2004, 32: D203–207. 10.1093/nar/gkh027
- Thompson JD, Koehl P, Ripp R, Poch O: BAliBASE 3.0: latest developments of the multiple sequence alignment benchmark. Proteins 2005, 61: 127–136. 10.1002/prot.20527
- Van Walle I, Lasters I, Wyns L: SABmark-a benchmark for sequence alignment that covers the entire known fold space. Bioinformatics 2005, 21: 1267–1268. 10.1093/bioinformatics/bth493
- Edgar RC: Quality measures for protein alignment benchmarks. Nucleic Acids Res 2010, 38: 2145–2153. 10.1093/nar/gkp1196
- Hillis DM: Approaches for assessing phylogenetic accuracy. Syst Biol 1995, 44: 3–16. 10.1093/sysbio/44.1.3
- Fletcher W, Yang Z: INDELible: a flexible simulator of biological sequence evolution. Mol Biol Evol 2009, 26: 1879–1888. 10.1093/molbev/msp098
- Stoye J, Evers D, Meyer F: Rose: generating sequence families. Bioinformatics 1998, 14: 157–163. 10.1093/bioinformatics/14.2.157
- Cartwright RA: DNA assembly with gaps (Dawg): simulating sequence evolution. Bioinformatics 2005, 21: iii31-iii38. 10.1093/bioinformatics/bti1200
- Rosenberg MS: MySSP: non-stationary evolutionary sequence simulation, including indels. Evol Bioinform Online 2005, 1: 81–83.
- Pang A, Smith AD, Nuin PA, Tillier ER: SIMPROT: using an empirically determined indel distribution in simulations of protein evolution. BMC Bioinformatics 2005, 6: 236. 10.1186/1471-2105-6-236
- Hall BG: Simulating DNA coding sequence evolution with EvolveAGene 3. Mol Biol Evol 2008, 25: 688–695. 10.1093/molbev/msn008
- Strope CL, Abel K, Scott SD, Moriyama EN: Biological sequence simulation for testing complex evolutionary hypotheses: indel-Seq-Gen version 2.0. Mol Biol Evol 2009, 26: 2581–2593. 10.1093/molbev/msp174
- Subramanian AR, Weyer-Menkhoff J, Kaufmann M, Morgenstern B: DIALIGN-T: an improved algorithm for segment-based multiple sequence alignment. BMC Bioinformatics 2005, 6: 66. 10.1186/1471-2105-6-66
- Flower DR, North ACT, Attwood TK: Structure and sequence relationships in the lipocalins and related proteins. Protein Sci 1993, 2: 753–761. 10.1002/pro.5560020507
- Sánchez D, Ganfornina MD, Gutiérrez G, Marín A: Exon-intron structure and evolution of the lipocalin gene family. Mol Biol Evol 2003, 20: 775–783. 10.1093/molbev/msg079
- Strope CL, Scott SD, Moriyama EN: indel-Seq-Gen: a new protein family simulator incorporating domains, motifs, and indels. Mol Biol Evol 2007, 24: 640–649.
- Jones T: Protein secondary structure prediction based on position-specific scoring matrices. J Mol Biol 1999, 292: 195–202. 10.1006/jmbi.1999.3091
- Thompson J, Plewniak F, Poch O: A comprehensive comparison of multiple sequence alignment programs. Nucleic Acids Res 1999, 27: 2682–2690. 10.1093/nar/27.13.2682
- Katoh K, Toh H: Recent developments in the MAFFT multiple sequence alignment program. Brief Bioinform 2008, 9: 286–298. 10.1093/bib/bbn013
- Edgar RC: MUSCLE: a multiple sequence alignment method with reduced time and space complexity. BMC Bioinformatics 2004, 5: 113. 10.1186/1471-2105-5-113
- Sigrist CJA, Cerutti L, de Castro E, Langendijk-Genevaux PS, Bulliard V, Bairoch A, Hulo N: PROSITE, a protein domain database for functional characterization and annotation. Nucleic Acids Res 2010, 38: D161-D166. 10.1093/nar/gkp885
- PROSITE: Database of protein domains, families and functional sites [http://ca.expasy.org/prosite/]
- Abascal F, Zardoya R, Posada D: ProtTest: selection of best-fit models of protein evolution. Bioinformatics 2005, 21: 2104–2105. 10.1093/bioinformatics/bti263
- PROTTEST: Selection of best-fit models of protein evolution [http://darwin.uvigo.es/software/prottest.html]
- Guindon S, Gascuel O: A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood. Syst Biol 2003, 52: 696–704. 10.1080/10635150390235520
- Drummond A, Strimmer K: PAL: an object-oriented programming library for molecular evolution and phylogenetics. Bioinformatics 2001, 17: 662–663. 10.1093/bioinformatics/17.7.662
- Whelan S, Goldman N: A general empirical model of protein evolution derived from multiple protein families using a maximum-likelihood approach. Mol Biol Evol 2001, 18: 691–699.
- Clustal programs download site [http://www.clustal.org/download/current/]
- MUSCLE program download site [http://www.drive5.com/muscle/]
- Löytynoja A, Goldman N: Phylogeny-aware gap placement prevents errors in sequence alignment and evolutionary analysis. Science 2008, 320: 1632–1635. 10.1126/science.1158395
- Schneider T, Stephens R: Sequence logos: a new way to display consensus sequences. Nucleic Acids Res 1990, 18: 6097–6100. 10.1093/nar/18.20.6097
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.