PFAAT version 2.0: A tool for editing, annotating, and analyzing multiple sequence alignments
© Caffrey et al; licensee BioMed Central Ltd. 2007
Received: 14 August 2007
Accepted: 11 October 2007
Published: 11 October 2007
By virtue of their shared ancestry, homologous sequences are similar in their structure and function. Consequently, multiple sequence alignments are routinely used to identify trends that relate to function. This type of analysis is particularly productive when it is combined with structural and phylogenetic analysis.
Here we describe the release of PFAAT version 2.0, a tool for editing, analyzing, and annotating multiple sequence alignments. Support for multiple annotations is a key component of this release as it provides a framework for most of the new functionalities. The sequence annotations are accessible from the alignment and tree, where they are typically used to label sequences or hyperlink them to related databases. Sequence annotations can be created manually or extracted automatically from UniProt entries. Once a multiple sequence alignment is populated with sequence annotations, sequences can be easily selected and sorted through a sophisticated search dialog. The selected sequences can be further analyzed using statistical methods that explicitly model relationships between the sequence annotations and residue properties. Residue annotations are accessible from the alignment viewer and are typically used to designate binding sites or properties for a particular residue.
Residue annotations are also searchable, and allow one to quickly select alignment columns for further sequence analysis, e.g. computing percent identities. Other features include: novel algorithms to compute sequence conservation, mapping conservation scores to a 3D structure in Jmol, displaying secondary structure elements, and sorting sequences by residue composition.
PFAAT provides a framework whereby end-users can specify knowledge for a protein family in the form of annotations. The annotations can be combined with sophisticated analysis to test hypotheses that relate to sequence, structure and function.
Building a multiple sequence alignment (MSA) is a critical step towards understanding the function and evolution of a protein family. Subsequent analysis typically includes phylogenetics, homology modeling, structure prediction, and binding site prediction. There are several excellent software packages that align multiple sequences. Alignment accuracy usually depends on the percent amino acid identity between sequences, and manual editing is often a necessary step. Alignment editing tools are available in PFAAT as well as several other applications [3–10]. Additionally, MSA viewers provide various tools for sequence and structural analysis [3, 5–7, 11–16]. More recently, it has been recognized that MSAs can be used to validate and propagate annotations to other sequences. PFAAT specializes in the annotation and analysis of an MSA, and since the release of version 1.0, we have continued to develop and add novel features to PFAAT. We describe some of the main features below.
PFAAT is written in Java and runs on several operating systems (Linux, Mac OS X, Solaris, and Windows). Users initially download and install the program from the home page using Java Web Start technology. Updated versions of the application are automatically downloaded on subsequent launches if the user is connected to the internet; otherwise the cached executable is used. Although PFAAT was not explicitly implemented for viewing nucleotide alignments, many of the generic features can also be applied to nucleotide sequences.
Results and Discussion
Double clicking on any of the three Name Panels will display the sequence annotations dialog box. Double clicking on a residue will display the residue annotation dialog box. The Tree viewer and structural viewer can be launched from the Analysis and File menus respectively. The tool bar (Figure 1) contains several drop-down menus that change the alignment view. The top row of drop-down menus changes the displayed annotation in Name Panels 1–3. The bottom row of drop-down menus sorts sequences by annotation value and changes the font size.
When working with a large number of sequences, sequence annotations facilitate rapid sorting and triaging of sequences. For example, the Find menu allows one to find and select sequences that match one or more search terms (e.g. species equals Homo sapiens AND Pdb is not empty). The selected sequences can then be moved to the top using View -> Sort Sequences by -> selection.
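The search-then-sort workflow described above can be sketched in a few lines of Python. This is only an illustration of the idea, not PFAAT's implementation: the record layout (a dict of annotation names to values), the function names, and the sequence names and Pdb values are all made up for the example.

```python
# Hypothetical in-memory model: each sequence carries a dict of annotations.
# Names and annotation values are placeholders, not real database entries.
sequences = [
    {"name": "seq1", "annotations": {"species": "Mus musculus", "Pdb": "1ABC"}},
    {"name": "seq2", "annotations": {"species": "Homo sapiens", "Pdb": ""}},
    {"name": "seq3", "annotations": {"species": "Homo sapiens", "Pdb": "2XYZ"}},
    {"name": "seq4", "annotations": {"species": "Homo sapiens", "Pdb": "3DEF"}},
]


def equals(key, value):
    """Search term: annotation `key` equals `value`."""
    return lambda seq: seq["annotations"].get(key) == value


def not_empty(key):
    """Search term: annotation `key` is present and non-empty."""
    return lambda seq: bool(seq["annotations"].get(key))


def find(seqs, terms):
    """Return the names of sequences that match ALL terms (logical AND)."""
    return {s["name"] for s in seqs if all(term(s) for term in terms)}


def sort_by_selection(seqs, selected):
    """Move selected sequences to the top. sorted() is stable, so the
    original order is preserved within each group."""
    return sorted(seqs, key=lambda s: s["name"] not in selected)


# "species equals Homo sapiens AND Pdb is not empty"
selected = find(sequences, [equals("species", "Homo sapiens"), not_empty("Pdb")])
ordered = [s["name"] for s in sort_by_selection(sequences, selected)]
print(ordered)
```

Representing each search term as a predicate keeps the AND combination a one-liner; an OR search would use `any` in place of `all`.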
Residue annotations provide a gateway for several types of subsequent analysis. For example, the Find menu allows one to quickly find and select residue annotations that match one or more search terms. The residue selection can be extended to the alignment column, and there is an option to hide all other columns. As a next step, one might apply one of the many features that can be applied to selected columns, including sorting by percent identity and most of the features in the Analysis menu.
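Extending a residue selection to alignment columns requires mapping a position in the ungapped sequence to a column of the gapped alignment. The sketch below shows one way this could work; it assumes residue annotations are keyed by a 0-based index into the ungapped sequence, which is an assumption made for illustration and may differ from how PFAAT stores them. All names and the toy alignment are hypothetical.

```python
def residue_to_column(aligned_seq, residue_index, gap_chars="-."):
    """Map a 0-based index into the ungapped sequence to its 0-based
    alignment column by counting non-gap characters."""
    seen = -1
    for column, char in enumerate(aligned_seq):
        if char not in gap_chars:
            seen += 1
            if seen == residue_index:
                return column
    raise IndexError("residue index is beyond the end of the sequence")


def restrict_to_columns(alignment, columns):
    """Keep only the given columns (i.e. hide all others), in ascending order."""
    cols = sorted(set(columns))
    return {name: "".join(seq[c] for c in cols) for name, seq in alignment.items()}


# Toy alignment; sequences and annotations are invented for the example.
alignment = {
    "seqA": "MK-LVDE",
    "seqB": "MRQLV-E",
    "seqC": "M-QIVDE",
}
# Assumed layout: ungapped residue index on seqA -> annotation label.
residue_annotations = {1: "binding site", 4: "binding site", 5: "catalytic"}

# Find the annotated residues, extend them to columns, and hide the rest.
hits = [i for i, label in residue_annotations.items() if label == "binding site"]
columns = [residue_to_column(alignment["seqA"], i) for i in hits]
view = restrict_to_columns(alignment, columns)
print(columns, view)
```

The restricted view can then be passed to any column-based analysis, such as a percent identity calculation over the binding-site columns only.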
PFAAT reconstructs phylogenetic trees using an implementation of the neighbor joining algorithm. An option to perform bootstrap analysis is also provided. Trees can be reconstructed using selected sequences or selected columns. PFAAT recognizes various tree formats (nh, nhx, nexus) and can display tree files generated by other software.
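For readers unfamiliar with neighbor joining, the following is a compact textbook-style sketch of the algorithm in Python. It is not PFAAT's Java implementation, and it omits bootstrap analysis. The function name, the dictionary-based distance store, and the five-taxon distance matrix are illustrative assumptions; the output is a Newick string, of the kind typically stored in nh files.

```python
def neighbor_joining(names, dist):
    """Build a neighbor joining tree and return it as a Newick string.

    names: list of distinct taxon labels.
    dist:  symmetric distance matrix (list of lists) in the same order.
    """
    labels = list(names)  # active nodes, each represented by its Newick fragment
    d = {(a, b): dist[i][j]
         for i, a in enumerate(labels) for j, b in enumerate(labels)}
    while len(labels) > 2:
        n = len(labels)
        # r(a): total distance from node a to all active nodes.
        r = {a: sum(d[a, b] for b in labels) for a in labels}
        # Join the pair that minimises Q(a, b) = (n - 2) d(a, b) - r(a) - r(b).
        a, b = min(
            ((x, y) for i, x in enumerate(labels) for y in labels[i + 1:]),
            key=lambda p: (n - 2) * d[p] - r[p[0]] - r[p[1]],
        )
        # Branch lengths from a and b to the new internal node u.
        la = 0.5 * d[a, b] + (r[a] - r[b]) / (2 * (n - 2))
        lb = d[a, b] - la
        u = f"({a}:{la:g},{b}:{lb:g})"
        rest = [c for c in labels if c not in (a, b)]
        # Distances from the new node to every remaining node.
        for c in rest:
            d[u, c] = d[c, u] = 0.5 * (d[a, c] + d[b, c] - d[a, b])
        d[u, u] = 0.0
        labels = rest + [u]
    # Two nodes remain; connect them with a single branch.
    a, b = labels
    return f"({a}:{d[a, b]:g},{b});"


# Illustrative additive distances for five taxa.
names = ["a", "b", "c", "d", "e"]
dist = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
tree = neighbor_joining(names, dist)
print(tree)
```

Reconstructing a tree from selected sequences or selected columns amounts to computing the distance matrix from that subset before calling such a routine.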
There are a number of sequence analysis tools, most of which are found under the Analysis menu. For example, amino acid percent identities can be computed between all sequences or over a subset of their columns. There is also an identity count, which reports, for each column, the number of sequences that have a residue identical to that of a chosen reference sequence. There are a variety of conservation scores, the default being a von Neumann Entropy based score (described below) that can be applied to selected sequences as well as selected columns. The conservation scores can be mapped to a 3D structure as discussed above. The PLSR method allows one to identify sequence trends that best correlate with numerical experimental measurements (e.g. binding data stored as a sequence annotation).
Immediately above each alignment column is a gray box. A single click on a box shows the number and type of residues found at that column. In sort mode, the user can select a residue type that determines how the sequences are sorted. For example, one might be interested in moving all sequences that have a lysine or arginine at column 100 to the top. In filter mode, all sequences that do not have a lysine or arginine would be hidden. The sort mode is often used for mutagenesis experiments as it provides a convenient summary of the residues tolerated at a given position. The filter mode can be used when designing selective drugs for a large gene family. Several other features are described in the documentation on the PFAAT home page.
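The column-based calculations described above are simple to express in code. The Python sketch below illustrates percent identity, the identity count, the per-column residue summary, and the sort and filter modes. Function names and the toy alignment are hypothetical, and the gap handling (skipping any column where either sequence has a gap) is one convention among several, assumed here rather than taken from PFAAT.

```python
from collections import Counter


def percent_identity(seq_a, seq_b, columns=None, gaps="-."):
    """Percent of compared columns with identical residues. Columns where
    either sequence has a gap are skipped (an assumed convention)."""
    cols = range(len(seq_a)) if columns is None else columns
    pairs = [(seq_a[c], seq_b[c]) for c in cols
             if seq_a[c] not in gaps and seq_b[c] not in gaps]
    if not pairs:
        return 0.0
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)


def identity_count(alignment, reference):
    """Per column: how many other sequences match the reference residue."""
    ref = alignment[reference]
    return [sum(seq[c] == ref[c]
                for name, seq in alignment.items() if name != reference)
            for c in range(len(ref))]


def column_composition(alignment, column):
    """Number and type of residues found at one column."""
    return Counter(seq[column] for seq in alignment.values())


def sort_by_residue(alignment, column, residues):
    """Sort mode: names with one of `residues` at `column` come first."""
    return sorted(alignment, key=lambda name: alignment[name][column] not in residues)


def filter_by_residue(alignment, column, residues):
    """Filter mode: hide sequences lacking one of `residues` at `column`."""
    return {n: s for n, s in alignment.items() if s[column] in residues}


alignment = {"s1": "MKLV", "s2": "MRLV", "s3": "MELI", "s4": "MK-V"}
print(percent_identity(alignment["s1"], alignment["s2"]))
print(identity_count(alignment, "s1"))
print(column_composition(alignment, 1))
print(sort_by_residue(alignment, 1, "KR"))
print(list(filter_by_residue(alignment, 1, "KR")))
```

Passing a `columns` list to `percent_identity` restricts the calculation to a column selection, which mirrors computing identities over a subset of columns.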
Von Neumann Entropy
Although Shannon Entropy is a popular measure of residue conservation, it incorrectly treats amino acids as being orthogonal. Von Neumann Entropy overcomes this shortcoming and is the default measure of residue conservation in PFAAT. Shannon Entropy is described in equation 1, where $i$ enumerates each mutually exclusive entity, $\lambda_i \geq 0$ and $\sum_i \lambda_i = 1$. The $\lambda_i$ are a measure of the probability of encountering the entity $i$ in the collection.
$$\mathrm{Entropy} = -\sum_i \lambda_i \log(\lambda_i) \tag{1}$$
As the 20 amino acids are non-orthogonal (overlapping) vectors, the set must be expressed in terms of an equivalent orthogonal basis set. The mutual overlap of the distinct amino acid vectors in each column is described by a matrix $\rho$ encoding the pairwise similarities between these non-orthogonal vectors. We have found that the following simple 20 × 20 matrix, also called the density matrix, works well for amino acid conservation:
$$\rho = F S \tag{2}$$
where $F$ is a diagonal matrix of amino acid 'counts' or frequencies and $S$ is an appropriate amino acid similarity matrix (e.g. BLOSUM62).
Now $\rho$ can be naturally expressed in terms of an orthogonal basis through diagonalization, i.e. by calculating its eigenvectors $E$ and eigenvalues $\Lambda = \mathrm{diag}(\lambda_i)$:
$$\rho = E \Lambda E^{-1} \tag{3}$$
The eigenvectors can be interpreted as 20 orthonormal amino acid properties spanning 'amino acid space'. If $\rho$ is normalized such that $\mathrm{Trace}(\rho) = 1$ (i.e. $\sum_i \lambda_i = 1$), the eigenvalues $\lambda_i$ can be interpreted as the probabilities of encountering each of these 20 orthogonal eigenvector properties in the column. Inserting the eigenvalues $\lambda_i$ into formula (1) now gives the entropy measure we desire. The entropy measure can in fact be written directly in terms of $\rho$ itself
$$\text{Von Neumann Entropy} = -\mathrm{Trace}(\rho \log \rho) \tag{4}$$
as can be seen by inserting (3) into (4) to recover (1). Equation (1) is computationally more efficient than equation (4) and is implemented in PFAAT.
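The calculation in equations (1) to (3) can be illustrated with a small, dependency-free Python sketch. It is not PFAAT's code. To keep the example short it uses a hypothetical three-letter alphabet and a made-up positive definite similarity matrix rather than the 20 amino acids and BLOSUM62. It also relies on the fact that $F S$ and the symmetric matrix $F^{1/2} S F^{1/2}$ share the same eigenvalues, so a simple Jacobi routine for symmetric matrices suffices. Non-positive eigenvalues are skipped, which is a simplification; how to treat a similarity matrix with negative eigenvalues is outside the scope of this sketch.

```python
import math


def symmetric_eigenvalues(a, sweeps=50, tol=1e-12):
    """Eigenvalues of a symmetric matrix by cyclic Jacobi rotations."""
    a = [row[:] for row in a]
    n = len(a)
    for _ in range(sweeps):
        off = sum(a[i][j] ** 2 for i in range(n) for j in range(n) if i != j)
        if off < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if abs(a[p][q]) < 1e-15:
                    continue
                diff = a[q][q] - a[p][p]
                # Rotation angle that zeroes a[p][q], kept within [-pi/4, pi/4].
                theta = math.pi / 4 if diff == 0 else 0.5 * math.atan(2 * a[p][q] / diff)
                c, s = math.cos(theta), math.sin(theta)
                for k in range(n):  # rotate columns p and q
                    akp, akq = a[k][p], a[k][q]
                    a[k][p], a[k][q] = c * akp - s * akq, s * akp + c * akq
                for k in range(n):  # rotate rows p and q
                    apk, aqk = a[p][k], a[q][k]
                    a[p][k], a[q][k] = c * apk - s * aqk, s * apk + c * aqk
    return [a[i][i] for i in range(n)]


def shannon_entropy(probs):
    """Equation (1); non-positive values are skipped."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def von_neumann_entropy(column, alphabet, similarity):
    """Entropy of one alignment column from the eigenvalues of rho = F S."""
    n = len(alphabet)
    root = [math.sqrt(column.count(aa)) for aa in alphabet]
    # F^(1/2) S F^(1/2): symmetric, with the same eigenvalues as F S.
    m = [[root[i] * similarity[i][j] * root[j] for j in range(n)] for i in range(n)]
    trace = sum(m[i][i] for i in range(n))  # normalise so that Trace(rho) = 1
    eigenvalues = symmetric_eigenvalues([[x / trace for x in row] for row in m])
    return shannon_entropy(eigenvalues)


alphabet = "ABC"          # hypothetical three-letter alphabet
toy_similarity = [        # made-up matrix: A and B are similar, C is distinct
    [1.0, 0.5, 0.0],
    [0.5, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
identity = [[1.0 if i == j else 0.0 for j in range(3)] for i in range(3)]

print(von_neumann_entropy("AABB", alphabet, toy_similarity))
print(von_neumann_entropy("AABB", alphabet, identity))
```

With the identity matrix as $S$ the residues are treated as orthogonal and the result reduces to the Shannon Entropy of the column frequencies. With the toy similarity matrix the same column scores a lower entropy, reflecting that a mix of similar residues is treated as more conserved than a mix of unrelated ones.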
An MSA provides valuable information about a protein family. Additional knowledge is provided by the user in the form of annotations. By combining these annotations with sophisticated analysis, PFAAT allows researchers to test hypotheses that relate to sequence, structure and function. This release of PFAAT marks a significant improvement in functionality over version 1.0. The major improvements are described in the What's new? section of the user documentation. We eagerly anticipate user feedback, and a 'request features' link is provided on the project home page. Future areas of development might include the extraction of sequence annotations from additional databases (e.g. GO, KEGG, and PFAM) and employing mechanisms to propagate annotations to other sequences.
Availability and Requirements
Abbreviations
PFAAT – Protein Family Alignment Annotation Tool
MSA – Multiple Sequence Alignment
PLSR – Partial Least Squares Regression
We thank all authors who contributed to version 1.0 of PFAAT. We thank the ATV and Jmol project teams for making their code available. We thank end-users for their feedback, suggestions, and bug reports.
- Edgar RC, Batzoglou S: Multiple sequence alignment. Curr Opin Struct Biol. 2006, 16 (3): 368-373. 10.1016/j.sbi.2006.04.004.
- Thompson JD, Plewniak F, Poch O: A comprehensive comparison of multiple sequence alignment programs. Nucleic Acids Res. 1999, 27 (13): 2682-2690. 10.1093/nar/27.13.2682.
- BioEdit. [http://www.mbio.ncsu.edu/BioEdit/bioedit.html]
- MPSA. [http://mpsa.ibcp.fr/]
- Bentz J, Baucom A, Hansen M, Gregoret L: DINAMO: interactive protein alignment and model building. Bioinformatics. 1999, 15 (4): 309-316. 10.1093/bioinformatics/15.4.309.
- Clamp M, Cuff J, Searle SM, Barton GJ: The Jalview Java alignment editor. Bioinformatics. 2004, 20 (3): 426-427. 10.1093/bioinformatics/btg430.
- Deleage G, Combet C, Blanchet C, Geourjon C: ANTHEPROT: an integrated protein sequence analysis software with client/server capabilities. Comput Biol Med. 2001, 31 (4): 259-267. 10.1016/S0010-4825(01)00008-7.
- Galtier N, Gouy M, Gautier C: SEAVIEW and PHYLO_WIN: two graphic tools for sequence alignment and molecular phylogeny. Comput Appl Biosci. 1996, 12 (6): 543-548.
- Nicholas KB: GeneDoc: Analysis and Visualization of Genetic Variation. EMBNET NEWS. 1997, 4: 14.
- Parry-Smith DJ, Payne AW, Michie AD, Attwood TK: CINEMA–a novel colour INteractive editor for multiple alignments. Gene. 1998, 221 (1): GC57-63. 10.1016/S0378-1119(97)00650-1.
- BELVU. [http://www.cgb.ki.se/cgb/groups/sonnhammer/Belvu.html]
- Barton GJ: ALSCRIPT: a tool to format multiple sequence alignments. Protein Eng. 1993, 6 (1): 37-40. 10.1093/protein/6.1.37.
- Goodstadt L, Ponting CP: CHROMA: consensus-based colouring of multiple alignments for publication. Bioinformatics. 2001, 17 (9): 845-846. 10.1093/bioinformatics/17.9.845.
- Ilyin VA, Pieper U, Stuart AC, Marti-Renom MA, McMahan L, Sali A: ModView, visualization of multiple protein sequences and structures. Bioinformatics. 2003, 19 (1): 165-166. 10.1093/bioinformatics/19.1.165.
- Li W, Godzik A: VISSA: a program to visualize structural features from structure sequence alignment. Bioinformatics. 2006, 22 (7): 887-888. 10.1093/bioinformatics/btl019.
- Wang Y, Geer LY, Chappey C, Kans JA, Bryant SH: Cn3D: sequence and structure views for Entrez. Trends Biochem Sci. 2000, 25 (6): 300-302. 10.1016/S0968-0004(00)01561-9.
- Thompson JD, Muller A, Waterhouse A, Procter J, Barton GJ, Plewniak F, Poch O: MACSIMS: multiple alignment of complete sequences information management system. BMC Bioinformatics. 2006, 7: 318. 10.1186/1471-2105-7-318.
- Johnson JM, Mason K, Moallemi C, Xi H, Somaroo S, Huang ES: Protein family annotation in a multiple alignment viewer. Bioinformatics. 2003, 19 (4): 544-545. 10.1093/bioinformatics/btg021.
- Wu CH, Apweiler R, Bairoch A, Natale DA, Barker WC, Boeckmann B, Ferro S, Gasteiger E, Huang H, Lopez R, Magrane M, Martin MJ, Mazumder R, O'Donovan C, Redaschi N, Suzek B: The Universal Protein Resource (UniProt): an expanding universe of protein information. Nucleic Acids Res. 2006, 34 (Database issue): D187-191. 10.1093/nar/gkj161.
- Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE: The Protein Data Bank. Nucleic Acids Res. 2000, 28 (1): 235-242. 10.1093/nar/28.1.235.
- Hubbard TJ, Aken BL, Beal K, Ballester B, Caccamo M, Chen Y, Clarke L, Coates G, Cunningham F, Cutts T: Ensembl 2007. Nucleic Acids Res. 2007, 35 (Database issue): D610-617. 10.1093/nar/gkl996.
- Saitou N, Nei M: The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol Biol Evol. 1987, 4 (4): 406-425.
- Zmasek CM, Eddy SR: ATV: display and manipulation of annotated phylogenetic trees. Bioinformatics. 2001, 17 (4): 383-384. 10.1093/bioinformatics/17.4.383.
- Jmol. [http://www.jmol.org/]
- Lee B, Richards FM: The interpretation of protein structures: estimation of static accessibility. J Mol Biol. 1971, 55 (3): 379-400. 10.1016/0022-2836(71)90324-X.
- Morgan DH, Kristensen DM, Mittelman D, Lichtarge O: an application for predicting and visualizing functional sites in protein structures. Bioinformatics. 2006, 22 (16): 2049-2050. 10.1093/bioinformatics/btl285.
- Strang G: Orthogonality. Introduction to Linear Algebra. 1993, Wellesley-Cambridge Press, Wellesley, MA.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.