Volume 10 Supplement 11
Exploratory visual analysis of conserved domains on multiple sequence alignments
© Jankun-Kelly et al; licensee BioMed Central Ltd. 2009
Published: 8 October 2009
Multiple alignment of protein sequences can provide insight into sequence conservation across many species and thus allow identification of those sections of the sequence most critical to protein function. This insight can be augmented by joint display of conserved domains along the sequences. By fusing this metadata visually, biologists can analyze sequence conservation and functional motifs simultaneously and efficiently.
We present MSAVis, a new approach combining luminance and hue for simultaneous visualization of conserved motifs and sequence alignment. Input for the algorithm is a multiple sequence alignment in a standard format. The NCBI Conserved Domain Database (CDD) is used for finding conserved domains along the alignment. The visualization quickly identifies conserved domains, and allows both macro (sequence-long) and micro (small amino-acid neighborhood) views.
MSAVis utilizes two visual cues, luminance and hue, to facilitate at-a-glance summary of the conservation of a user-provided protein alignment while enabling multiple comparisons among functional domains. These visual cues are preattentive and separable so that the relationship between conservation strength and domain membership can be understood. The MSAVis software, written in Python and using BioPython and OpenGL, can be found at http://agbase.msstate.edu/tools/MSAVis.html.
Sequence alignment provides information about the primary structural similarity of the proteins; however, this is not by itself indicative of their functional similarity. Biologists often want to investigate the functional domains of proteins. These are the sections of a protein's sequence that enable it to serve a particular biological role. Because these sections tend to be evolutionarily conserved (they remain the same in related organisms), they are also called conserved domains (CDs). NCBI states, "computational biologists define conserved domains based on recurring sequence patterns or motifs." After a sequence has been identified as a functional conserved domain experimentally or using predictive methods, a computer model of the domain can be generated. That model is then used to identify the domain in sequences from other organisms. There are many online databases that take a protein sequence as a query and return matching domains; for this work, the Conserved Domain Database (CDD) from NCBI [2, 3] was used.
Both the MSA and CD diagrams provide clues to the similarities of proteins: The former indicates conservation of primary structure while the latter shows inferred functional motifs. When these are viewed individually, however, it is difficult to determine how the primary structure and motifs interrelate. Sequence information shows the similarity of the multiple proteins, but no domain information. Similarly, domain information is typically only displayed for one protein at a time. To address these limitations, we have developed MSAVis, a method to visually fuse the strength of the alignment and the presence of conserved domains over a set of proteins. The remainder of this article reviews systems similar to MSAVis, then describes the capabilities of MSAVis and how it was designed using principles of visualization.
Bioinformatics visualization presents inherently non-spatial biological data using interactive computer graphics. The purpose of these depictions is to solve a domain scientist's task. The variety of biovisualization systems is beyond the scope of our work; for a survey, see the summary by Lungu and Xu. Instead, we review the work in sequence alignment and conserved domain visualization as a basis for our discussion about MSAVis.
As Figure 1 demonstrates, textual views of MSAs do not scale either in the number of proteins (rows per block) or in the length of the sequence (number of blocks). Graphical methods thus attempt to compress this information or provide methods for effective exploration. One of the first of these was the sequence logo. A sequence logo compresses the rows of the MSA diagram into a single display – the vertical column summarizes an entire group of proteins or nucleic acids (DNA or RNA). For each column (a position in the protein/nucleic acid string), a stack of the letters that occur in that position is drawn; the letters are scaled by occurrence so that amino acids or nucleotides that occur more often are taller than those that occur infrequently. Sequence logos provide a quick glance at the most common characters amongst the sequences; however, the structure of an individual sequence is lost. Note that a sequence logo does not compress the length of the sequence, so scrolling is still required to view very long chains of characters.
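The letter scaling described above can be sketched in a few lines of Python. This is a minimal illustration that follows the simplified description in the text (height proportional to frequency within the column); the published sequence logo method additionally scales each stack by the information content of the column, which is not shown here. The function names and the toy alignment are assumptions made for the example.

```python
from collections import Counter


def logo_column_heights(column, total_height=1.0):
    """Return {residue: height} for one column, scaled by frequency.

    Gaps ('-') are ignored; an all-gap column yields an empty stack.
    """
    residues = [c for c in column if c != "-"]
    if not residues:
        return {}
    counts = Counter(residues)
    return {res: total_height * n / len(residues) for res, n in counts.items()}


def logo_heights(alignment):
    """Compute the letter stack for every column of equal-length rows."""
    return [logo_column_heights(col) for col in zip(*alignment)]


# Hypothetical four-row alignment, used only for illustration.
alignment = ["MKVL", "MKIL", "MRVL", "M-VL"]
heights = logo_heights(alignment)
```

In this toy alignment the third column is V in three of four rows, so V is drawn three times as tall as I, while the fully conserved first and last columns contain a single full-height letter.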
In contrast to sequence logos, SequenceJuxtaposer displays each character in every sequence over the entire length. However, it elides the amino acid/nucleotide characters, representing each by a unique color. Thus, it provides a zoomed-out view of the entire sequence as a colored matrix where position (i, j) indicates the jth amino acid in protein i. The user can expand the display vertically (e.g., focusing on a subset of proteins) or horizontally (e.g., focusing on a region of the sequence). As the user zooms, the characters of the sequence are shown as space becomes available. Other methods compress the sequences graphically as well [7–9]. Some, like SequenceJuxtaposer, only show sequence information; others, like Phylo-VISTA, also show related information (e.g., genomic data). None of them, however, fuses the conserved domain information with the sequences.
The area of conserved domain visualization is less explored. NCBI's Conserved Domain Database (CDD) provides a suite of tools for depicting CDs [2, 3]. For a single protein, it produces images as demonstrated by Figure 2: All CDs over the protein are depicted. Similar, non-interactive views are available from InterPro [10, 11] and UniProt [12, 13]; proteins are shown (one at a time or in groups) with all the domains but no alignment details. This grouping prevents efficient visual comparison. In addition, CDD uses a phylogenetic tree-like visualization to show related domains from other species for a given selected CD. However, this tree view loses the sequence information from the original protein – one cannot view all the CDs from a group of proteins concurrently.
Jalview is the tool that is most closely related to ours. Jalview can be used to both display and edit protein sequences, and can depict considerable metadata about a sequence. Color is used in the sequence view to show either a single CD or the overall strength of the alignment; separate windows are used to break down the elements of the alignment strength. However, with Jalview it is not possible to effectively view more than a few domains on an alignment, especially if multiple conserved domain matches overlap on part of a sequence. Furthermore, it is not easy to get an overall view, across the entire alignment, of where each conserved domain lies, as Jalview does not provide a compressed view as SequenceJuxtaposer does. Our MSAVis approach, though limited to MSA and CD visualization, visually fuses both in a manner not currently supported by extant tools; this visual fusion is explained in the next section.
MSAVis depicts aligned protein sequences, the strength of the alignment, and the conserved domains over the collected proteins. Starting from a pre-calculated alignment provided by the user in ClustalW or PHYLIP format, MSAVis queries the online NCBI CDD for each sequence and parses the results. Thus, for every position along a given sequence, the following is known:
The protein to which the amino acid belongs
Its position within the protein
The strength of the overall alignment at the position over all the proteins
Whether or not the position is part of one of the known conserved domains
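The four items above can be gathered into one record per residue. The following sketch is one possible representation, not the data structure MSAVis itself uses, which the text does not describe. The names `PositionInfo` and `annotate`, the 0-based end-exclusive domain coordinates, and the sample data are all assumptions; the per-column alignment strength is taken as a precomputed input. The sketch is written for current Python 3 rather than the Python 2 versions listed under the software requirements.

```python
from dataclasses import dataclass


@dataclass
class PositionInfo:
    protein: str          # protein to which the amino acid belongs
    residue_index: int    # 0-based position within the ungapped protein
    conservation: float   # alignment strength at this column
    domains: tuple        # names of conserved domains covering the residue


def annotate(alignment, strength, domain_hits):
    """Build one row of PositionInfo records (None at gaps) per protein.

    alignment:   {name: gapped sequence}, all rows the same length
    strength:    one conservation value per alignment column
    domain_hits: {name: [(domain, start, end)]}, ungapped coordinates,
                 0-based and end-exclusive (an assumption of this sketch)
    """
    table = {}
    for name, gapped in alignment.items():
        row, index = [], 0
        for col, char in enumerate(gapped):
            if char == "-":
                row.append(None)  # a gap carries no residue information
                continue
            hits = tuple(d for d, start, end in domain_hits.get(name, [])
                         if start <= index < end)
            row.append(PositionInfo(name, index, strength[col], hits))
            index += 1
        table[name] = row
    return table


# Hypothetical data for illustration only.
alignment = {"human": "MK-VL", "chimp": "MKAVL", "opossum": "MR-IL"}
strength = [1.0, 0.67, 0.33, 0.67, 1.0]
hits = {"human": [("DomA", 1, 3)], "chimp": [("DomA", 1, 4)]}
table = annotate(alignment, strength, hits)
```

Here `table["human"][3]` describes the residue V: it is the third residue of the human protein, sits in a column of strength 0.67, and belongs to the hypothetical domain DomA.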
Each sequence in the alignment is assigned a color, as shown in the key at the top of the interface. Each protein color has roughly the same base luminance so that brightness changes due to the sequence strength appear consistent for each protein (Figure 5). In the visualization, a bar of color is drawn wherever the respective conserved domain is present in that sequence across a specified portion of the alignment. In Figure 3, it is easy to tell that the DNA_methylase site-specific domain is present in every sequence near both ends of the alignment. Similarly, the COG1092 domain (bottom) is only extant in human (green, [GenBank:NP_004403]) and chimp (purple, [GenBank:XP_001151907]). The depiction also quickly identifies potential incomplete domain classifications; for example, the Cyt_C5_DNA_Methalase domain (top group) is missing from the beginning of the Opossum protein whereas it is present in all eight other species. Furthermore, even though most of the domains overlap, it is still easy to see where they lie on the alignment since they are displayed on separate tracks.
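Drawing a bar for a domain requires translating the hit, which is reported on the ungapped protein, into alignment columns, and grouping bars so that each domain occupies its own track. A minimal sketch of that bookkeeping follows. The function names, the coordinate convention (0-based, end-exclusive), the choice to draw a single bar across any gaps inside a hit, and the sample data are assumptions of this example rather than details given in the text.

```python
def residue_columns(gapped):
    """Alignment column index of every ungapped residue, in order."""
    return [col for col, char in enumerate(gapped) if char != "-"]


def domain_span(gapped, start, end):
    """First and last alignment column covered by residues [start, end)."""
    cols = residue_columns(gapped)
    return cols[start], cols[end - 1]


def build_tracks(alignment, hits):
    """Group bar spans by domain so that each domain gets its own track.

    hits: iterable of (protein, domain, start, end) in ungapped coordinates.
    Returns {domain: {protein: (first_col, last_col)}}.
    """
    tracks = {}
    for name, domain, start, end in hits:
        tracks.setdefault(domain, {})[name] = domain_span(alignment[name], start, end)
    return tracks


# Hypothetical data for illustration only.
alignment = {"human": "M-KVLA", "opossum": "MRK-LA"}
hits = [("human", "DomA", 1, 4), ("opossum", "DomA", 2, 4), ("human", "DomB", 0, 2)]
tracks = build_tracks(alignment, hits)
```

Because the spans are expressed in alignment columns, bars for the same domain line up across proteins even when the underlying residue indices differ, and a protein missing from a track (here, opossum in DomB) is immediately visible.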
The alignment overview can be useful for drawing general conclusions about some parts of the alignment. Often, however, this view will motivate the user to look more closely at an alignment subset. MSAVis allows the user to drag a rectangle around the area of interest to explore the conserved domains' relationship to the alignment in greater detail.
More detailed views
Users can scroll through the alignment using the scroll button on their mouse as well as the arrow keys on the keyboard. Additionally, users can click and drag the sequence labels on the left to move that sequence higher or lower in the stack, for closer analysis of a subset of the sequences, while still keeping information about all the sequences available. Similarly, domains may be removed entirely from the display by deselecting their checkboxes in the list of CDs near the top of the interface (Figure 3). Finally, users can easily zoom back out to the full view by clicking the right mouse button.
Visual and interaction design
Considerable care was taken in the visual design of MSAVis. For the proteins, unique, equal-brightness colors were chosen; we chose ten hues roughly evenly separated in the perceptually uniform CIE L*a*b* color space. In other words, each of the colors is equally visually distinguishable (barring color-blindness). The colors are the same brightness so that we can independently use luminance to encode the amount of conservation. Both of these visual cues are pre-attentive and separable, so conservation strength and CD membership can be seen at a glance simultaneously.
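The idea of fixing hue per protein and varying only luminance with conservation can be illustrated with a small, dependency-free sketch. It places hues on a circle in the a*b* plane at constant chroma, sets L* from the conservation value, and converts to sRGB with the standard D65 formulas. The chroma value, the L* range, and the function names are illustrative assumptions; the text does not give the actual values used by MSAVis.

```python
import math

# D65 reference white used by the CIE XYZ <-> L*a*b* conversion.
WHITE = (0.95047, 1.0, 1.08883)


def lab_to_srgb(L, a, b):
    """Convert CIE L*a*b* (D65) to gamma-encoded sRGB, clipped to [0, 1]."""
    fy = (L + 16.0) / 116.0
    fx = fy + a / 500.0
    fz = fy - b / 200.0

    def f_inv(t):
        delta = 6.0 / 29.0
        return t ** 3 if t > delta else 3 * delta ** 2 * (t - 4.0 / 29.0)

    x, y, z = (w * f_inv(t) for w, t in zip(WHITE, (fx, fy, fz)))
    # Linear sRGB from XYZ using the standard sRGB matrix.
    r = 3.2406 * x - 1.5372 * y - 0.4986 * z
    g = -0.9689 * x + 1.8758 * y + 0.0415 * z
    bl = 0.0557 * x - 0.2040 * y + 1.0570 * z

    def encode(c):
        c = min(max(c, 0.0), 1.0)  # out-of-gamut values are simply clipped
        return 12.92 * c if c <= 0.0031308 else 1.055 * c ** (1 / 2.4) - 0.055

    return tuple(encode(c) for c in (r, g, bl))


def protein_hues(n=10, chroma=40.0):
    """Return n (a*, b*) pairs evenly spaced around the hue circle."""
    return [(chroma * math.cos(2 * math.pi * i / n),
             chroma * math.sin(2 * math.pi * i / n)) for i in range(n)]


def cell_color(hue_ab, conservation, l_min=35.0, l_max=75.0):
    """Hue identifies the protein; L* rises with conservation in [0, 1]."""
    L = l_min + (l_max - l_min) * conservation
    return lab_to_srgb(L, *hue_ab)


hues = protein_hues()
weak = cell_color(hues[0], 0.0)
strong = cell_color(hues[0], 1.0)
```

Because every protein shares the same L* at a given conservation value, a bright cell reads as "strongly conserved" regardless of which hue it carries, which is the separability property the design relies on.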
The interaction with the interface also used cognitive principles in its design. Exploration of data is cyclic; thus, interfaces that facilitate overview-to-specific interactions mesh well with common exploration schemas. Users can iteratively test hypotheses of the relationship between the alignment and the conserved domains – a user can zoom in on the gap in DNA_methylase first (Figure 6), and then examine the dissimilarity exhibited by COG1092 later. In addition, since our users will primarily be doing comparisons among proteins, the grouping of proteins into blocks facilitates this comparison. If a user wants to compare CDs, we do allow reordering of the CDs so two domains may be compared side-by-side. While this is not an ideal arrangement, it is sufficient for pair-wise investigation.
MSAVis is implemented in Python and uses BioPython to load pre-aligned sequences in ClustalW or PHYLIP format and to calculate the sequence conservation at each location. These sequences are then processed live via the NCBI CD database to find conserved domains. It is cross-platform, using PyOpenGL and wxPython for its rendering and window management. We are currently considering a web-based option to interface with AgBase at Mississippi State; the software is currently available from the AgBase website.
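To make the loading and scoring step concrete, the sketch below parses a ClustalW-style alignment and computes a simple per-column conservation value. MSAVis uses BioPython for this (its `Bio.AlignIO` module reads both formats); the hand-written parser here is only a standard-library stand-in so that the example runs without dependencies, and it handles only the common interleaved layout. The conservation measure (fraction of rows sharing the most common residue) is an assumption, since the text does not state which score MSAVis computes.

```python
from collections import Counter


def parse_clustal(text):
    """Parse ClustalW-style text into {name: gapped sequence} in file order."""
    sequences = {}
    for line in text.splitlines():
        # Skip blank lines, the header, and consensus lines (leading space).
        if not line.strip() or line.startswith("CLUSTAL") or line[0].isspace():
            continue
        fields = line.split()
        name, chunk = fields[0], fields[1]  # a trailing residue count is ignored
        sequences[name] = sequences.get(name, "") + chunk
    return sequences


def conservation_per_column(sequences):
    """Fraction of rows sharing the most common residue; gaps count against."""
    scores = []
    for column in zip(*sequences.values()):
        residues = [c for c in column if c != "-"]
        top = Counter(residues).most_common(1)[0][1] if residues else 0
        scores.append(top / len(column))
    return scores


# Hypothetical two-block alignment for illustration only.
example = """CLUSTAL W (1.83) multiple sequence alignment

human      MKV-L
chimp      MKVAL
opossum    MRI-L
           *:: *

human      QQ
chimp      QQ
opossum    QE
           *:
"""

parsed = parse_clustal(example)
scores = conservation_per_column(parsed)
```

The resulting `scores` list has one value per alignment column and is the kind of input a luminance mapping would consume.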
Results and discussion
MSAVis provides a single view for presenting alignment conservation and the presence or absence of a conserved domain over a group of proteins. Unlike Jalview or the CDD, it depicts multiple conserved domains that may coincide over several proteins. This facilitates comparison of the conserved domains across species with less effort than previously possible. MSAVis has both an overview and a zoomed-in view of the sequence (like Jalview), but contains these within a single display that can be dynamically navigated (unlike Jalview or the CDD). The overall goal of MSAVis is to reduce the time required to explore the relationship between multiple proteins and their conserved domains, and we have achieved this by providing a compact view of this information and by reducing the number of interactions with the data. For example, comparing the five conserved domains over the nine proteins given in Figure 3 would require nine views within the CDD (one for each protein) or five views in Jalview (one for each domain).
Feedback from our biologist colleagues has been positive, and we are deploying the tool for their use. Currently, the tool is for browsing only: It does not allow editing of the sequences or changing the alignments as Jalview and other tools do. Per the request of our collaborators, we are looking into this feature for the future.
Exploration using MSAVis is interactive; zooming in or out takes less than a second. Online access to the CDD and processing the domains when loading takes the most time (roughly a minute for the example presented in this paper).
Our tool provides a unique approach to the simultaneous display of the alignment and conserved domains that is not currently found in widely available tools such as MEME, GeneDoc, and BioEdit. MEME [25, 26] is primarily a tool for discovering sequence motifs but also provides a number of methods for displaying the discovered motifs (the information content at each position in the pattern, a sequence logo of the motif, and a neighbor-joining tree of the motif). The output most similar to that provided by our tool is a blocks diagram of the discovered motifs on the training set sequences; MEME does not support simultaneously viewing motifs and the alignment. GeneDoc [27, 28] provides a set of tools for visualizing, editing, and analyzing multiple sequence alignments of both proteins and nucleic acids in an evolutionary context. These tools provide two residue display modes (display all or display only those different from the master sequence) that can be combined with a number of different gray-scale or color shading modes. There are also options allowing the user to define grouping and shading options that could be used with conserved domains imported from an external source such as NCBI. However, there is no easy mechanism for simultaneous visualization of domains and alignment strength, nor for providing both detailed and high-level views of the alignment. BioEdit is primarily a sequence editing program but also allows the user to import and display features from GenBank or GenPept sequences, including the region feature that often corresponds to a conserved domain. However, all regions will be displayed at one time in one color on a single alignment. Our approach focuses on a small part of the larger protein sequence analysis problem that these popular tools address.
While MSAVis has proven successful for jointly visualizing sequence alignment and functional domain data, it has its limitations. As mentioned previously, we only have a limited set of unique hues for proteins; after ten, they begin to repeat. We made an attempt to stagger the colors so that similar hues are not next to each other, but this becomes more difficult as more than 12 proteins are added. In addition, as the number of conserved domains increases, the space used by the checkbox toggles increases; we are investigating a more space-efficient presentation for the CD toggles.
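One simple way to stagger a fixed palette is to step through the hue circle with a stride that is coprime with the palette size, so adjacent rows never receive neighboring hues. The sketch below shows this idea; it is an assumption made for illustration, since the text does not describe the staggering scheme MSAVis actually uses, and the function name and default stride are hypothetical.

```python
from math import gcd


def staggered_hue_indices(n_proteins, n_hues=10, stride=3):
    """Assign each protein row a hue index `stride` steps from its neighbor."""
    if gcd(stride, n_hues) != 1:
        raise ValueError("stride must be coprime with n_hues to use every hue")
    return [(row * stride) % n_hues for row in range(n_proteins)]


order = staggered_hue_indices(12)
```

With ten hues and twelve proteins, the first ten rows each receive a distinct hue and the last two rows reuse the hues of the first two, which illustrates why the display becomes harder to read once the palette is exhausted.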
Our prototype application demonstrates a new method for displaying multiple sequence alignments and conserved domains at different levels of detail. Both the alignment and the location of CDs can be viewed at a glance, and interactive exploration facilitates understanding of their interrelationship. We have taken a principled approach using results from visualization design to create an effective visualization. Initial feedback from our biology colleagues has been positive, and we are currently exploring their data for further system improvements.
As protein sequences become available for many non-model species, the need for mechanisms to analyze changes in conserved motifs will increase. For example, the availability of the opossum and platypus genomes provided insight into the functional motifs of mammalian genes. MSAVis provides a rapid method for investigating how these motifs are different from each other as more organisms are added.
In the future, we plan to investigate methods for displaying additional layers of information on the display, such as predicted DNA binding sites; some of this data is already present in CDD views but only for one protein at a time. In addition, we will explore a web-based version for distribution. More sophisticated methods for scoring the alignment, such as information-theoretic approaches, may also be explored to enhance the visual presentation. Finally, as mentioned, our collaborators have expressed the desire to be able to edit sequences; by editing the sequence and updating the conserved domain display, the accuracy of a sequence or its relationship to its domain might be better understood.
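As an illustration of what an information-theoretic score could look like, the sketch below computes a normalized Shannon-entropy conservation value for one alignment column: 1.0 when a single residue fills the column, decreasing as the residues become more varied. This is one common formulation chosen for the example, not a method the authors have committed to; the function name, the treatment of gaps, and the 20-letter alphabet normalization are assumptions.

```python
import math
from collections import Counter


def entropy_conservation(column, alphabet_size=20):
    """Return 1 - H/H_max for a column, ignoring gaps ('-').

    H is the Shannon entropy (in bits) of the residue frequencies and
    H_max = log2(alphabet_size) is the entropy of a uniform column.
    """
    residues = [c for c in column if c != "-"]
    if not residues:
        return 0.0
    total = len(residues)
    entropy = -sum((n / total) * math.log2(n / total)
                   for n in Counter(residues).values())
    return 1.0 - entropy / math.log2(alphabet_size)


fully_conserved = entropy_conservation("MMMM")
split_column = entropy_conservation("MMKK")
```

A column split evenly between two residues has one bit of entropy, so it scores lower than a fully conserved column but well above a column in which every residue differs.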
Availability and requirements
Project name: MSAVis
Project home page: http://agbase.msstate.edu/tools/MSAVis.html
Operating system(s): Platform independent
Programming language: Python
Other requirements: Python 2.4–2.6, BioPython 1.5 or later, PyOpenGL 2.09 or later, wxPython 2.8 or later
Any restrictions to use by non-academics: Contact authors
This research was supported by an NSF EPSCoR award, grant number EPS-0556308.
This article has been published as part of BMC Bioinformatics Volume 10 Supplement 11, 2009: Proceedings of the Sixth Annual MCBIOS Conference. Transformational Bioinformatics: Delivering Value from Genomes. The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/10?issue=S11.
- Larkin M, Blackshields G, Brown N, Chenna R, McGettigan P, McWilliam H, Valentin F, Wallace I, Wilm A, Lopez R, Thompson J, Gibson T, Higgins D: ClustalW and ClustalX version 2. Bioinformatics 2007, 23(21):2947–2948. 10.1093/bioinformatics/btm404
- A Conserved Domain Database and Search Service [http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml]
- Marchler-Bauer A, Anderson J, Cherukuri P, DeWeese-Scott C, Geer L, Gwadz M, He S, Hurwitz D, Jackson J, Ke Z, Lanczycki C, Liebert C, Liu C, Lu F, Marchler G, Mullokandov M, Shoemaker B, Simonyan V, Song J, Thiessen P, Yamashita R, Yin J, Zhang D, Bryant S: CDD: A Conserved Domain Database for protein classification. Nucleic Acids Research 2005, (33 Database):D192-D196.
- Lungu M, Xu K: Biomedical Information Visualization. In Human-Centered Visualization Environments, Volume 4417 of Lecture Notes in Computer Science. Edited by: Kerren A, Ebert A, Meyer J. Springer; 2006, 311–342. [http://dx.doi.org/10.1007/978-3-540-71949-6_8]
- Schneider TD, Stephens RM: Sequence Logos: A New Way to Display Consensus Sequences. Nucleic Acids Research 1990, 18: 6097–6100. 10.1093/nar/18.20.6097
- Slack J, Hildebrand K, Munzner T, John KS: SequenceJuxtaposer: Fluid Navigation For Large-Scale Sequence Comparison in Context. In Proceedings of the German Conference on Bioinformatics. Edited by: Giegerich R, Stoye J. 2004, 53: 37–42.
- Mayor C, Brudno M, Schwartz JR, Poliakov A, Rubin EM, Frazer KA, Pachter LS, Dubchak I: VISTA: Visualizing Global DNA Sequence Alignments of Arbitrary Length. Bioinformatics 2000, 16(11):1046–1047. 10.1093/bioinformatics/16.11.1046
- Shah N, Couronne O, Pennacchio LA, Brudno M, Batzoglou S, Bethel EW, Rubin EM, Hamann B, Dubchak I: Phylo-VISTA: Interactive visualization of multiple DNA sequence alignments. Bioinformatics 2004, 20(5):636–643. 10.1093/bioinformatics/btg459
- Spell R, Brady R, Dietrich F: BARD: A visualization tool for biological sequence analysis. IEEE Symposium on Information Visualization 2003. 2003, 219–225.
- Hunter S, Apweiler R, Attwood TK, Bairoch A, Bateman A, Binns D, Bork P, Das U, Daugherty L, Duquenne L, Finn RD, Gough J, Haft D, Hulo N, Kahn D, Kelly E, Laugraud A, Letunic I, Lonsdale D, Lopez R, Madera M, Maslen J, McAnulla C, McDowall J, Mistry J, Mitchell A, Mulder N, Natale D, Orengo C, Quinn AF, Selengut JD, Sigrist CJA, Thimma M, Thomas PD, Valentin F, Wilson D, Wu CH, Yeats C: InterPro: The integrative protein signature database. Nucleic Acids Research 2009, (37 Database):D211-D215. 10.1093/nar/gkn785
- Consortium TU: The Universal Protein Resource (UniProt). Nucleic Acids Research 2008, (36 Database):D190-D195.
- Clamp M, Cuff J, Searle S, Barton G: The Jalview Java alignment editor. Bioinformatics 2004, 20(3):426–427. 10.1093/bioinformatics/btg430
- Felsenstein J: PHYLIP. [http://evolution.genetics.washington.edu/phylip.html]
- Henikoff S, Henikoff JG: Amino acid substitution matrices from protein blocks. Proc Natl Acad Sci USA 1992, 89(22):10915–10919. 10.1073/pnas.89.22.10915
- McLaren K: The development of the CIE 1976 (L*a*b*) uniform colour-space and colour-difference formula. Journal of the Society of Dyers and Colourists 1976, 92: 338–341.
- Ware C: Information Visualization: Perception for Design. 2nd edition. Morgan Kaufmann; 2004.
- Pirolli P, Card SK: Information Foraging. Psychological Review 1999, 106(4):643–674. 10.1037/0033-295X.106.4.643
- Cock P, Antao T, Chang J, Chapman B, Cox C, Dalke A, Friedberg I, Hamelryck T, Kauff F, Wilczynski B, de Hoon M: Biopython: Freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics 2009, 25(11):1422–1423. 10.1093/bioinformatics/btp163
- McCarthy F, Wang N, Magee G, Nanduri B, Lawrence M, Camon E, Burrell D, Hill D, Dolan M, Williams W, Luthe D, Bridges S, Burgess S: AgBase: A Functional Genomics Resource for Agriculture. BMC Genomics 2006, 7:229.
- Bailey TL, Boden M, Buske FA, Frith M, Grant CE, Clementi L, Ren J, Li WW, Noble WS: MEME SUITE: Tools for motif discovery and searching. Nucleic Acids Research 2009. [http://dx.doi.org/10.1093/nar/gkp335]
- The MEME Suite [http://meme.nbcr.net/]
- Nicholas KB, Nicholas HB, Deerfield DW: GeneDoc: Analysis and visualization of genetic variation. EMBNEW NEWS 1997, 4: 14.
- Hall T: BioEdit Sequence Alignment Editor. [http://www.mbio.ncsu.edu/BioEdit/bioedit.html]
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.