SeqVISTA: a graphical tool for sequence feature visualization and comparison
© Hu et al 2003
Received: 15 September 2002
Accepted: 4 January 2003
Published: 4 January 2003
Many readers will sympathize with the following story. You are viewing a gene sequence in Entrez, and you want to find whether it contains a particular sequence motif. You reach for the browser's "find in page" button, but those darn spaces every 10 bp get in the way. And what if the motif is on the opposite strand? Subsequently, your favorite sequence analysis software informs you that there is an interesting feature at position 13982–14013. By painstakingly counting the 10 bp blocks, you are able to examine the sequence at this location. But now you want to see what other features have been annotated close by, and this information is buried several screenfuls higher up the web page.
SeqVISTA presents a holistic, graphical view of features annotated on nucleotide or protein sequences. This interactive tool highlights the residues in the sequence that correspond to features chosen by the user, and allows easy searching for sequence motifs or extraction of particular subsequences. SeqVISTA is able to display results from diverse sequence analysis tools in an integrated fashion, and aims to provide much-needed unity to the bioinformatics resources scattered around the Internet. Our viewer may be launched on a GenBank record by a single click of a button installed in the web browser.
SeqVISTA allows insights to be gained by viewing the totality of sequence annotations and predictions, which may be more revealing than the sum of their parts. SeqVISTA runs on any operating system with a Java 1.4 virtual machine. It is freely available to academic users at http://zlab.bu.edu/SeqVISTA.
A significant portion of modern biological research involves the identification of the biochemical and biological functions associated with one or multiple positions of a sequence. Numerous databases have been constructed to store these sequence regions and their associated functions, defined as sequence features. An example compilation of such databases is available at http://zlab.bu.edu/~mfrith/tools.shtml. Common features for DNA sequences include introns, exons, 3' or 5' untranslated regions, transcription start sites, cis-elements and other protein binding sites, repeats, low complexity regions and single nucleotide polymorphisms (SNPs). Protein sequence features include secondary structures (α-helices and β-strands), transmembrane regions, and post-translational modifications such as phosphorylation and glycosylation sites. There can be dozens of features associated with a single sequence. Frequently, features can be nested; for example, a SNP can reside within a cis-element, which can be in an intron. Therefore, it is extremely difficult for a text record in a database to reveal all of the salient features of a sequence to the user in an intuitive fashion.
The human genome project has motivated substantial scientific and technological developments in sequencing large eukaryotic genomes. Among the many tools developed in the course of the project, web-based graphical viewers facilitate the search, display and retrieval of sequences and annotations associated with a genome. Such viewers are typically integrated with the databases that store the genomes. They are not only extremely important for delivering the final results of a sequencing project to lab-bench biologists but also indispensable in the assembly and annotation of genome drafts, since assorted evidence must be integrated. Three well-known genome viewers are available for the public working draft of the human genome: the viewers developed by the Ensembl project (http://www.ensembl.org), the human genome browser at UCSC (http://genome.ucsc.edu) and the NCBI map viewer (http://www.ncbi.nlm.nih.gov). Genome viewers typically focus above the gene level, and are most commonly used to search for evidence of novel genes. VISTA is another user-friendly program for visualizing the alignment of very long DNA sequences. With the rapid enrichment of annotated sequence features, there is a need for sequence viewers at the nucleotide or amino acid level, targeting lab-bench experimentalists. An example of such a sequence viewer, viewGene, focuses on polymorphism visualization.
Computational analysis of DNA and protein sequences is among the most frequently encountered activities in bioinformatics research. Computational tools for sequence analysis often specialize in producing only one kind of feature, frequently as text output. The most widely used tools detect genes [5–8], cis-elements and general promoter regions [9–12], repeats [13, 14], protein secondary structures [15–17] and protein transmembrane helices [18, 19]. Currently, no visualization tool makes it easy to compare the output of a sequence analysis program with the annotations of the same sequence stored in a database, or to compare the outputs of multiple analysis programs. THEATRE is an attempt to combine the sequence features produced by widely used sequence analysis tools or sequence databases; however, it only produces a static PostScript graph.
We have developed SeqVISTA with the express goal of facilitating the visualization of sequence features in annotation records such as those of GenBank and Swiss-Prot, as well as the comparison of multiple sequence feature sets, produced by different sequence analysis programs, with the annotation record. We take advantage of the observation that all sequence features are indexed with one or several positions of the sequence, and construct a coherent framework for the representation of virtually unlimited feature types and feature sets. SeqVISTA can be a general platform for integrating numerous sequence analysis tools, and thus remove the need to develop program-specific visualization software. More importantly, with careful programming design and implementation, SeqVISTA targets the broad community of experimental biologists. All features are linked directly and dynamically to the sequence itself, and the user is presented with a global view of the most salient features. Furthermore, the user can extract any feature-containing sequence region easily and precisely for further experiments.
The three panels of SeqVISTA collectively present all aspects of an annotated sequence. The tree panel emphasizes the organization of features. The graphics panel focuses on the locations and sequence lengths of features. The sequence panel illustrates the sequence details. The three panels are dynamically linked in two ways. First, each type of feature adopts the same color in all panels. Second, if the user selects a feature in one panel by clicking it, the corresponding feature or sequence region in the other two panels is highlighted accordingly (the sequence panel scrolls automatically to show the highlighted region if it was not already visible). For clarity in the graphics panel, as well as for accommodating nested features, we allow any type of feature to be hidden (by right-clicking any instance of the feature in the tree panel). Hidden features are shown in parentheses in the tree panel (for example, intron in Figure 1), and they are not shown in the graphics panel. The nucleotides or amino acids in the sequence panel are colored according to the outermost layer of non-hidden features. SeqVISTA responds to all user requests, such as selecting or hiding features, by updating the display in all panels accordingly. The user can also mouse over a feature to obtain its annotation without updating the display in other panels (e.g., the Alu repeat in Figure 1). In every panel, the user can output the content of the entire panel as a color image in JPEG format. The user can also save the image in the graphics panel at higher resolution (300 or 600 dpi), suitable for direct incorporation into scientific publications.
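The coloring rule for nested features can be illustrated with a short Python sketch. It is not taken from the SeqVISTA source; the `Feature` class and the `outermost_kinds` function are hypothetical names, and the sketch assumes that "outermost" means the longest visible feature covering a position.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feature:
    kind: str   # feature type, e.g. "intron" (illustrative)
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive


def outermost_kinds(length, features, hidden=frozenset()):
    """Return, for each position, the type of the outermost visible feature.

    Assumption: the outermost feature at a position is the longest
    non-hidden feature that covers it. Positions with no visible
    feature map to None.
    """
    kinds = [None] * length
    visible = [f for f in features if f.kind not in hidden]
    # Paint short features first so that longer, enclosing ones overwrite them.
    for f in sorted(visible, key=lambda f: f.end - f.start):
        for pos in range(max(f.start, 1), min(f.end, length) + 1):
            kinds[pos - 1] = f.kind
    return kinds


# A SNP inside a cis-element inside an intron, on a 14-residue sequence.
features = [
    Feature("intron", 3, 12),
    Feature("cis-element", 5, 9),
    Feature("SNP", 7, 7),
]
print(outermost_kinds(14, features))
print(outermost_kinds(14, features, hidden={"intron"}))
```

With all features visible, positions 3 to 12 take the intron type; hiding the intron exposes the cis-element on positions 5 to 9, which mirrors the hide-and-recolor behavior described above.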
An important goal of our design is to make SeqVISTA extremely friendly to experimentalists. An experimentalist frequently needs to locate a fragment in a long sequence according to the starting and ending coordinates of the fragment. Manually counting the positions is a laborious and error-prone process. We have developed several functions in the sequence panel of SeqVISTA to render this task effortless:
1. The user can input the start and end coordinates and the corresponding fragment will be highlighted.
2. The user can also search using the sequence of a fragment, with the option of searching both the forward and the reverse strands of a DNA sequence.
3. We also accept regular expressions for the search, if the exact sequence of the fragment is not available.
4. The user can highlight a region and search for more occurrences in the entire sequence.
All these functions can be operated entirely by copying and pasting with the mouse, which avoids manually typing sequences. They can be invoked by right-clicking in the sequence panel. They are also available from the Edit tab of the top menu bar.
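The coordinate and motif searches listed above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not SeqVISTA's implementation: the function names (`clean`, `fragment`, `find_motif`) are hypothetical, coordinates are taken to be 1-based and inclusive, and reverse-strand hits are reported in forward-strand coordinates.

```python
import re

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def clean(raw):
    """Remove the position numbers and whitespace of a flat-file sequence listing."""
    return re.sub(r"[\d\s]", "", raw).upper()


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def fragment(seq, start, end):
    """Extract the fragment at 1-based, inclusive coordinates start..end."""
    return seq[start - 1:end]


def find_motif(seq, pattern, both_strands=True):
    """Return sorted (start, end, strand) hits for a regular expression.

    Coordinates are 1-based, inclusive, and always refer to the forward
    strand. A lookahead is used so that overlapping hits are reported.
    """
    regex = re.compile(f"(?=({pattern}))", re.IGNORECASE)
    n = len(seq)
    hits = []
    for m in regex.finditer(seq):
        hits.append((m.start() + 1, m.start() + len(m.group(1)), "+"))
    if both_strands:
        for m in regex.finditer(reverse_complement(seq)):
            s, size = m.start(), len(m.group(1))
            # Map the reverse-strand match back to forward coordinates.
            hits.append((n - (s + size) + 1, n - s, "-"))
    return sorted(hits)


# A made-up listing with the usual 10 bp blocks and position numbers.
raw = (
    "        1 aaaggatgac tcatttcccg\n"
    "       21 ta\n"
)
seq = clean(raw)
print(find_motif(seq, "GGGAAA"))      # present only on the opposite strand
print(find_motif(seq, "TGA[CG]TCA"))  # regular expression, both strands
print(fragment(seq, 14, 19))
```

Stripping the block spacing before searching addresses the "find in page" problem from the opening story, and mapping reverse-strand hits back to forward coordinates lets every hit be highlighted in the displayed sequence.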
Another user-friendly aspect of SeqVISTA is that the user can launch the program while browsing a sequence record in Internet Explorer by clicking the "SeqVISTA" button, which is added to the browser during the installation of SeqVISTA. Alternatively, the user can load several records into SeqVISTA at once by opening a text file that contains their GenBank Identification (GI) numbers. Note that even though SeqVISTA can accommodate multiple sequence records, it does not align them automatically. In the future, we plan to implement functions to integrate multiple sequence records associated with the same gene or protein.
A further example is the dynamic display of SeqVISTA functions that are specific to a sequence type. SeqVISTA can display both DNA and protein sequences, and different functions apply to each. It could be confusing if all functions were available at all times, so functions that do not apply to the current sequence type are grayed out.
One of our goals is to establish SeqVISTA as a general platform for visualizing the results of sequence analysis software and comparing them to the annotations of the same sequence stored in a public database. We have developed several means to facilitate communication among different software programs:
1. Common format. SeqVISTA accepts several formats: GenBank flat file (GBFF) format, GenBank HTML format, FASTA format, and the simple meta-data based SeqVISTA format. In the future, we plan to support EMBL format as well. The user can load a sequence record into SeqVISTA by supplying its GI number, its accession number, or its web address while viewing it at the NCBI website. The user can visualize the outputs of a sequence analysis software package using SeqVISTA, as long as they are in any of the above formats. SeqVISTA format allows the user to save multiple records and outputs into one file, which makes future viewing easy.
2. Plugin. Plugins can be developed to recognize the outputs of external software. Thus, any external software can use SeqVISTA as its graphical interface, instead of developing its own specialized graphical modules. Detailed instructions for developing SeqVISTA parser plugins and sample code can be found at the SeqVISTA web site http://zlab.bu.edu/SeqVISTA.
3. Direct query. Most widely used sequence analysis programs are available through web servers. For these programs, we can develop SeqVISTA functions to directly query a web server and retrieve results.
With all three means of retrieving results, we can display the results directly in the graphics panel to facilitate the comparison between the results and database annotations. We use three examples to illustrate the above functions of SeqVISTA.
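The parser-plugin idea can be illustrated with a minimal Python sketch. It does not reproduce the SeqVISTA plugin interface, which is documented at the SeqVISTA web site; the registry, the `register` decorator, the `toy-tab` tool name and its tab-delimited output format are all hypothetical and serve only to show how tool-specific parsers can feed one shared feature representation.

```python
from typing import Callable, Dict, List, NamedTuple


class Feature(NamedTuple):
    kind: str     # feature type reported by the tool
    start: int    # 1-based, inclusive
    end: int      # 1-based, inclusive
    score: float  # tool-specific score


# Hypothetical registry: tool name -> function that parses that tool's output.
PARSERS: Dict[str, Callable[[str], List[Feature]]] = {}


def register(tool_name):
    """Decorator that records a parser under the given tool name."""
    def wrap(func):
        PARSERS[tool_name] = func
        return func
    return wrap


@register("toy-tab")
def parse_toy_tab(text):
    """Parse an invented format: kind<TAB>start<TAB>end<TAB>score per line.

    Blank lines and lines starting with '#' are skipped.
    """
    features = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        kind, start, end, score = line.split("\t")
        features.append(Feature(kind, int(start), int(end), float(score)))
    return features


def load_features(tool_name, text):
    """Dispatch tool output to the parser registered for that tool."""
    try:
        parser = PARSERS[tool_name]
    except KeyError:
        raise ValueError(f"no parser registered for {tool_name!r}") from None
    return parser(text)


output = "# toy tool output\nSp1\t40\t49\t7.5\nEts-1\t120\t128\t5.25\n"
for feature in load_features("toy-tab", output):
    print(feature)
```

Because every parser returns the same position-indexed records, a viewer only needs one display path regardless of which analysis program produced the features.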
RepeatMasker screens an input DNA sequence against a library of repetitive elements (A.F.A. Smit & P. Green, unpublished data). The program produces a list of identified repeats, their locations in the input sequence, their match scores and three quantities associated with the scores: % substitution, % insertion and % deletion. The repeats identified by RepeatMasker can be displayed as though they were sequence features, and the various scores can be displayed as bar graphs. We have developed a SeqVISTA function to query the web server of RepeatMasker directly. The user only needs to right-click in the graphics panel and choose the RepeatMasker option in the nucleotide analysis tab; SeqVISTA will then submit the sequence being viewed to the RepeatMasker server, retrieve the results and display them.
PSIPRED is one of the most accurate programs for predicting protein secondary structures (α-helices and β-strands). For each position of the input sequence, it produces a confidence score (between 0 and 9) that the position is in a secondary structure state or, otherwise, in the coil state. An α-helix is a contiguous stretch of positions in the α-helix state, and likewise for a β-strand. PSIPRED accepts an input sequence in a web form, and emails the results back to the user.
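Turning per-position states into features amounts to collapsing runs of identical states into segments. The Python sketch below illustrates this under stated assumptions: states are encoded as the letters H (α-helix), E (β-strand) and C (coil), the `segments` function is a hypothetical name, and reporting the mean confidence of each segment is an illustrative choice rather than something described for SeqVISTA.

```python
from itertools import groupby


def segments(states, confidences, coil="C"):
    """Collapse per-residue states into (state, start, end, mean_confidence).

    Coordinates are 1-based and inclusive. Coil runs are skipped, so each
    returned segment is one contiguous helix or strand.
    """
    if len(states) != len(confidences):
        raise ValueError("states and confidences must have the same length")
    result = []
    pos = 1
    for state, run in groupby(states):
        size = len(list(run))
        if state != coil:
            scores = confidences[pos - 1:pos - 1 + size]
            result.append((state, pos, pos + size - 1, sum(scores) / size))
        pos += size
    return result


# Made-up prediction for a 13-residue sequence.
states = "CCHHHHHCCEEEC"
confidences = [9, 8, 5, 7, 9, 9, 6, 7, 8, 4, 8, 6, 9]
print(segments(states, confidences))
```

Each resulting segment has a type, a start and an end, so it can be handled like any other position-indexed sequence feature.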
Cister is a program that accepts a genomic sequence and a set of cis-element weight matrices and computes the locations of cis-elements and their clusters in the genomic sequence. The outputs of Cister include the scores and locations of cis-elements, and a graph, which plots the probability that a position of the input sequence is in a cis-element cluster. It is helpful for the user of Cister to compare the predicted cis-element locations with the GenBank record of the input sequence, which could contain the experimentally determined promoter region, as well as known cis-elements. Such a comparison would still be valuable even if the promoter region or cis-element locations are not included in the GenBank record, since knowing regions such as exons or repeats could help in analyzing the Cister output.
We have developed a SeqVISTA plugin to interpret the output files of Cister. A user can then add a Cister output while visualizing the GenBank record of the sequence in SeqVISTA. Figure 4 illustrates the Cister output for the SV40 genome. Three figures are added to the graphics panel of SeqVISTA. The top figure includes the predicted locations of three cis-element types (LSF, Sp1 and Ets-1) in the SV40 genome. These predicted cis-elements are treated as typical sequence features; therefore, they are included in the tree panel as well as linked to the sequence panel. The middle bar graph in the graphics panel indicates the scores of predicted cis-elements. The bottom figure plots the probability that each base in the SV40 genome is in a cluster composed of these cis-element types, judged according to the strengths of individual cis-elements and their local concentration. By juxtaposing the Cister output and the GenBank annotation of the same sequence, the user can easily examine the context of the Cister predictions and select the most plausible ones for experimental testing.
We have developed a sequence visualization tool called SeqVISTA. It focuses on the detailed base-by-base or residue-by-residue level of a sequence and its annotations. Our first goal is to enable a user to grasp the most salient features of a sequence at a glance, and extract the corresponding bases or residues precisely and painlessly. To this end, we have made a conscious effort to make the user interface of SeqVISTA simple, intuitive and coherent. While searching GenBank, the user can load a record into SeqVISTA by a single click. The user can also save all contents of a SeqVISTA window as publication quality images. Our second goal is to establish SeqVISTA as a general platform for visualizing the results of sequence analysis software, as well as for comparing these results to the annotations of the same sequence. We have devised three ways to achieve this goal: common file format, parser plugin and direct query of software with a web server. SeqVISTA is written in Java and has been extensively tested on Windows and Linux. It should run on any operating system with a Java 1.4 virtual machine. It is freely available to academic users at http://zlab.bu.edu/SeqVISTA.
http://genome.ucsc.edu, the UCSC genome browser.
http://www.ensembl.org, the Ensembl genome viewer.
http://www.ncbi.nlm.nih.gov, Entrez map viewer at NCBI.
http://www-gsd.lbl.gov/vista/, VISTA (visualization tools for alignments).
We thank Kevin Wiehe for proofreading the manuscript. This work was funded by NSF grant DBI-0078194 and NIH grant 1P20GM066401-01.
This article is published under license to BioMed Central Ltd. This is an Open Access article: verbatim copying and redistribution of this article are permitted in all media for any purpose, provided this notice is preserved along with the article's original URL.