MultiSeq: unifying sequence and structure data for evolutionary analysis

Background Since the publication of the first draft of the human genome in 2000, bioinformatic data have been accumulating at an overwhelming pace. Currently, more than 3 million sequences and 35 thousand structures of proteins and nucleic acids are available in public databases. Finding correlations in and between these data to answer critical research questions is extremely challenging. This problem needs to be approached from several directions: information science to organize and search the data; information visualization to assist in recognizing correlations; mathematics to formulate statistical inferences; and biology to analyze chemical and physical properties in terms of sequence and structure changes. Results Here we present MultiSeq, a unified bioinformatics analysis environment that allows one to organize, display, align and analyze both sequence and structure data for proteins and nucleic acids. While special emphasis is placed on analyzing the data within the framework of evolutionary biology, the environment is also flexible enough to accommodate other usage patterns. The evolutionary approach is supported by the use of predefined metadata, adherence to standard ontological mappings, and the ability for the user to adjust these classifications using an electronic notebook. MultiSeq contains a new algorithm to generate complete evolutionary profiles that represent the topology of the molecular phylogenetic tree of a homologous group of distantly related proteins. The method, based on the multidimensional QR factorization of multiple sequence and structure alignments, removes redundancy from the alignments and orders the protein sequences by increasing linear dependence, resulting in the identification of a minimal basis set of sequences that spans the evolutionary space of the homologous group of proteins. Conclusion MultiSeq is a major extension of the Multiple Alignment tool that is provided as part of VMD, a structural visualization program for analyzing molecular dynamics simulations. Both are freely distributed by the NIH Resource for Macromolecular Modeling and Bioinformatics and MultiSeq is included with VMD starting with version 1.8.5. The MultiSeq website has details on how to download and use the software:


Background
In the field of bioinformatics, research activities are often split into two distinct areas: sequence analysis or structure analysis. Genomic and other sequencing projects generate enormous amounts of sequence data that are initially released with large portions annotated as either putative or hypothetical. Structural data, even in the era of structural genomics, are produced at a slower pace but analyzed to a high degree before being deposited into the public databases, such as PDB [1], SCOP [2,3]/Astral [4], CATH [5] and NDB [6]. This difference in pace has led to an increasing discrepancy in the relative sizes of these two data sets. The total size of the sequence databases (NCBI [7,8], EMBL [9], DDBJ [10], JGI [11], RDP [12], Swiss-Prot/TrEMBL [13], CRW [14], Bayreuth tRNA compilations [15], and the Genomic tRNA Database [16]) is several orders of magnitude greater than that of the structure databases. For a given protein, the large number of available sequences allows more complete evolutionary analyses. Multiple sequence alignments (MSA) are instrumental in identifying key conserved areas of a sequence, developing an evolutionary history of a molecule [17], and examining the covariance within a sequence in response to evolutionary pressure [18]. These analyses depend on having enough sequences to perform a well-balanced statistical analysis. The advantages of structural data are that they provide much more detailed information about the molecule in question, allowing specific atomic level interactions to be analyzed. Additionally, since structure is more conserved than sequence [19], structural data can be used to reconstruct many of the deeper evolutionary branches that would be difficult or impossible to determine with sequence data alone [20][21][22][23].
VMD [24], which currently has more than 30,000 registered users, provides powerful visualization and analysis capabilities for both structural and dynamics data generated from molecular dynamics simulations, as well as energetics derived from molecular mechanics force fields. It is optimized to handle large scale systems containing millions of atoms. VMD also implements a flexible scripting interface for the creation of custom tools. The previous Multiple Alignment [25] extension to VMD added only the capability to use evolutionary information obtained from multiple structures for interpreting structural results. Our goal with MultiSeq is to extend VMD's capabilities further by incorporating the more diverse evolutionary data available in sequences into the analysis process.
There are already a large number of tools available to analyze bioinformatic data, but, like the field itself, they are mostly segregated into either sequence or structure tools. In the sequence world there are tools for viewing, analyzing, and editing an MSA like AE2 [26], CINEMA [27], ClustalX [28], and Jalview [29]; there are tools for creating MSAs by aligning individual sequences and profiles, like ClustalW [30], HMMER [31] and T-Coffee [32]; there are tools for annotating sequence data such as Pfaat [33]; and BLAST [34] is used for searching through databases for related sequences. One popular package, MEGA3 [35], provides an evolutionary approach to analyzing protein and nucleic acid sequences, including many easy to use features for determining sequence based phylogenies. Similarly, in the structure world there are numerous tools for visualizing structural data and performing structural analyses, including RASMOL [36], STAMP [37], STRIDE [38], and 3DNA [39].
There are also a few programs that combine sequence and structure data either for specific purposes, such as Swiss-PdbViewer/SWISS-MODEL [40] and MolIDE/SCWRL3 [41,42] for homology modeling, or as part of a pre-computed database of attributes, of which STING [43] is the primary example. Modeller [44] allows structural features to be built using sequences and structures of homologous proteins and nucleic acids. UCSF Chimera [45], a well known molecular modeling program that originated to handle small molecule docking, provides the ability to use sequence data in conjunction with structural data. However, it lacks some of the features needed to perform wellbalanced evolutionary analyses, such as phylogenetic tree construction and elimination of bias. Friend [46], a bioinformatics application, has many of the sequence features required for performing evolutionary analyses, but has insufficient structural functionality to fully interpret the results in a structural context. InsightII and Discovery Studio (Accelrys Inc.) and MOE (Chemical Computing Group Inc.) are popular commercial packages for analyzing protein/drug interactions based on protein structure, dynamics, and energetics. Both programs can also use sequence data to perform combined analyses. NCBI's Cn3D [47] also supports both sequence and structure data, although it is primarily designed for use with precomputed 3D superpositions and MSAs.
The complementary information provided by fusion of these two data sources can give insight into evolutionary changes in sequence, structure, and function. However, the conceptual spaces of these fields differ, often resulting in mutual incomprehensibility to researchers in each field. MultiSeq, in dealing with both sequence and structure data in a way that is accessible to both areas, helps to bridge this gap. It does so through the use of integrated cross-referencing that acts as an informal version of ontology-driven knowledge extraction and discovery [48]. We plan to enhance future versions of MultiSeq to incorporate formal ontological methods, including using the work of groups such as the Gene Ontology project [49].
There is tremendous utility to be had in combining both sequence and structure data within an evolutionary framework using the four pillars of information science, information visualization, mathematics, and biology to organize the flow of information. An evolutionary profile (EP) is a concise and complete representation of the diversity that has been generated by the evolutionary process within a homologous group of proteins. A key step in the creation of an EP is the elimination of redundancy present in the sequence and structural databases [50] due to bias in the selection of organisms chosen for study. The sequence and structure QR algorithms have been developed specifically to address this problem [21,22]. These smaller, more evolutionarily balanced profiles have comparable, and in many cases better, performance in database searches than conventional profiles containing hundreds of sequences. For more diverse families or superfamilies, with sequence identity < 30%, structural alignments, based purely on the geometry of the protein structures, provide better alignments than pure sequencebased methods. Merging the structure and sequence information allows the construction of accurate profiles for distantly related groups. The success of using sequence and structure based EPs for both gene annotation [23] and the prediction of structurally conserved motifs [51] shows their effectiveness. We also anticipate the usefulness of EPs for studying, among other things, the relationship between protein structure and stability, the evolution of protein/RNA interfaces, and the basis of protein conformational motion.
The actual process of creating an EP is detailed in Sethi et. al. [22], a tutorial [52], and a forthcoming applications paper, but can be summarized as follows: 1. Load a set of sequences and structures and their associated metadata.
2. Align the data, using structural alignments as profiles for aligning widely divergent sequence groups.
3. Perform a phylogenetic analysis to determine the evolutionary relationships in the data. 4. Check and adjust the alignment using the phylogenetic tree and taxonomic information as guides.
5. Eliminate any redundant data that may be a source of bias.
Within this process there are many difficulties, such as identifying horizontal gene transfer (HGT) events and misannotated data, both important for proper grouping of evolutionary data, and developing a statistically wellbalanced set of sequences and structures. MultiSeq attempts to lower some of these barriers to combining sequence and structural data into EPs by consolidating the tools necessary to perform such analyses in an intuitive software package (Figure 1).

Importing protein data for analysis
The primary function of MultiSeq is to provide an environment for the evolutionary analysis of bioinformatic data from both structure and sequence. Before any analysis can be performed, however, the data must first be imported into the environment, which is often a non-trivial task given the wide variety of sources from which data may be acquired. MultiSeq provides a consistent interface to allow data from numerous sources to be quickly and easily brought into the environment and consolidated for further analysis.
Structural data for biomolecules come in a bewildering array of file formats, a large number of which can be read and processed by VMD. To take advantage of this capability, MultiSeq relies on VMD to parse structure files and present a 3D representation of the data. After VMD has loaded the structural data, MultiSeq creates a copy of the sequence portion of the data, stores that in its own internal data structures for use when displaying 1D representations of the data (see Figure 1), and then establishes a link between its internal data structures and those of VMD. This synergy means that MultiSeq works with every format of structural data that VMD supports, including such common formats as PDB, XYZ, NetCDF, and CHARMM. Mul-tiSeq also makes it easy to load multiple structures, which is necessary during the construction of a structural profile. Additionally, MultiSeq extends VMD's ability to load protein structures over the Internet by allowing multiple PDB codes to be specified and individual domains of protein structures to be loaded directly from the Astral database [4].
Sequence data are often stored in a single file containing multiple sequences, in either an aligned or an unaligned state. MultiSeq can load sequence files formatted in ALN, FASTA, Nexus, PIR, and PHY file formats. For FASTA formatted files, description lines are preserved and made available through the electronic notebook, described below. Upon loading a sequence file, MultiSeq can automatically download corresponding structural data, if it is available in one of the known structural databases. Currently supported structural databases are the PDB, Astral, and the subset of Swiss-Prot that is derived from the PDB.
A final method of loading protein sequence data into MultiSeq is through the use of a BLASTP [53] search. Given a target sequence or profile, BLAST discovers a variety of homologous sequences which can then be incorpo-rated into the analysis. MultiSeq uses a locally installed version of BLAST to search local sequence databases using a single sequence, a profile of sequences [34], or a fragment of a sequence or profile. The search can be performed a single time or iteratively using PSI-BLAST, and the search results are displayed and filtered before being imported into MultiSeq, as shown in Figure 2. Current filtering options include BLAST e-score, taxonomic classification, and a redundancy filter based on the sequence QR algorithm [22] on the BLAST generated alignment. As when loading sequence data from a file, any corresponding structural data for the search results can be automatically downloaded when the search results are imported. This opens up the possibility of running a BLAST search against the PDB or Astral databases to load structures that share sequence similarity with a source sequence or profile, a feature that is particularly useful for finding a template during homology modeling of a protein of unknown structure [51].
Depending on the size of the protein, MultiSeq can easily load hundreds of sequences and/or structures. The time required to perform an analysis of such a large set, however, depends on the analysis method being used.

Organizing data to accommodate various analysis frameworks
The number of different sources of sequences and structures can be intimidating and calls for an organizational framework in which to work with the data. At the same time, the varied uses of these data demand that the framework be flexible enough to accommodate a wide variety of users. MultiSeq addresses this issue by implementing a flexible grouping system. Each sequence in MultiSeq is displayed beneath its group in the main display, as shown in Figure 3A. The group header acts as an interface anchor to allow the user to perform operations on the group as a whole, and the status bar shows overview information about the currently selected group. The default grouping is based on the source of the data, i.e., structures loaded through VMD appear in the VMD Structures group and sequences loaded by a BLAST search appear in the BLAST Results group, but the user can easily expand, rename, and reorder these groupings as appropriate for the situa- Sequence Name .  3) The taxonomy dialog allows the user to select the level of taxonomy by which to group the data. (4) Taxonomic information about the data is then used to create the groupings.    tion at hand. Additionally, the data can be automatically grouped by taxonomic classifications ( Figure 3B). Separating the data into evolutionarily distinct groupings allows any analysis to be easily performed on each related group independently.

Finding metadata automatically via the internet
Metadata (or "data about data") -such as taxonomy, enzymatic function, or structural classification -related to sequence and structural data can provide valuable insight during many bioinformatic analyses. Various databases accessible via the Internet store this information and present it when displaying results but otherwise make little use of it. MultiSeq correlates this metadata by cross-referencing both the name of the sequence or structure and any source information contained in the original file. Currently, MultiSeq can extract NCBI taxonomy information [54], Enzyme Commission (EC) numbers, and SCOP structural classifications [2]. MultiSeq integrates this metadata into the evolutionary analysis process through grouping and phylogenetic tree functions.
Metadata can be added, viewed, and edited using the electronic notebook (Figure 4). The electronic notebook provides a consistent way to interact with all available metadata for a sequence, regardless of its source. It also provides a place to store notes regarding the sequence and any processing that has been performed on it. Changes to the metadata are saved along with a MultiSeq session, described below.

Alignment of sequence and structural data
In order to properly analyze multiple homologous sequences and structures, they must first be aligned. For structural data, a version of STAMP [37] that has been modified to better align end regions (details of the modifications are available in the methods section) is used to perform the alignment. For sequence data ClustalW [30] is used. In the next version of MultiSeq, a plug-in framework is planned to allow other sequence and structural alignment programs, such as HMMER [31] and T-Coffee [32], to be used. It is generally accepted that structural alignments are more reliable than sequence alignments for distantly related proteins and RNA molecules [55,56,19], so MultiSeq allows a structural alignment to be passed to ClustalW as a profile to seed the alignment process. This technique can be particularly effective when the structural profile is in fact an evolutionary profile and contains a non-redundant sampling of distantly related structures. In either case, though, the alignment is rarely perfect and some manual editing is usually required using MultiSeq's built-in editor.

QR algorithms to eliminate redundancy and bias in data
Although the vast quantity of data available in this postgenomic era brings many new possibilities for analysis, it also opens up the potential for introducing systematic errors in these analyses due to the biases inherent in the makeup of the various databases. MultiSeq includes both the sequence QR [22] and structure QR [21] algorithms to help detect and eliminate this redundancy during any step of the analysis process. The sequence and structure QR algorithms orthogonally encode a multiple sequence or structure alignment as a multidimensional matrix, and then perform a QR factorization on this matrix [57]. The result is an ordering of the sequences or structures from most linearly independent to least independent. A nonredundant set from amongst the available data (See Figure  4) is constructed by specifying a cutoff in either sequence or structural similarity. The QR algorithms can also be run on a specific region of the MSA so that the non-redundant set can be generated based on, for example, one domain of a multi-domain protein or an insertion in the sequence. The sequence and structure QR algorithms combined with the grouping and selection capabilities of MultiSeq constitute a powerful environment for constructing EPs for use in bioinformatics-intensive tasks such as homology modeling [51] or gene annotation [22,23].

Analyzing phylogenetic relationships
Phylogenetic trees, which show the relationships between related proteins or nucleic acids, are invaluable when performing evolutionary analyses. They provide a guide for investigating why and how certain attributes developed as well as identifying misalignments and HGT events. The accuracy and speed of various tree reconstruction methods, however, varies widely from simple distance based methods such as unweighted pair group method with arithmetic averages (UPGMA) [58] and neighbor-joining (NJ) to complex methods such as maximum likelihood [59], which take into account an underlying theory of evolution. In general, distance based trees are sufficient for many common uses [60]. MultiSeq creates UPGMA trees using the structural measures Q H [20] and root mean square deviation (RMSD) as well as the sequence measure of percent identity. It also creates trees based on similarity using the NJ method of CLUSTALW [30]. After a tree has been computed, it can be decorated and colored with various attributes such as species name, domain of life, and enzymatic function, as shown in Figure 4. Additionally, various manipulations such as collapsing, rotating, and labeling nodes to assist in visualization can be performed.
One further use for phylogenetic trees within MultiSeq is in conjunction with the QR algorithms to eliminate redundancy from data. When either the SeqQR or StructQR tools are used on data being displayed in a phylogenetic tree, those data are highlighted both within the main environment and within the tree viewer ( Figure 4). This feature allows for evaluation of the non-redundant selection so that the user can adjust the cutoff. The orderings from the QR algorithm indicating which data are most linearly independent are also displayed in the tree to assist in this process.

Using visualization to illuminate trends
One way to effectively present complex information is to color code the data using attributes that are not normally visible [61]. MultiSeq presets attributes of the data as coloring in both the 1D representation of the sequence portion of the data and the 3D representation of the structural portion of the data. It maintains a consistent coloring between the two representations in order to facilitate an easy mental transition between them. Many different sequence and structural metrics are currently implemented as coloring choices and the addition of custom coloring methods is supported through a programming interface. The current list of standard metrics is: sequence conservation, sequence entropy, percent sequence identity, sequence similarity, Q res structural similarity, residue type, and structural RMSD. In addition to calculating attribute values, MultiSeq can import them from a tab or space delimited file. This enables the importing of other types of attribute data, such as those from HD exchange or Φ-value experiments.
Many of the above coloring metrics are calculated by comparing two or more sequences or structures to get a value representing the specified attribute. The default behavior is to use all of the loaded data in the calculation of a metric, however, one can optionally have MultiSeq process each group independently. Using this feature one can, for example, view sequence identity across all domains of life and then quickly switch to sequence identity within the individual domains. This capability can be very useful for identifying a signature of a specific group of sequences or structures.
Another method of assisting in visualization is to hide attributes that provide no relevant information in the current context. Often, eliminating this extraneous information can lead to patterns being more quickly understood. One common technique of dealing with this issue in the world of structural biology is through the use of secondary structure representations, showing only the backbone and secondary structure elements of a protein. VMD supports secondary structure, as calculated by STRIDE [38], as a 3D representation and MultiSeq can display it for structures The QR ordering of the non-redundant set is also displayed, lower numbers indicate data that are more linearly independent. (4) The plotter allows a metric to be plotted along the length (or a subset) of the sequence. All of the coloring metrics can also be used by the plotter.  as a graphical 1D representation. MultiSeq also provides a bar and line representation that is particularly useful when visualizing attributes that are zero over portions of the sequence, such as experimental data.

Nucleic acid sequences and structures
MultiSeq also supports bioinformatic analysis of both nucleic acid sequence and structure data, but the tools are somewhat more limited in the present release. Nucleic acid sequences may be imported as unaligned sequences or as MSAs using any of file formats supported for protein sequences. These data may be obtained from a variety of databases including IMG, NCBI (Genbank), Bayreuth tRNA compilations, CRW, RDP, and the Genomic tRNA Database. We provide external scripts to convert files from AE2 format (provided by Gary Olsen) and Bayreuth flatfiles to the FASTA format [62]. BLASTN support for finding related nucleic acid sequences is planned for the next release.
Once nucleic acid sequence data have been loaded, multiple sequence alignments can be computed using the Clus-talW interface within MultiSeq. Only coloring by sequence identity works with nucleic acids, other sequence-based coloring metrics specific for nucleic acid will be incorporated in the next release. STAMP has been modified to align nucleic acid structures by their backbone phosphorous atoms resulting in a structural alignment analogous to α-carbon based alignment for proteins. When the alignment is complete the 3D representation displays the structural superposition of the aligned molecules. The built-in structural alignment analysis tools, such as structure-based trees and coloring metrics, work correctly with nucleic acid structural alignments.
RNA molecules frequently incorporate nonstandard modified nucleotides that can affect folding, structure, and function. For example, the TψC loop in tRNA typically contains a ψ, or pseudouridine, base. There are on the order of 100 RNA-associated modified bases identified at this time [63]. The RNA molecule, as opposed to its DNA gene, must be sequenced to determine the modified bases included. When this information is available in structure or sequence files MultiSeq recognizes and appropriately displays modified bases in the 1D representation. In the next release of MultiSeq, QR will be available for nucleic acids, and a canonical, evolutionarily balanced 16S rRNA will be incorporated to help with phylogenetic analysis. At that time secondary structure analysis tools for nucleic acid structures will also be included.

Exporting data
It is often desirable to preserve an entire bioinformatic analysis so that work can be resumed at a later time. Mul-tiSeq implements this by saving the entire environment as a session. When a session is saved, all of the sequence and structure data loaded into VMD and MultiSeq are saved along with any alignments and transformations that have been applied to them. Metrics, annotations, metadata changes, and representation choices are also saved with the session. Once a session has been saved, it can later be loaded and work resumed quickly and easily.
MultiSeq also supports numerous formats for exporting all or a subset of the data in the environment to a file. This can be useful if an analysis needs to be run using external bioinformatics software. For example, MultiSeq can export all of the files necessary to run a maximum likelihood/parsimony based phylogenetic analysis of sequence data using PAUP* [64], PHYLIP [65], and PHYML [66] or a Bayesian based analysis using MrBayes [67].
A final feature of note is MultiSeq's ability to export publication quality graphics. The sequence window, tree viewer, and plotter can all save PostScript files of their current representation. Since these are vector graphics, they can be scaled and manipulated using illustration software with no loss of quality.

Methods
Q res -We use a measure called Q res to calculate structural similarity of each residue in a set of aligned structures. It is derived from Q, which is used in protein folding to compare the pair distances in a protein conformation to the native one [68]. We have previously used this measure for deriving protein cores by looking at structural conservation [21,20]. Q res computes the similarity of the C α -C α distances between a residue and all other residues in the protein, excluding nearest neighbors, to the corresponding distances in a given set of proteins. The result is a value between 0 and 1 that describes the similarity of the structural environment of a residue in a particular protein to the environment of that same residue in all other proteins in the set. Lower scores indicate low similarity and higher scores high similarity. Formally, Q res is defined as follows: where is the structural similarity of the i th residue in the n th protein, is the C α -C α distance between residues i and j in protein n and is the C α -C α distance between residues i' and j' in protein to that correspond to residues i and j in protein n. The variance is related to the sequence Q H -For measuring the similarity between two structures, we use Q H , which we have previously derived [21,20]. Like Q res , it is also adapted from Q, but accounts for the presence of insertions in the structure. Briefly, Q H calculates an overall score for the similarity of two structures by summing the similarity of all residues and then adding a term for each gap in the alignment. The more that an insertion perturbs the structure of nearby regions, the lower the resulting Q H value.
QR factorization -The sequence and structure QR algorithms eliminate the redundancy from a collection of sequences or structures, respectively. The output is the smallest set of sequences or structures that represents the evolutionary diversity present in the initial group. These algorithms are based on a QR factorization with column pivoting of a matrix encoding the sequence or structure alignment. We have described each of these algorithms and their utility in developing EPs previously [21,22].
STAMP -The STAMP structural alignment program generates both structural superpositions and sequence alignments using tertiary structure comparisons [37]. Two modifications were made to the STAMP structural alignment program included with MultiSeq. First, the program was modified to work with RNA and DNA by allowing it to read structure files containing the phosphate backbone atoms of nucleic acid molecule and to recognize the residues contained in these files. Second, the program was modified to insert gaps into the multiple sequence alignment so that the trailing, poorly aligned ends of different structures will be gapped with respect to one another. These end-gaps are a natural result of the dynamic programming local alignment algorithm used by STAMP.
C++ bioinformatics library -Many of the algorithms are written in C++, since TCL is less suited for computationally intensive work. To facilitate the development and implementation of these algorithms, we have developed libbiokit, a bioinformatics toolkit. This library is comprised of classes that perform file I/O, such as FASTA and PDB readers and writers; classes that represent commonly used bioinformatic data structures, like sequence and structure alignments; and stand-alone utilities that execute the QR, Q H , Q res , and phylogenetic algorithms along with other standard measures used in the analysis of bioinformatic data. Libbiokit is packaged with MultiSeq and is also available separately as open source software from our website [62].

Conclusion
MultiSeq allows new approaches to be taken in bioinformatic analysis: new relationships can be found and investigated by combining sequence and structure data; automatic download and use of metadata along with flexible grouping encourages organized analysis of unfamiliar data; the ability to remove redundancy from large sets of data helps to focus and speed up evolutionary analyses; and integration with several popular bioinformatics tools along with a versatile input and output ability reduce the time and "busy work" overhead of performing any analysis. MultiSeq extends VMD's capabilities into the realm of sequences based data and we hope that MultiSeq will help bring more widespread use of sequence data to the world of structural biology and vice versa.
MultiSeq is included with VMD starting with version 1.8.5 [69]. MultiSeq benefits from VMD's cross platform nature and currently runs on numerous operating systems, including Linux, Mac OS X, Solaris, and Windows. Metadata databases are automatically downloaded and updated via the Internet and can be stored either on the user's local machine or a workgroup file server. The use of BLAST for searching requires a locally installed version of the BLAST software from NCBI and sequence databases stored either on the local machine or a workgroup server. Detailed instructions on configuring the software are available in the MultiSeq manual available online [70]. A tutorial is also available from the NIH Resource for Macromolecular Modeling and Bioinformatics to assist in learning how to use the features of MultiSeq described in this article [52].