BioWord: A sequence manipulation suite for Microsoft Word
© Anzaldi et al.; licensee BioMed Central Ltd. 2012
Received: 3 January 2012
Accepted: 10 May 2012
Published: 7 June 2012
The ability to manipulate, edit and process DNA and protein sequences has rapidly become a necessary skill for practicing biologists across a wide swath of disciplines. In spite of this, most everyday sequence manipulation tools are distributed across several programs and web servers, sometimes requiring installation and typically involving frequent switching between applications. To address this problem, we have developed BioWord, a macro-enabled self-installing template for Microsoft Word documents that integrates an extensive suite of DNA and protein sequence manipulation tools.
BioWord is distributed as a single macro-enabled template that self-installs with a single click. After installation, BioWord will open as a tab in the Office ribbon. Biologists can then easily manipulate DNA and protein sequences using a familiar interface and minimize the need to switch between applications. Beyond simple sequence manipulation, BioWord integrates functionality ranging from dyad search and consensus logos to motif discovery and pair-wise alignment. Written in Visual Basic for Applications (VBA) as an open source, object-oriented project, BioWord allows users with varying programming experience to expand and customize the program to better meet their own needs.
BioWord integrates a powerful set of tools for biological sequence manipulation within a handy, user-friendly tab in a widely used word processing software package. The use of a simple scripting language and an object-oriented scheme facilitates customization by users and provides a very accessible educational platform for introducing students to basic bioinformatics algorithms.
In a relatively short time, editing and processing of DNA and protein sequences have left the realm of molecular biology to become a routine practice for biologists working in myriad different fields. At the same time, the number of tools and servers for performing analyses on biological sequences and related data has exploded, creating a need for resource integration. There have been several attempts to reconcile this vast and expanding array of services with data and service integration. Many of these approaches have relied on the creation of web-based service portals that seek to integrate and simplify data collection and analysis with a wide variety of available tools [2–4], while other efforts have focused on service and data integration through the use of browser-enabled interoperability between services, data providers and even desktop applications [5–7].
The sheer scope and power of data and service integration portals and browser add-ons is also one of the main obstacles to their wide acceptance, since many users rarely need to use more than one or two services (e.g. BLAST and Entrez search) and lack the necessary training in bioinformatics to navigate easily through interconnected repositories of data and services. Still, a wide range of practicing biologists must perform relatively simple manipulation, editing and processing of DNA and protein sequences on a daily basis. To perform these routine manipulations, this substantial segment of users has resorted to proprietary desktop software, like DNAStar or the GCG Wisconsin Package [8, 9], to ingenious bookmarking of specific web servers, or to services that integrate several tools for sequence manipulation, like the Molecular Toolkit or the Sequence Manipulation Suite (SMS) [10, 11].
The class structure is functionally wrapped within a module structure that handles the interface with Microsoft Word document objects. This design strategy is aimed at decoupling the basic BioWord objects from their running environment, thus avoiding the need to derive specialized classes when, for instance, specific output formats are desired. The RibbonControl module handles basic communication between the ribbon, the ColSequences objects and the document. It contains the methods the ribbon buttons are linked to, thereby defining the functionality of the ribbon. Upon capture of a button-click event, the RibbonControl parses the user selection, instantiates the necessary ColSequences object and calls the appropriate ColSequences method to process the selected sequences, thus implementing the fundamental control flow of BioWord (Figure 1). The RibbonControl module also centralizes reception of the results returned by ColSequences methods and calls the appropriate method to handle their output according to sequence type and formatting options. Methods for output generation are stored in the Resources module, which handles both the specific format (e.g. FASTA or table) and the destination of the output. BioWord allows output to be sent to the clipboard, to a new document, or to the current document, where it can either follow the selection immediately or overwrite it. In addition, the Resources module defines a broad set of handy functions to manipulate both sequence and non-sequence objects, like sorting or removing duplicates from a collection. Two additional modules complement this basic module architecture. The XMLHandler module manages the interaction with the XML Options file (which defines the option fields for BioWord) and handles the loading, saving and updating of the option fields available in the ribbon.
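The control flow described above (parse the selection, instantiate a collection, call a processing method, route the output) can be illustrated with a minimal sketch. It is written in Python for brevity and is not BioWord's VBA code: the class names Sequence and ColSequences are borrowed from the text, but every method and function shown here (reverse, parse_selection, to_fasta, on_reverse_click) is a hypothetical stand-in, and a plain list plays the role of the output destination.

```python
from dataclasses import dataclass, field


@dataclass
class Sequence:
    """One biological sequence (simplified, hypothetical stand-in)."""
    name: str
    residues: str


@dataclass
class ColSequences:
    """A collection of sequences plus the processing applied to it."""
    items: list = field(default_factory=list)

    def reverse(self):
        # Processing returns a new collection and never touches the document.
        return ColSequences([Sequence(s.name, s.residues[::-1]) for s in self.items])


def parse_selection(text):
    """Parse selected text as bare sequences, one per non-empty line."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    return ColSequences([Sequence(f"seq{i + 1}", ln) for i, ln in enumerate(lines)])


def to_fasta(col):
    """Output formatting is kept apart from the sequence classes."""
    return "\n".join(f">{s.name}\n{s.residues}" for s in col.items)


def on_reverse_click(selection_text, destination):
    """Controller step: parse the selection, process it, route the output."""
    result = parse_selection(selection_text).reverse()
    # A list stands in for the clipboard or document destination.
    destination.append(to_fasta(result))


output = []
on_reverse_click("ATGC\nGGTA", output)
print(output[0])
```

Because the formatting and destination logic sit outside the sequence classes, supporting a new output format in this sketch would mean adding one formatting function rather than deriving a new class, which is the decoupling the text describes.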
Integration, editing and distribution
BioWord is written fully in VBA and is compliant with the Visual Basic 6 standard, thus maintaining backwards compatibility with earlier versions of Microsoft Office. Because the basic Sequence and ColSequences classes, which encode sequence processing functionality, are explicitly detached from the document interface, the core of the code is readily adaptable to all versions of Microsoft Word supporting VBA, as well as to other Microsoft Office programs, such as Excel. BioWord is fully encapsulated within a macro-enabled (.dotm) template, which facilitates its distribution and installation through the use of the Open XML format. The code and the XML Options file are embedded within the .dotm structure, which also contains the ribbon stored as an XML file. BioWord code can be edited with any text editor or, more conveniently, within the integrated VBA editor of Microsoft Word. The XML Options file and the XML ribbon can also be edited with any text/XML editor. For convenience, the XML ribbon can also be edited with the freely available Open XML Custom UI Editor.
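Since an Open XML file is a ZIP package of named parts, the contents of a .dotm template can be inspected with ordinary archive tools. The Python sketch below lists the parts of such a package. The part names used in the demonstration (customUI/customUI.xml for the ribbon, word/vbaProject.bin for the macro project) are the ones typically found in macro-enabled Word files and are an assumption here, not a description of the actual BioWord template; the package itself is built in memory so the example runs without any file.

```python
import io
import zipfile


def list_package_parts(source):
    """Return the sorted part names stored in an Open XML (ZIP) package."""
    with zipfile.ZipFile(source) as package:
        return sorted(package.namelist())


def build_demo_package():
    """Build a tiny in-memory package with assumed, typical part names."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        package.writestr("[Content_Types].xml", "<Types/>")
        package.writestr("customUI/customUI.xml", "<customUI/>")
        package.writestr("word/vbaProject.bin", b"")
    buffer.seek(0)
    return buffer


parts = list_package_parts(build_demo_package())
for name in parts:
    print(name)
```

The same function accepts a file path, so it could be pointed at a real template to see which XML parts are present before editing them.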
Results and discussion
Format and sequence manipulation
In its current implementation, BioWord can parse and convert to and from three widespread formats for biological sequences: FASTA, GenBank Flat File and bare/raw sequence. Conversion buttons are available in the Manipulation group, along with reverse and complement (DNA/RNA) buttons, but output conversion can also be made implicit by setting the Format option of the Basic Options group to the desired format.
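As an illustration of the kind of operation involved, the following Python sketch parses FASTA text, converts it to bare sequence and computes a reverse complement. It is a simplified example rather than BioWord's implementation: the function names are hypothetical, only DNA is handled, and characters outside A, C, G and T are passed through unchanged.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def to_raw(records):
    """Convert parsed records to bare sequence, one sequence per line."""
    return "\n".join(seq for _, seq in records)


# DNA only; any other character is left as it is by str.translate.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.translate(_COMPLEMENT)[::-1]


records = parse_fasta(">a desc\nATG\nCC\n>b\nGGA")
print(to_raw(records))
print(reverse_complement("ATGCC"))
```

Parsing into (header, sequence) pairs first keeps the conversion step trivial: each output format only needs a function that walks the same list of records.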
Translation and sequence statistics
Search methods and consensus logos
BioWord also exploits the ability to handle PSFM models to address a pressing need in the representation of sequence motifs. It is well known that consensus sequences are an unsuitable representation of sequence motifs because they omit information on the importance of consensus bases and the relative frequency of non-consensus bases at each position of the motif. Sequence logos are able to integrate these two missing elements, together with the consensus, in an encapsulated representation and are therefore a superior and preferred method for the representation of sequence motifs. Unfortunately, sequence logos are graphic elements and many authors continue to use consensus sequences to represent motifs in order to avoid the need for additional figures or to allow in-text discussions about the motif. BioWord provides a solution to this problem by allowing the representation of sequence motifs in text format using the consensus sequence, while simultaneously depicting its information content. For instance, the LexA-binding motif of Escherichia coli would be represented as [consensus logo graphic not reproduced]. In this representation (the consensus logo), the vertical bar character is used to represent the y-axis scale, with the maximum value, in bits, provided next to it. The height of the consensus letter at each position corresponds to the positional information content of that position (using either mutual information or relative entropy measures). This representation does not provide frequency information of non-consensus bases and, therefore, a sequence logo should be used preferentially whenever possible. Nonetheless, the consensus logo provides the means to convey information about positional conservation in text format and its use of information theory units allows straightforward comparison of motifs (e.g. the LexA-binding motif of E. coli can be directly compared to that of the α-Proteobacteria).
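The positional information content behind a consensus logo can be sketched as follows. This Python example computes, for each column of a set of aligned sites, the consensus base and the relative entropy (in bits) against a background distribution. It is an illustrative assumption rather than BioWord's code: the function names are hypothetical, the background defaults to uniform base frequencies, no small-sample correction is applied, and the four toy sites are made up for the example. The text_logo function is only a crude plain-text stand-in for letter height, stacking the consensus letter in proportion to its information content.

```python
import math
from collections import Counter


def column_information(sites, background=None):
    """Return (consensus_base, information_in_bits) for each alignment column."""
    if background is None:
        background = {base: 0.25 for base in "ACGT"}  # assumed uniform
    columns = []
    for column in zip(*sites):
        counts = Counter(column)
        total = len(column)
        # Relative entropy of observed frequencies against the background.
        info = sum(
            (n / total) * math.log2((n / total) / background[base])
            for base, n in counts.items()
        )
        consensus = counts.most_common(1)[0][0]
        columns.append((consensus, info))
    return columns


def text_logo(columns, max_bits=2.0, rows=4):
    """Crude multi-line rendering: taller letter stacks mean more information."""
    heights = [round(info / max_bits * rows) for _, info in columns]
    lines = []
    for row in range(rows, 0, -1):
        body = "".join(
            base if height >= row else " "
            for (base, _), height in zip(columns, heights)
        )
        lines.append("|" + body)  # the bar marks the y-axis
    return "\n".join(lines)


sites = ["CTGT", "CTGA", "CTGT", "CAGT"]  # toy aligned sites
columns = column_information(sites)
print("".join(base for base, _ in columns))
print(text_logo(columns))
```

In this toy alignment the fully conserved first and third positions reach the 2-bit maximum for DNA under a uniform background, while the positions with one mismatch in four sites carry about 1.19 bits, so their letters are drawn shorter.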
Motif discovery and alignment
BioWord integrates many commonly used methods for sequence manipulation and editing in a single add-on for Microsoft Word, providing a powerful and easily accessible toolkit for biological sequence processing in an environment familiar to most practicing biologists. Among other functions, the current version of BioWord implements bi-directional translation, ORF detection, consensus logos, Gibbs sampling and several powerful sequence search methods. Its simple class structure and modular design based on an accessible object-oriented language (VBA) facilitate customization, code expansion and sharing. Together with its encapsulation in an environment that most students know well, these features also make it a powerful educational instrument.
Availability and requirements
Project name: BioWord
Project home page: http://sourceforge.net/projects/bioword/
Operating system(s): Microsoft Windows
Programming language: Visual Basic for Applications (VBA)
Other requirements: Microsoft Office 2007 or higher
License: GNU GPL
The authors would like to thank Naim Raja Díaz for contributing to the development of an early forerunner of BioWord. We thank the anonymous reviewers for their insightful suggestions, which helped us improve BioWord significantly. This work was supported by the UMBC Office of Research. Writing of this paper was supported by a UMBC SFF award.
- Neerincx PB, Leunissen JA: Evolution of web services in bioinformatics. Briefings in Bioinformatics. 2005, 6 (2): 178-188. 10.1093/bib/6.2.178.
- Navas-Delgado I, Rojano-Munoz Mdel M, Ramirez S, Perez AJ, Andres Leon E, Aldana-Montes JF, Trelles O: Intelligent client for integrating bioinformatics services. Bioinformatics (Oxford, England). 2006, 22 (1): 106-111. 10.1093/bioinformatics/bti740.
- Carver T, Bleasby A: The design of Jemboss: a graphical user interface to EMBOSS. Bioinformatics (Oxford, England). 2003, 19 (14): 1837-1843. 10.1093/bioinformatics/btg251.
- Subramaniam S: The Biology Workbench–a seamless database and analysis environment for the biologist. Proteins. 1998, 32 (1): 1-2. 10.1002/(SICI)1097-0134(19980701)32:1<1::AID-PROT1>3.0.CO;2-Q.
- Basu MK: SeWeR: a customizable and integrated dynamic HTML interface to bioinformatics services. Bioinformatics (Oxford, England). 2001, 17 (6): 577-578. 10.1093/bioinformatics/17.6.577.
- Bare JC, Shannon PT, Schmid AK, Baliga NS: The Firegoose: two-way integration of diverse data from different bioinformatics web resources with desktop applications. BMC Bioinforma. 2007, 8: 456. 10.1186/1471-2105-8-456.
- Shahid M, Alam I, Fuellen G: Biotool2Web: creating simple Web interfaces for bioinformatics applications. Appl Bioinforma. 2006, 5 (1): 63-66. 10.2165/00822942-200605010-00009.
- Womble DD: GCG: The Wisconsin Package of sequence analysis programs. Methods Mol Biol (Clifton, NJ). 2000, 132: 3-22.
- Burland TG: DNASTAR’s Lasergene sequence analysis software. Methods Mol Biol (Clifton, NJ). 2000, 132: 71-91.
- Molecular Toolkit: http://www.vivo.colostate.edu/molkit/
- ISO/IEC: Information technology -- Document description and processing languages -- Office Open XML File Formats. 2008, International Organization for Standardization.
- OpenXMLDeveloper: http://openxmldeveloper.org/blog/b/openxmldeveloper/archive/2009/08/07/7293.aspx
- Lipman DJ, Pearson WR: Rapid and sensitive protein similarity searches. Science (New York, NY). 1985, 227 (4693): 1435-1441. 10.1126/science.2983426.
- Fristensky B: Feature expressions: creating and manipulating sequence datasets. Nucleic Acids Res. 1993, 21 (25): 5997-6003. 10.1093/nar/21.25.5997.
- Nakamura Y, Gojobori T, Ikemura T: Codon usage tabulated from the international DNA sequence databases. Nucleic Acids Res. 1997, 25 (1): 244-245. 10.1093/nar/25.1.244.
- Cornish-Bowden A: Nomenclature for incompletely specified bases in nucleic acid sequences: recommendations 1984. Nucleic Acids Res. 1985, 13 (9): 3021-3030. 10.1093/nar/13.9.3021.
- Kyte J, Doolittle RF: A simple method for displaying the hydropathic character of a protein. J Mol Biol. 1982, 157 (1): 105-132. 10.1016/0022-2836(82)90515-0.
- Schneider TD: Information Content of Individual Genetic Sequences. J Theor Biol. 1997, 189 (4): 427-441. 10.1006/jtbi.1997.0540.
- Stormo GD, Fields DS: Specificity, free energy and information content in protein-DNA interactions. Trends Biochem Sci. 1998, 23 (3): 109-113. 10.1016/S0968-0004(98)01187-6.
- Erill I, O’Neill MC: A reexamination of information theory-based methods for DNA-binding site identification. BMC Bioinforma. 2009, 10 (1): 57. 10.1186/1471-2105-10-57.
- Erill I, Escribano M, Campoy S, Barbe J: In silico analysis reveals substantial variability in the gene contents of the gamma proteobacteria LexA-regulon. Bioinformatics (Oxford, England). 2003, 19 (17): 2225-2236. 10.1093/bioinformatics/btg303.
- Schneider TD: Consensus sequence Zen. Appl Bioinforma. 2002, 1 (3): 111-119.
- Schneider TD, Stephens RM: Sequence logos: a new way to display consensus sequences. Nucleic Acids Res. 1990, 18 (20): 6097-6100. 10.1093/nar/18.20.6097.
- Erill I, Jara M, Salvador N, Escribano M, Campoy S, Barbe J: Differences in LexA regulon structure among Proteobacteria through in vivo assisted comparative genomics. Nucleic Acids Res. 2004, 32 (22): 6617-6626. 10.1093/nar/gkh996.
- Hertz GZ, Hartzell GW, Stormo GD: Identification of consensus patterns in unaligned DNA sequences known to be functionally related. Comput Appl Biosci. 1990, 6 (2): 81-92.
- Lawrence CE, Altschul SF, Boguski MS, Liu JS, Neuwald AF, Wootton JC: Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment. Science (New York, NY). 1993, 262 (5131): 208-214. 10.1126/science.8211139.
- Smith TF, Waterman MS: Identification of common molecular subsequences. J Mol Biol. 1981, 147 (1): 195-197. 10.1016/0022-2836(81)90087-5.
- Needleman SB, Wunsch CD: A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol. 1970, 48 (3): 443-453. 10.1016/0022-2836(70)90057-4.
- Bailey TL, Elkan C: Fitting a mixture model by expectation maximization to discover motifs in biopolymers. Proc Int Conf Intell Syst Mol Biol; ISMB. 1994, 2: 28-36.
- Luo Y, Pfuetzner RA, Mosimann S, Paetzel M, Frey EA, Cherney M, Kim B, Little JW, Strynadka NC: Crystal structure of LexA: a conformational switch for regulation of self-cleavage. Cell. 2001, 106 (5): 585-594. 10.1016/S0092-8674(01)00479-2.
- Munch R, Hiller K, Barg H, Heldt D, Linz S, Wingender E, Jahn D: PRODORIC: prokaryotic database of gene regulation. Nucleic Acids Res. 2003, 31 (1): 266-269. 10.1093/nar/gkg037.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.