MolTalk – a programming library for protein structures and structure analysis
© Diemand and Scheib; licensee BioMed Central Ltd. 2004
Received: 27 October 2003
Accepted: 19 April 2004
Published: 19 April 2004
Two largely unsolved but increasingly urgent problems for modern biologists are a) to analyse protein structures quickly and easily and b) to mine comprehensively the wealth of information that the Protein Data Bank (PDB) distributes along with the 3D co-ordinates. Tools that address these problems need to be highly flexible and powerful, yet freely available and easy to learn.
We present MolTalk, an elaborate programming language, which consists of the programming library libmoltalk implemented in Objective-C and the Smalltalk-based interpreter MolTalk. MolTalk combines the advantages of easy-to-learn, programmable procedural scripting with the flexibility and power of a full programming language.
We give an overview of currently available MolTalk applications and describe one of them, PDBChainSaw, in more detail. PDBChainSaw is a MolTalk-based parser and information extraction utility for PDB files. Weekly updates of the PDB are synchronised with PDBChainSaw and are available for free download from the MolTalk project page http://www.moltalk.org following the link to PDBChainSaw. For each chain in a protein structure, PDBChainSaw extracts the sequence from its co-ordinates and provides additional information from the PDB-file header section, such as scientific organism, compound name, and EC code.
MolTalk provides a rich set of methods to analyse and even modify experimentally determined or modelled protein structures. These methods vary in complexity and are thus suitable for beginners and advanced programmers alike. We envision MolTalk to be most valuable in the following applications:
1) To analyse protein structures repeatedly and on a large scale, e.g. to benchmark protein structure prediction methods or to evaluate structural models. The quality of the resulting 3D-models can be assessed by e.g. calculating a Ramachandran-Sasisekharan plot.
2) To quickly retrieve information for (a limited number of) macro-molecular structures, e.g. H-bonds, salt bridges, contacts between amino acids and ligands or at the interface between two chains.
3) To programme more complex structural bioinformatics software and to implement demanding algorithms through its portability to Objective-C, e.g. iMolTalk.
4) To be used as a front end to databases, e.g. PDBChainSaw.
The major demand from Life Sciences towards bioinformatics today is to combine the often heterogeneous information available and make it easily accessible to a broad range of users. In the past, these efforts concentrated on coping with the overwhelming amount of data that have entered, and still enter, nucleotide and protein sequence databases [1, 2]. Today, other information sources, such as protein structures, are in turn coming under the spotlight of a broader scientific community.
In contrast to the sequence world, only one central data resource exists for protein structures, the Protein Data Bank (PDB) [3]. Despite the undisputed advantage of having all structural data available from one source in a common file format, protein structures impose a new level of complexity. They carry information about where in space the adjacent residues of a protein sequence are located. Furthermore, protein structures provide insights into the spatial environment of an amino acid, which is different from its sequence neighbourhood, as well as into its interactions with other residues or heterogeneous ligands. This wealth of information contains answers to questions as diverse as how proteins function or what compounds may interact with a given protein. However, these answers often remain inaccessible to a broader scientific community.
To overcome this information gap, we developed MolTalk. MolTalk consists of a programming library implemented in Objective-C [4] that maps PDB structure files to object space, and of a scripting language based on Smalltalk [5]. Moreover, MolTalk provides numerous methods that enable both the novice and the expert structural bioinformatician to rapidly develop software tailored to their individual needs and to gain novel insights from protein structure analyses. As an application of MolTalk we describe PDBChainSaw, a mirroring and data extraction routine for PDB files.
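The idea of mapping a PDB file to object space can be illustrated with a minimal Python sketch. It is not libmoltalk code: the class names Structure, Chain, Residue and Atom and the functions parse_atoms and atom_line are hypothetical, only ATOM records are read, and insertion codes and alternate locations are ignored. The column positions follow the fixed-width layout of the PDB file format.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Atom:
    name: str
    xyz: Tuple[float, float, float]


@dataclass
class Residue:
    name: str
    number: int
    atoms: List[Atom] = field(default_factory=list)


@dataclass
class Chain:
    code: str
    residues: Dict[int, Residue] = field(default_factory=dict)


@dataclass
class Structure:
    chains: Dict[str, Chain] = field(default_factory=dict)


def parse_atoms(lines):
    """Map ATOM records to a Structure > Chain > Residue > Atom hierarchy."""
    structure = Structure()
    for line in lines:
        if not line.startswith("ATOM"):
            continue  # HETATM, header and other records are skipped in this sketch
        code = line[21]                      # chain identifier, column 22
        number = int(line[22:26])            # residue number, columns 23-26
        chain = structure.chains.setdefault(code, Chain(code))
        residue = chain.residues.setdefault(
            number, Residue(line[17:20].strip(), number)
        )
        xyz = (float(line[30:38]), float(line[38:46]), float(line[46:54]))
        residue.atoms.append(Atom(line[12:16].strip(), xyz))
    return structure


def atom_line(serial, atom, res, chain, number, x, y, z):
    """Build a column-correct ATOM record (hypothetical helper for the demo)."""
    return (f"ATOM  {serial:>5} {atom:<4} {res:>3} {chain}{number:>4}    "
            f"{x:8.3f}{y:8.3f}{z:8.3f}")


demo = parse_atoms([
    atom_line(1, "N", "MET", "A", 1, 27.340, 24.430, 2.614),
    atom_line(2, "CA", "MET", "A", 1, 26.266, 25.413, 2.842),
    atom_line(3, "N", "GLY", "B", 5, 10.000, 11.000, 12.000),
])
for code, chain in demo.chains.items():
    print(code, [(r.number, r.name, len(r.atoms)) for r in chain.residues.values()])
```

Once the records are held as objects, queries such as "all residues of chain A" become simple attribute look-ups rather than repeated text parsing.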
Table 1. The ten classes summarised in group "structure" with labelled difficulty level and a selection of methods available.
Returns 4-letter PDB code
Returns HEADER, TITLE, REVDAT lines
Extracts date from HEADER line
Returns type of experimental method
Returns resolution as in REMARK2 lines
Writes out complete structure to a stream in PDB format
Returns enumerator over all chains
Returns chain for a given code
Removes a chain from structure
Reads structure from directory or file
Offers parsing options from directory or file
Returns code of this chain (as string/number)
Returns chain identifier consisting of PDB and chain code
Returns COMPND and SOURCE lines, EC code
Transforms all residues/atoms in chain by transformation matrix
Returns number of residues (amino acids and nucleic acids), standard amino acids, heterogeneous residues, solvent residues
Provides access to residues, heterogeneous residues, solvent residues
Adds new residue, heterogen, new solvent molecule to chain
Removes a residue, heterogen, solvent molecule from chain
Derives amino acid sequence from connected residues
Derives amino acid sequence with filled gaps ("X") where missing residues occur
Returns amino acid sequence from SEQRES entry
Computes geometric hashing table of all residues
Finds residues in chain which are close to given co-ordinates based on geometric hashing
Creates a new chain with given code
Returns the residue name/number
Returns the name of the standard residue as the base of this modified residue as in MODRES lines
Returns description of residue modification as given in MODRES lines
Translates residue name into amino acid one letter code
Adds new atom to residue
Creates new residue with number and name
Returns atom name/number
Returns temperature factor for an atom
Returns chemical element
Returns partial charge of atom
Returns enumerator over all bonded atoms
Adds bond from this atom to given atom2
Removes all bonds
Removes bond to atom2
Sets atom to be of chemical type
Calculates Euclidean distance between two co-ordinates
Returns x, y, z from co-ordinates
Transforms co-ordinates by transformation matrix
Pairwise Structural Alignment
Provides access to first/second chain
Computes transformation based on superimposed chains
Re-computes transformation from selection of residue pairs
Calculates RMSD of structural alignment
Counts alignment positions in structural alignment
Counts aligned pairs only
Counts aligned pairs with distance below given cut-off
Reads external pairwise alignment from stream in T_Coffee format and re-computes structural alignment from this
Writes structural alignment to stream in T_Coffee library format
Counts number of residues in this selection
Returns enumerator over selected residues
Includes/excludes a single residue to/from selection
Adds all selected residues from selection2 to this selection
Structurally aligns selection1 to selection2 and returns the resulting transformation matrix
Each class consists of a set of methods, each of which is in turn labelled either "Basic" or "Xtra". Independent of their class, methods can be organised into (1) "basic features", (2) "extended features", (3) "mathematical functions", and (4) "others". "Basic features" enable mapping into object space and querying. "Extended features" can be further sub-divided into "operations" and "manipulations". "Operations" include, for example, superimposition, structural alignment, and transformation. With "manipulations", chains, residues or atoms can be added to or removed from a structure. "Mathematical functions" allow the calculation of vectors and matrices to perform spatial transformations. The features summarised in "others" regulate input and output. Table 1 lists the potentially most important methods and classes of the group "Structure".
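To make the "mathematical functions" and "operations" more concrete, the following Python sketch shows a Euclidean distance, the application of a transformation (rotation matrix plus translation vector) to a co-ordinate, and an RMSD over paired co-ordinates. The function names distance, transform and rmsd are illustrative assumptions and not the MolTalk method names. The sketch also does not compute the optimal superposition itself; it only applies a given transformation.

```python
import math


def distance(a, b):
    """Euclidean distance between two (x, y, z) co-ordinates."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def transform(point, rotation, translation):
    """Apply a 3x3 rotation matrix to a point, then add a translation vector."""
    return tuple(
        sum(rotation[i][j] * point[j] for j in range(3)) + translation[i]
        for i in range(3)
    )


def rmsd(pairs):
    """Root-mean-square deviation over already paired co-ordinates."""
    return math.sqrt(sum(distance(a, b) ** 2 for a, b in pairs) / len(pairs))


# Rotation by 90 degrees about the z axis
ROT_Z_90 = [[0, -1, 0],
            [1, 0, 0],
            [0, 0, 1]]

moved = transform((1.0, 0.0, 0.0), ROT_Z_90, (0.0, 0.0, 2.0))
print(moved)
print(distance((0, 0, 0), (3, 4, 0)))
print(rmsd([((0, 0, 0), (3, 4, 0)), ((1, 1, 1), (1, 1, 1))]))
```

In a structural alignment, the pairs passed to rmsd would be the aligned atoms of the two chains after the second chain has been transformed onto the first.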
Results and Discussion
A: Information stored in PDBChainSaw for the G chain of bovine mitochondrial F1-ATPase (1OHHG) as an exemplar. In the field "chainid", the PDB four-letter code is followed by the single character chain identifier. The field "chainid2" is the concatenation of the PDB code and the ASCII value of the chain identifier, in this case, "71" corresponding to "G". B: The two other options to return the sequence from a protein structure stored in PDB file format.
sequence from co-ordinates (inferred)
ATP SYNTHASE GAMMA CHAIN, MITOCHONDRIAL SYNONYM: BOVINE MITOCHONDRIAL F1-ATPASE GAMMA SUBUNIT
sequence from co-ordinates without 'X'
sequence from SEQRES
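The two chain identifiers and the two co-ordinate-derived sequence variants described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the PDBChainSaw implementation: the function names chain_ids and sequence_from_residues are hypothetical, gaps are detected from jumps in residue numbering (whereas MolTalk is described as working from connected residues), and only the twenty standard amino acids are handled.

```python
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}


def chain_ids(pdb_code, chain_code):
    """Return (chainid, chainid2) for a PDB code and a one-character chain code."""
    chainid = pdb_code + chain_code              # e.g. PDB code followed by "G"
    chainid2 = pdb_code + str(ord(chain_code))   # ASCII value of the chain code
    return chainid, chainid2


def sequence_from_residues(residues, fill_gaps=True):
    """Derive a one-letter sequence from (number, three-letter name) tuples.

    residues must be sorted by residue number. With fill_gaps=True each
    missing residue number is represented by an "X".
    """
    sequence = []
    previous = None
    for number, name in residues:
        if fill_gaps and previous is not None and number > previous + 1:
            sequence.append("X" * (number - previous - 1))
        sequence.append(THREE_TO_ONE[name])
        previous = number
    return "".join(sequence)


print(chain_ids("1OHH", "G"))
observed = [(1, "MET"), (2, "GLY"), (5, "ALA")]
print(sequence_from_residues(observed))
print(sequence_from_residues(observed, fill_gaps=False))
```

Using the ASCII value in "chainid2" keeps identifiers distinct where a case-insensitive comparison would not tell an upper-case chain code from a lower-case one.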
What MolTalk can do
That MolTalk is a full programming language is also its major advantage over other freely available software with embedded macro languages, such as MolMol [10], SPDBV [11], and Rasmol [12]. GNUstep provides a rich object-oriented Application Programming Interface (API) and services such as object locators and the calling of methods in distant objects. These are all available in both MolTalk, the Smalltalk interpreter, and libmoltalk, the Objective-C programming library. As a complete and thus powerful programming language, MolTalk can be used to develop more complex software in structural bioinformatics, as demonstrated with PDBChainSaw. Other applications could be as diverse as determining H-bonds and salt bridges in proteins, measuring distances and angles between atoms in order to assess the quality of a structure, or defining contacting atoms between amino acids, nucleic acids and a ligand, or between residues of different chains. Moreover, two structures or a selection of residues therein may be superimposed and the RMSD calculated. Table 1 gives an overview of a selection of the most important methods summarised in group "structure". Several of these applications have been made available via the iMolTalk server http://i.moltalk.org.
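Finding contacting atoms amounts to a neighbour search within a distance cut-off. The Python sketch below uses a uniform grid of cells as a simple stand-in for the geometric hashing listed in Table 1; it should not be read as the algorithm implemented in libmoltalk. The names build_grid and neighbours, the cell size and the cut-off are assumptions chosen for the example, and the search is only correct when the cut-off does not exceed the cell size.

```python
import math
from collections import defaultdict
from itertools import product


def build_grid(points, cell):
    """Hash each point index into the grid cell that contains the point."""
    grid = defaultdict(list)
    for index, point in enumerate(points):
        key = tuple(math.floor(c / cell) for c in point)
        grid[key].append(index)
    return grid


def neighbours(points, grid, cell, query, cutoff):
    """Return sorted indices of points within cutoff of query (cutoff <= cell)."""
    if cutoff > cell:
        raise ValueError("cutoff must not exceed the cell size")
    qx, qy, qz = (math.floor(c / cell) for c in query)
    hits = []
    # Only the query cell and its 26 adjacent cells can contain a hit
    for dx, dy, dz in product((-1, 0, 1), repeat=3):
        for index in grid.get((qx + dx, qy + dy, qz + dz), ()):
            if math.dist(points[index], query) <= cutoff:
                hits.append(index)
    return sorted(hits)


atoms = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (10.0, 10.0, 10.0), (-2.5, 0.0, 0.0)]
grid = build_grid(atoms, cell=4.0)
close = neighbours(atoms, grid, cell=4.0, query=(0.5, 0.0, 0.0), cutoff=3.5)
print(close)
```

Compared with testing every atom against every other atom, the grid restricts distance calculations to nearby cells, which matters when many structures are analysed repeatedly.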
To achieve a significant gain in execution speed, scripts in MolTalk can be easily ported to Objective-C and the functionality of these programs, compiled and linked with the libmoltalk library, remains the same. GNUstep classes and classes defined in MolTalk are still accessible.
MolTalk is particularly aimed at handling large and complex datasets and at implementing demanding algorithms. Short scripts as well as full programs can be written to parse a single PDB structure file quickly or to run recurrent, in-depth analyses on a large scale. The target audience therefore ranges from the sophisticated software developer to the interested bench biologist. MolTalk positions itself clearly on the side of computational analysis in the emerging field of structural bioinformatics. The applications provided so far on the MolTalk project page support this standpoint. Other programming libraries that have been developed recently appear to aim at a slightly different target audience and are closely related to experimental structure determination initiatives [14–16].
What MolTalk does not aim at
MolTalk aims at the computational analysis of protein structures rather than protein structure prediction. The former can be performed using existing methods and measuring known parameters, whereas the latter requires the implementation of new methods and algorithms. However, as shown for PDBChainSaw, MolTalk can provide valuable input data for prediction methods. Currently, MolTalk as a purely structure-based programming suite does not include sophisticated methods to examine protein sequences. Nor does it provide a graphical user interface or interactive modelling capabilities, which are clearly the strengths of MolMol, SPDBV, and VMD [17].
Accessibility and effort to learn
Access to the MolTalk software and an extensive tutorial is provided via the MolTalk project page at http://bioinformatics.org/moltalk or http://www.moltalk.org. The tutorial is sub-divided into four parts: (1) MolTalk library, (2) Smalltalk interpreted scripting language, (3) information about the GNUstep Foundation, and (4) a comprehensive index. The MolTalk library section itself is sub-grouped into "Requisites", a "Class diagram", and "Classes". "Requisites" contains information about installing a local version of libmoltalk and the pre-requisites for compiling Objective-C code. The "Class diagram" provides an overview of the dependencies of the classes in MolTalk (Figure 1) together with a hint at the overall difficulty of the methods combined in each class. More detailed information on the attributes and methods of each class is provided in the sub-section "Classes". Again, the methods of each class are labelled either "Basic" or "Xtra" to indicate potential difficulties. Notably, classes flagged as "Basic" can also contain methods of complexity "Xtra" and vice versa.
Sub-grouping, together with an index system to highlight potential difficulties, e.g. for novice users, has already proved useful internally for straightforward navigation through the tutorial pages. Users can thus easily identify available or desired classes and methods and apply the example scripts provided on the tutorial pages to their own problems. In the future, learning MolTalk will become even easier through the iMolTalk server [13], where users can upload scripts and retrieve results without having to install MolTalk locally.
MolTalk is a freely available, well documented and thus easy to learn, robust, and clean implementation of object-oriented mapping of PDB files. The included interpreter for the Smalltalk scripting language allows for rapid software development, while retaining the option to port code to Objective-C, compile it, and link it with the MolTalk library. This opens the field for MolTalk to become a prominent protein structure analysis suite for small- to large-scale efforts, particularly when large and complex data sets need to be analysed automatically. Moreover, we regard MolTalk as a potential tool for benchmark analysis, e.g. of protein structure prediction methods, and for model evaluation. MolTalk may also serve as a database front end, as demonstrated with PDBChainSaw, to extract information encoded in PDB files, e.g. sequence from co-ordinates.
Availability and Requirements
Operating systems: Linux and other Unix derivatives, Windows and Mac OS X
Licence: GNU General Public License
Any restrictions to use by non-academics: Free of charge as long as GNU GPL is respected
AVD was partially supported by GlaxoSmithKline. Both authors are grateful to the Swiss Institute of Bioinformatics.
1. Stoesser G, Baker W, van den Broek A, Garcia-Pastor M, Kanz C, Kulikova T, Leinonen R, Lin Q, Lombard V, Lopez R, Mancuso R, Nardone F, Stoehr P, Tuli MA, Tzouvara K, Vaughan R: The EMBL Nucleotide Sequence Database: major new developments. Nucleic Acids Res 2003, 31: 17–22. 10.1093/nar/gkg021
2. Boeckmann B, Bairoch A, Apweiler R, Blatter MC, Estreicher A, Gasteiger E, Martin MJ, Michoud K, O'Donovan C, Phan I, Pilbout S, Schneider M: The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Res 2003, 31: 365–70. 10.1093/nar/gkg095
3. Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE: The Protein Data Bank. Nucleic Acids Res 2000, 28: 235–42. 10.1093/nar/28.1.235
4. Objective-C FAQ [http://www.faqs.org/faqs/computer-lang/Objective-C/faq]
5. StepTalk – GNUstep scripting framework [http://steptalk.agentfarms.net]
6. GNUstep project [http://www.gnustep.org]
7. Yip LY, Scheib H, Diemand AV, Gattiker A, Famiglietti LM, Gasteiger E, Bairoch A: The Swiss-Prot Variant Page and the ModSNP Database: A Resource for Sequence and Structure information on Human Protein Variants. Hum Mutat 2004, 23: 464–470. 10.1002/humu.20021
8. PostgreSQL relational database system [http://www.postgresql.org]
9. PDB file format [http://www.rcsb.org/pdb/docs/format/pdbguide2.2/guide2.2_frame.html]
10. Koradi R, Billeter M, Wuthrich K: MOLMOL: a program for display and analysis of macromolecular structures. J Mol Graph 1996, 14: 51–5. 10.1016/0263-7855(96)00009-4
11. Guex N, Peitsch MC: SWISS-MODEL and the Swiss-PdbViewer: an environment for comparative protein modeling. Electrophoresis 1997, 18: 2714–23.
12. Sayle RA, Milner-White EJ: RASMOL: biomolecular graphics for all. Trends Biochem Sci 1995, 20: 374. 10.1016/S0968-0004(00)89080-5
13. Diemand AV, Scheib H: iMolTalk: an interactive, internet-based protein structure analysis server. Nucl Acids Res, in press.
14. Painter J, Merritt EA: mmLib Python toolkit for manipulating annotated structural models of biological macromolecules. J Appl Crystal 2004, 37: 174–178. 10.1107/S0021889803025639
15. Computational Crystallography Toolbox [http://cctbx.sourceforge.net]
16. CCP4 software library [http://www.ccp4.ac.uk]
17. Humphrey W, Dalke A, Schulten K: VMD: visual molecular dynamics. J Mol Graph 1996, 14: 33–8. 10.1016/0263-7855(96)00018-5
18. GNU compiler collection GCC [http://gcc.gnu.org]
This article is published under license to BioMed Central Ltd. This is an Open Access article: verbatim copying and redistribution of this article are permitted in all media for any purpose, provided this notice is preserved along with the article's original URL.