Columba: an integrated database of proteins, structures, and annotations
© TrißI et al; licensee BioMed Central Ltd. 2005
Received: 15 November 2004
Accepted: 31 March 2005
Published: 31 March 2005
Structural and functional research often requires the computation of sets of protein structures based on certain properties of the proteins, such as sequence features, fold classification, or functional annotation. Compiling such sets using current web resources is tedious because the necessary data are spread over many different databases. To facilitate this task, we have created COLUMBA, an integrated database of annotations of protein structures.
COLUMBA currently integrates twelve different databases, including PDB, KEGG, Swiss-Prot, CATH, SCOP, the Gene Ontology, and ENZYME. The database can be searched using either keyword search or data source-specific web forms. Users can thus quickly select and download PDB entries that, for instance, participate in a particular pathway, are classified as containing a certain CATH architecture, are annotated as having a certain molecular function in the Gene Ontology, and whose structures have a resolution under a defined threshold. The results of queries are provided in both machine-readable extensible markup language and human-readable format. The structures themselves can be viewed interactively on the web.
The COLUMBA database facilitates the creation of protein structure data sets for many structure-based studies. It allows to combine queries on a number of structure-related databases not covered by other projects at present. Thus, information on both many and few protein structures can be used efficiently. The web interface for COLUMBA is available at http://www.columba-db.de.
Biological databases have become a major resource for researchers in life science. With the constantly increasing number of data deposited and the computational tools evolving, the focus of research has shifted from the study of a single gene towards an intra- and inter-species comparison of genes and gene products. This trend can also be seen in the field of structural biology, where the number of protein structures deposited in the Protein Data Bank, PDB  is increasing rapidly. However, looking at the structure alone is not sufficient for a comprehensive study of the various types of relationships between proteins. Other types of information, such as functional and structural annotations of proteins, also have to be taken into account.
Oberg and colleagues  compared the results from infrared and circular dichroism spectroscopy with the actual 3D structure of a protein to gain insight into the relationship between assigned protein secondary structures and spectral band shape. To carry out this study, they had to compile a set of proteins based on the folding classification as defined by CATH, the content of secondary structure elements computed by the DSSP program, and the commercial availability of the proteins. Martin and colleagues  systematically explored the relationship between the folding classification from CATH and the classification of proteins into ENZYME classes. For that purpose, they needed groups of structurally resolved proteins belonging to one of the six main ENZYME classes. In both examples, the first step in the experiments was the compilation of a set of protein structures based on the structure itself and on folding classification, sequence properties, enzymatic activity, and other types of information.
Researchers have several possibilities to collect information on protein structures. First, entries in the PDB itself contain a set of full text information and often are annotated with links to external data sources. However, PDB entries are not curated, only archived by the PDB team. This has two consequences. First, the data are not constantly updated and therefore quickly become out-of-date. Second, the annotation provided by different submitters is highly heterogeneous and does not follow a standardized nomenclature. As a consequence, searching the PDB for annotations is an error-prone task. Annotations may be incomplete or inconsistent with standard nomenclature, spelling errors and uncontrolled usage of abbreviations prevent an efficient textual search, and literature references or links to functional and structural databases may be outdated or missing. Examples of such problems are described in . This lack has led to a number of second-party databases that digest PDB entries and attach a wealth of links to relevant databases. The two best-known sources of that kind are probably PDBsum  and the IMB Jena Image Library . Both store hyperlinks to external databases and not the actual information. Therefore they are well suited for human browsing of single entries, but inadequate for working with sets of structures and their properties. Imagine a researcher wants to compile the set of DNA binding proteins from mammals resolved by X-ray crystallography with a resolution lower than 3.2 Å. Solving this task can be achieved by using either PDBsum or the IMB databases, but it requires extensive manual work or the coding of specialized scripts .
To overcome this problem, we created COLUMBA, a database of information on protein structures that physically integrates information from twelve protein structure related data sources into a single data warehouse. Besides the protein structures themselves, COLUMBA covers structural and sequence-based classification schemes, functional annotation, secondary structure elements, and participation in metabolic pathways. Links between these data and the protein structures, both on the chain, compound, and entry level, are either taken from the second-party databases or are computed inside COLUMBA, leading to more accurate and more current information than available in the PDB itself and as current as possible, we compute links between chains and Swiss-Prot entries based on sequence similarity, thus cross-referencing 68% of the PDB entries to a Swiss-Prot sequence.
Construction and content
COLUMBA is centered around PDB entries . For each entry we store general information like the experimental method, resolution, deposition date, and author. Each PDB entry is organized in compounds, which represent biological units, and each compound has one or more chains. A compound, for which an enzyme classification (E.C.) number exists, is annotated with information from ENZYME  for the enzyme name and biochemical reaction, and with data from the Kyoto Encyclopedia of Genes and Genomes, KEGG  for the participation of that enzyme in metabolic pathways. COLUMBA also integrates data from the Roche Biochemical Pathway Map .
To gain information about protein domains, entries from the protein-fold classification databases SCOP  and CATH  are linked to protein chains. Furthermore, each chain is assigned to a PISCES cluster . PISCES groups protein chains according to their sequence identity and experimental properties into culled sets. For each chain, the secondary structure is computed using the DSSP program . Links to Swiss-Prot entries  were retrieved from the PDBSprotEC database . Exploiting the links from Swiss-Prot to other databases, PDB chains are connected to the NCBI Taxonomy database  and functional annotation from Gene Ontology .
Architecture and database schema
Integration of data sources into COLUMBA
Data sources integrated in COLUMBA.
The annotation workflow populates the COLUMBA data warehouse and establishes connections between PDB entries and the other data sources. Each data source is represented by a software module implementing a fixed interface. Once a new PDB entry is written into COLUMBA, a workflow manager triggers each module, which adds annotations to the entry. The implementation of modules varies according to the nature of the data source. For instance, the DSSP module calls the DSSP program to compute the secondary structure for each chain, whereas the SCOP module searches PDB and chain identifiers in external files. Our annotation pipeline is able to handle logical dependencies between different modules. This architecture allows to include a new data source by just extending the database schema for the new tables, and implementing an appropriate module.
Content of COLUMBA
Number of entries from the PDB.
Number of entries
chains (total 60 241)
Gene Ontology via GOA
SCOP & CATH
SCOP & CATH & Swiss-Prot
7162 compounds (total 33 779)
7 162 7 162 9 172
Enzyme & KEGG
Enzyme & SCOP & CATH
Enzyme & SCOP & CATH & KEGG
Enzyme & Swiss-Prot
entries (total 26 104)
all minus PISCES
COLUMBA is a relational, integrated database of information on protein structures and is specially designed to support the creation of sets of protein structures sharing annotations in any of the data sources. Sets as those described in the introduction can be compiled with a few mouse-clicks using COLUMBA.
COLUMBA can be searched through a web interface available at http://www.columba-db.de. The interface allows two types of queries: Full text search as well as data source and attribute specific searches. In both cases, the query results in a list of PDB entries with their corresponding chains.
Once the user has specified all desired conditions, COLUMBA computes the qualifying set of protein structures. This list of results (see Figure 3A) gives basic information, such as PDB ID, experimental method, compound name, and chains for each entry. The PDB entry ID links to the COLUMBA Explorer view for that particular entry. The Explorer (see Figure 3B) shows all information stored in COLUMBA for that PDB entry. This includes the experimental method and resolution for each entry and compound name, metabolic information, and the source organism for each compound. Detailed information is given for each chain, including protein-fold classification from SCOP and CATH, data from the according Swiss-Prot entry, Gene Ontology annotation, and NCBI taxon name. These data can also be viewed or downloaded in XML format. We also provide on-line molecular visualisation via JMol , and links to the original data items in the respective databases.
Example of use
Consider a query for all compounds from ENZYME class '1.-.-.-' containing a chain with a TIM barrel fold (see Figure 3A). To compute this set, a user first specifies 'TIM barrel' in the full text search form, which returns all PDB chains with the keyword 'TIM barrel' in any of the data sources, including the PDB, SCOP, and CATH annotation. Next, the set of all proteins fulfilling this condition can be intersected with the result of the search for the ENZYME class in the metabolism form. The intersection contains 95 PDB structures. However, using the full text search is only one option for finding the appropriate answer. In general, different answers are possible for a given question depending on the preferences and trust of the user in different databases. Consider again the example given above. If a user has high confidence in either CATH or SCOP, he may specify a condition using the CATH or SCOP form, respectively, instead of performing the full text search for 'TIM barrel'. This results in 79 entries when relying only on CATH and 90 entries for SCOP. The user might even want to restrict the search to only those chains that are annotated as containing a 'TIM barrel' fold in both CATH and SCOP. The returned set has 79 PDB protein structures. These differences result from the fact that COLUMBA usually only takes the cross-references given in the original data and does not curate or amend the content of the integrated databases.
Applications of COLUMBA
The web interface is designed to compile sets of protein structures sharing properties from protein structure related sources, but it is possible to tackle more sophisticated issues by exploiting the relational data warehouse of COLUMBA. We show a number of applications where we used SQL (Structured Query Language) to retrieve information.
A research question concerning the participation of enzymes in metabolic pathways arose from an article from Martin et al.  that investigated the relationship between the protein classification of ENZYME and the folding classification of CATH. One finding at that time was that the known enzymes in the glycolytic pathway contained a very limited set of different CATH architectures and topologies. This naturally raises the questions whether this is the case for other metabolic pathways as well. We used COLUMBA to address this problem, combining PDB data, information on metabolic pathways from KEGG, and the CATH classification.
The number of enzymes for selected metabolic pathways from KEGG.
a / b
Fatty acid biosynthesis (path 1)
Selenoamino acid metabolism
Pentose phosphate pathway
One carbon pool by folate
Phe, Tyr and Trp biosynthesis
Glycolysis / Gluconeogenesis
Reductive carboxylate cycle (CO2 fixation)
The first striking fact is that only 26% of the enzymes participating in KEGG pathways do have annotated chains in CATH. This is because just 34% of the enzymes in KEGG are structurally resolved, of which several are not annotated by CATH. The enzymes within the annotated set contain four times as many domains with an Alpha/Beta class in CATH than Mainly Alpha and Mainly Beta, respectively. In comparison, for all proteins annotated by CATH the Alpha/Beta class only occurs twice as often as each of the other two folds.
The most frequent approach to the interconnection of data on protein structures that are spread over multiple original data sources is the usage of hyperlinks. Examples are PDBsum  and the IMB Jena Image Library . This method is well suited for human browsing of single entries, but as soon as it comes to handling sets of objects, following many hyperlinks becomes a tedious and time consuming task. Efficient handling of sets can only be achieved if data are physically integrated into a single system. In the protein structure world, there are three main such databases apart from COLUMBA. 3DinSight  focuses on visualization of sequence features such as PROSITE patterns or altered positions in the 3D structure. iProClass  concentrates on protein sequence and integrates 50 different databases using so-called 'rich links'. Finally, BioMolQuest  integrates in total four data sources, thus storing only a subset of the information available in COLUMBA. Currently, the Protein Data Bank itself is preparing a new web interface to provide not only the links to related sources, but the actual information from SCOP, CATH, and the Gene Ontology. These are only a subset of the sources integrated in COLUMBA. COLUMBA's functionality could also have been achieved by implementing specific modules for SRS. However, we early on decided to use relational database technology instead of the highly proprietary SRS languages and methods.
Two groups currently address the problem of inconsistent use of terminology in the PDB: the PDB uniformity project  and the Macromolecular Structure Database MSD . Both projects aim at correcting PDB entries, unifying terminology, and adding or updating links to scientific references. The MSD also addresses the linkage of PDB chains to Swiss-Prot entries. We hope that these efforts will make our work easier in the near future, for instance if the PDB entries themselves come with consistent and structured taxonomic information.
COLUMBA currently integrates twelve data sources concerned with different aspects of protein sequences and structures. Notably, COLUMBA does not store the coordinates of structures themselves but is designed to enable users to find 'the right' set of structures based on annotations. This is by intention, since there are already many programs that can efficiently parse, visualize, or compare protein structures from PDB files.
An important design principle of COLUMBA is that it never mixes data from different sources into a single table. Each data source is considered as a dimension in which PDB entries, compounds, and chains are annotated. We call this approach multidimensional data integration , which is inspired by data warehouse design, where facts, e.g., sales, are described by dimensions, such as store, product, or customer . The resulting database schema is called star schema in correspondence with the visual appearance. We also use a star schema like structure with the tables holding information from protein structures in the center of a set of tables containing the data from other data sources.
Our approach is in contrast to projects that aim at a tighter semantic integration, merging logically similar types of information into a single table. Such a semantic integration approach was for instance followed in the TAMBIS project . However, we strongly believe that merging data from different databases is counterproductive for the biologist because it blurs important differences. On the other hand, keeping data separated inevitably leads to a certain degree of semantic redundancy, i.e., different schema elements provide the same type of information. For instance, functional annotation of proteins is encoded both in Swiss-Prot keywords and Gene Ontology terms; 'TIM barrels' are annotated in CATH, SCOP, and the PDB annotation itself. But this redundancy does not originate from data duplication, but rather from evidence obtained independently by different people or by different experiments. These evidences are important in their own right.
We believe that the advantages of our approach prevail for mainly two reasons:
Users recognize the origin of the data they query and obtain as result. In our experience, biologists often have their favorite set of databases, where they know about the pitfalls and peculiarities. By keeping data separated, personal preferences or differences in trust in particular databases can be expressed and the results can be judged based on prior experience.
Subtle differences in the semantics of fields of different databases are conserved. For instance, both Swiss-Prot keywords and GO annotations express functional annotation. However, the process of creating this annotation is quite different, and it is often meaningful to discriminate between the two.
Furthermore, separating data and software for the different data sources greatly simplifies system maintenance. Changes to data sources, including the deletion or addition of data sources, only affect a well defined part of the schema and of the web interface.
Our perception of considering annotation sources as dimensions describing some primary objects is also followed in the EnsMart project . EnsMart uses a 'reversed star schema' to connect genes with different types of information, such as genomic position, transcription factors, or expression data. The data are queried through a generic web interface, which also allows source-specific queries and their combinations. Conceptually, EnsMart and COLUMBA are very similar, but they work on totally different types of data. Moreover, COLUMBA is directly designed for handling annotations of protein structures, which has advantages in terms of result visualization and search options.
COLUMBA has proven to be very useful for a number of tasks in our own structural research. Generating sets of structures, which previously required days of manual browsing or writing of parsers, now only takes a few mouse clicks, or an SQL query. Once the set of PDB entries and chains is obtained, there are many other programs for visualizing or comparing structures. COLUMBA's future development will further concentrate on annotation of structure in contrast to the structure and its coordinates itself. The next data sources to be integrated are those covering protein domains and motifs, i.e., InterPro  and its relatives. In the long run, we will push COLUMBA towards a medical orientation. Obvious candidates for being integrated are literature abstracts from Medline and the OMIM database . The LIGAND database  will provide information about small molecules interacting with proteins to use COLUBMA for the prediction of drug target sides. Moving towards medical data is a natural next step since much of structural research, including our own  is concerned with drug development.
The database is available at http://www.columba-db.de.
This work is supported by BMBF grant no. 0312705B (Berlin Center for Genome-Based Bioinformatics). We thank Raphael Bauer, Rene Heek, and Stefan Günther for implementing many parts of Columba. We acknowledge the excellent work of the database maintainers of the different source databases and thank for their consent to integrate their data into COLUMBA.
- Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE: The Protein Data Bank. Nucleic Acids Res 2000, 28: 235–242. 10.1093/nar/28.1.235PubMed CentralView ArticlePubMedGoogle Scholar
- Oberg K, Ruysschaert J, Goormaghtigh E: Rationally selected basis proteins: A new approach to selecting proteins for spectroscopic secondary structure analysis. Protein Sci 2003, 12: 2015–2031. 10.1110/ps.0354703PubMed CentralView ArticlePubMedGoogle Scholar
- Martin AC, Orengo CA, Hutchinson EG, Jones S, Karmirantzou M, Laskowski RA, Mitchell JB, Taroni C, Thornton JM: Protein folds and functions. Structure 1998, 6: 875–884. 10.1016/S0969-2126(98)00089-6View ArticlePubMedGoogle Scholar
- Bhat T, Bourne P, Feng Z, Gilliland G, Jain S, Ravichandran V, Schneider B, Schneider K, Thanki N, Weissig H, Westbrook J, Berman HM: The PDB data uniformity project. Nucleic Acids Res 2001, 29: 214–218. 10.1093/nar/29.1.214PubMed CentralView ArticlePubMedGoogle Scholar
- Laskowski RA: PDBsum: summaries and analyses of PDB structures. Nucleic Acids Res 2001, 29: 221–222. 10.1093/nar/29.1.221PubMed CentralView ArticlePubMedGoogle Scholar
- Reichert J, Sühnel J: The IMB Jena Image Library of Biological Macromelecules: 2002 update. Nucleic Acids Res 2002, 30: 253–254. 10.1093/nar/30.1.253PubMed CentralView ArticlePubMedGoogle Scholar
- Stein L: Creating a bioinformatics nation. Nature 2002, 417(6885):119–120. 10.1038/417119aView ArticlePubMedGoogle Scholar
- Bairoch A: The ENZYME database in 2000. Nucleic Acids Res 2000, 28: 304–305. 10.1093/nar/28.1.304PubMed CentralView ArticlePubMedGoogle Scholar
- Kanehisa M, Goto S, Kavashima S, Okuno Y, Hattori M: The KEGG resource for deciphering thegenome. Nucleic Acids Res 2004, 32: D277-D280. 10.1093/nar/gkh063PubMed CentralView ArticlePubMedGoogle Scholar
- Michal G: Biochemical Pathways. Boehringer Mannheim GmbH 1993.Google Scholar
- Murzin A, Brenner S, Hubbard T, Chothia C: SCOP: a structural classification of proteins database for the investigation of sequences and structures. J Mol Biol 1995, 247(4):536–540. 10.1006/jmbi.1995.0159PubMedGoogle Scholar
- Orengo C, Michie A, Jones S, Jones D, Swindells M, Thornton J: CATH – a hierarchic classification of protein domain structures. Structure 1997, 5(8):1093–1108. 10.1016/S0969-2126(97)00260-8View ArticlePubMedGoogle Scholar
- Wang G, Dunbrack RL Jr: PISCES: a protein sequence culling server. Bioinformatics 2003, 19(12):1589–1591. 10.1093/bioinformatics/btg224View ArticlePubMedGoogle Scholar
- Kabsch W, Sander C: Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers 1983, 22(12):2577–2637. 10.1002/bip.360221211View ArticlePubMedGoogle Scholar
- Boeckmann B, Bairoch A, Apweiler R, Blatter MC, Estreicher A, Gasteiger E, Martin MJ, Michoud K, O'Donovan C, Phan I, Pilbout S, Schneider M: The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Res 2003, 31: 365–370. 10.1093/nar/gkg095PubMed CentralView ArticlePubMedGoogle Scholar
- Martin AC: PDBSprotEC: a Web-accessible database linking PDB chains to EC numbers via SwissProt. Bioinformatics 2004, 20: 986–8. 10.1093/bioinformatics/bth048View ArticlePubMedGoogle Scholar
- Wheeler D, Chappey C, Lash A, Leipe D, Madden T, Schuler G, Tatusova T, Rapp B: Database resources of the National Center for Biotechnology Information. Nucleic Acids Res 2000, 28: 10–14. 10.1093/nar/28.1.10PubMed CentralView ArticlePubMedGoogle Scholar
- Gene Ontology Consortium: The Gene Ontology (GO) database and informatics resource. Nucleic Acids Res 2004, 32: D258–261. 10.1093/nar/gkh036View ArticleGoogle Scholar
- Camon E, Magrane M, Barrell D, Lee V, Dimmer E, Maslen J, Binns D, Harte N, Lopez R, Apweiler R: The Gene Ontology Annotation (GOA) Database: sharing knowledge in Uniprot with Gene Ontology. Nucleic Acids Res 2004, 32: D262-D266. 10.1093/nar/gkh021PubMed CentralView ArticlePubMedGoogle Scholar
- An J, Nakarna T, Kubota Y, Sarai A: 3DinSight: an integrated relational database and search tool for the structure, function and properties of biomolecules. Bioinformatics 1998, 14(2):188–195. 10.1093/bioinformatics/14.2.188View ArticlePubMedGoogle Scholar
- Wu CH, Huang H, Nikolskaya A, Hu Z, Barker WC: The iProClass integrated database for protein functional analysis. Comput Biol Chem 2004, 28: 87–96. 10.1016/j.compbiolchem.2003.10.003View ArticlePubMedGoogle Scholar
- Bukhman YV, Skolnick J: BioMolQuest: integrated database-based retrieval of protein structural and functional information. Bioinformatics 2001, 17(5):468–478. 10.1093/bioinformatics/17.5.468View ArticlePubMedGoogle Scholar
- Boutselakis H, Dimitropoulos D, Fillon J, Golovin A, Henrick K, Hussain A, Ionides J, John M, Keller P, Krissinel E, McNeil P, Naim A, Newman R, Oldfield T, Pineda J, Rachedi A, Copeland J, Sitnov A, Sobhany S, Suarez-Uruena A, Swaminathan J, Tagari M, Tate J, Tromm S, Velankar S, Vranken W: E-MSD: the European Bioinformatics Institute Macromolecular Structure Database. Nucleic Acids Res 2003, 31: 458–462. 10.1093/nar/gkg065PubMed CentralView ArticlePubMedGoogle Scholar
- Rother K, Müller H, Trissl S, Koch I, Steinke T, Preissner R, Frömmel C, Leser U: Columba: Multidimensional Data Integration of Protein Annotations. In DILS, Volume 2994 of Lecture Notes in Computer Science. Edited by: Rahm E. Springer; 2004:156–171.Google Scholar
- Chaudhuri S, Dayal U: An Overview of Data Warehousing and OLAP Technology. SIGMOD record 1997, 26: 65–74.View ArticleGoogle Scholar
- Baker PG, Brass A, Bechhofer S, Goble C, Paton N, Stevens R: TAMBIS: Transparent Access to Multiple Bioinformatics Information Sources. In 6th Int. Conf. on Intelligent Systems for Molecular Biology. Edited by: Glasgow J, Littlejohn T, Major F, Lathrop R, Sankoff D, Sensen C, Montreal. Canada: AAAI Press, Menlo Park; 1998:25–34.Google Scholar
- Kasprzyk A, Keefe D, Smedley D, London D, Spooner W, Melsopp C, Hammond M, Rocca-Serra P, Cox T, Birney E: EnsMart: a generic system for fast and flexible access to biological data. Genome Res 2004, 14: 160–169. 10.1101/gr.1645104PubMed CentralView ArticlePubMedGoogle Scholar
- Apweiler R, Attwood T, Bairoch A, Bateman A, Birney E, Biswas M, Bucher P, Cerutti L, Corpet F, Croning M, Durbin R, Falquet L, Fleischmann W, Gouzy J, Hermjakob H, Hulo N, Jonassen I, Kahn D, Kanapin A, Karavidopoulou Y, Lopez R, Marx B, Mulder N, Oinn T, Pagni M, Servant F, Sigrist C, Zdobnov E: The InterPro database, an integrated documentation resource for protein families, domains and functional sites. Nucleic Acids Res 2001, 29: 37–40. 10.1093/nar/29.1.37PubMed CentralView ArticlePubMedGoogle Scholar
- McKusick V: Mendelian Inheritance in Man. A Catalog of Human Genes and Genetic Disorders. Johns Hopkins University Press, Baltimore; 1998.Google Scholar
- Goto S, Okuno Y, Hattori M, Nishioka T, Kanehisa M: LIGAND: database of chemical compounds and reactions in biological pathways. Nucleic Acids Res 2002, 30: 402–404. 10.1093/nar/30.1.402PubMed CentralView ArticlePubMedGoogle Scholar
- Preissner R, Goede A, Frömmel C: Dictionary of interfaces in proteins (DIP). Data bank of complementary molecular surface patches. J Mol Biol 1998, 280(3):535–550. 10.1006/jmbi.1998.1878View ArticlePubMedGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.