Accessing the SEED Genome Databases via Web Services API: Tools for Programmers
© Disz et al; licensee BioMed Central Ltd. 2010
Received: 5 January 2010
Accepted: 14 June 2010
Published: 14 June 2010
The SEED integrates many publicly available genome sequences into a single resource. The database contains accurate and up-to-date annotations based on the subsystems concept that leverages clustering between genomes and other clues to accurately and efficiently annotate microbial genomes. The backend is used as the foundation for many genome annotation tools, such as the Rapid Annotation using Subsystems Technology (RAST) server for whole genome annotation, the metagenomics RAST server for random community genome annotations, and the annotation clearinghouse for exchanging annotations from different resources. In addition to a web user interface, the SEED also provides a Web services based API for programmatic access to the data in the SEED, allowing the development of third-party tools and mash-ups.
The currently exposed Web services encompass over forty different methods for accessing data related to microbial genome annotations. The Web services provide comprehensive access to the database back end, allowing any programmer access to the most consistent and accurate genome annotations available. The Web services are deployed using a platform independent service-oriented approach that allows the user to choose the most suitable programming platform for their application. Example code demonstrates that Web services can be used to access the SEED using common bioinformatics programming languages such as Perl, Python, and Java.
We present a novel approach to access the SEED database. Using Web services, a robust API for access to genomics data is provided, without requiring large volume downloads all at once. The API ensures timely access to the most current datasets available, including the new genomes as soon as they come online.
Genomes in the SEED database as of November 26th, 2009.
Since their inception by the Fellowship for Interpretation of Genomes (FIG), these tools were built around an open-source framework that encourages development of new tools and ideas. Although the primary servers are maintained at Argonne National Laboratory and the University of Chicago, several remote SEED installations have been provided for groups requiring programmatic access to the SEED data. However, the main difficulty with remote installations is the maintenance and constant updates that are required, often beyond the capability of the average bioinformatics group. A series of Web services has therefore been developed to provide an API to the annotations of microbial genomes without requiring any downloads or installation.
SOAP services are available from EBI [9, 10], KEGG [11, 12], and NCBI [13, 14]. The existing methods were taken into consideration when developing the SEED Web services interface and our aim is to provide compatible services. However, as further web service APIs are developed, a common set of methods, or a thesaurus to compare methods, should be defined to ensure maximum compatibility and computability between services.
To aid programmatic access to the SEED family of services, an application programming interface (API) was developed based on the Simple Object Access Protocol (SOAP) standard. Here, we describe the basic implementation of the API, and provide example code to query the databases.
The Web services are implemented as a Perl abstraction to the SEED database on the remote server; however, the remote implementation does not limit the user's choice of language or implementation methods. The examples shown here include Perl, Python, and Java, and many other programming languages support SOAP, allowing users to choose their favourite language for their implementation.
Before the services are described, a couple of formalities about the underlying SEED database are introduced. These are provided to orient new users of the database.
The SEED family of databases and services has its own internal identifiers, called FIG identifiers (FIDs), in the format fig|xxxxx.i.type.yyyy. In this representation, the fig| denotes that it is a FIG internal identifier, the xxxxx is usually the NCBI taxon ID of the genome, the .i is the increment of the genome (advanced when major changes are performed), the type is the feature type, and the yyyy is the number of the feature on the genome. Feature types are typically peg (protein encoding gene), rna, pp (prophage), pi (pathogenicity island), and so on. The feature type is lower case, and the number is usually incremented along the chromosome. However, features that are inserted will get the next available, unused, feature number, and the numbers from deleted features are not recycled. Therefore, although features with adjacent numbers are usually adjacent to each other on the chromosome, that is not guaranteed.
Thus, "fig|243277.1.peg.4400" refers to the 4400th protein encoding gene in the 1st increment of the genome with taxonomy ID 243277 (Vibrio cholerae O1 biovar eltor str. N16961). The functional annotation of this protein is "β-subunit of the DNA-directed RNA polymerase (EC 2.7.7.6)". The identifier "fig|243277.1.rna.23" refers to the 23rd RNA feature of the same genome. For simplicity, these two examples are used throughout this discussion. The genome identifier in this case is 243277.1 (note that we include the increment number with the taxonomy identifier). For access to web pages and user controlled material, the link-in URLs based on http://www.theseed.org/linkin.cgi provide access to pages related to the genome, proteins, subsystems and associated data. For example, http://www.theseed.org/linkin.cgi?genome=243277.1 links to the organism overview for Vibrio cholerae O1 biovar eltor str. N16961, and http://www.theseed.org/linkin.cgi?id=fig|243277.1.peg.4400 links to the page related to the protein sequence.
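The identifier layout described above can be taken apart mechanically. The following Python sketch is illustrative only and is not part of the SEED API: the names `Fid`, `parse_fid`, `genome_link` and `feature_link` are hypothetical helpers, and the pattern assumes that the genome part is purely numeric (the text says it is "usually" the NCBI taxon ID). The link-in URLs are built unencoded, exactly as they appear in the text.

```python
import re
from typing import NamedTuple

# fig|<taxon>.<increment>.<type>.<number>, following the format in the text.
# Assumption: taxon, increment and feature number are all plain digits.
FID_PATTERN = re.compile(r"^fig\|(\d+)\.(\d+)\.([a-z]+)\.(\d+)$")

LINKIN_URL = "http://www.theseed.org/linkin.cgi"


class Fid(NamedTuple):
    taxon_id: int
    increment: int
    feature_type: str
    feature_number: int

    @property
    def genome_id(self) -> str:
        # The genome identifier includes the increment, e.g. "243277.1".
        return f"{self.taxon_id}.{self.increment}"


def parse_fid(fid: str) -> Fid:
    """Split a FIG identifier into its four components."""
    match = FID_PATTERN.match(fid)
    if match is None:
        raise ValueError(f"not a FIG identifier: {fid!r}")
    taxon, increment, feature_type, number = match.groups()
    return Fid(int(taxon), int(increment), feature_type, int(number))


def genome_link(genome_id: str) -> str:
    """Link-in URL for an organism overview page."""
    return f"{LINKIN_URL}?genome={genome_id}"


def feature_link(fid: str) -> str:
    """Link-in URL for a feature page (validates the identifier first)."""
    parse_fid(fid)
    return f"{LINKIN_URL}?id={fid}"
```

For example, `parse_fid("fig|243277.1.peg.4400").genome_id` yields the genome identifier that can then be passed to `genome_link`.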
In addition to these internal identifiers, the SEED database maintains mapping to other commonly used identifiers wherever possible. For example, the peg shown above also has the following aliases: GeneID:2615094 NP_229982.1 VC0328 gi|15640355 gi|41019520 kegg|vch:VC0328 sp|Q9KV30 uni|Q9KV30. Typically, the source database is abbreviated, precedes the identifier, and is separated from the identifier with a vertical bar (e.g. sp is SwissProt and uni is UniProt).
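As a small illustration of the alias convention, the sketch below groups a whitespace-separated alias list by its source-database prefix. The function name `group_aliases` is hypothetical, and the assumption that aliases arrive separated by whitespace is ours; aliases without a vertical bar (such as GeneID:2615094 or VC0328) are kept unchanged under the key `None`.

```python
from collections import defaultdict


def group_aliases(alias_text):
    """Group aliases by the database prefix that precedes the vertical bar."""
    grouped = defaultdict(list)
    for alias in alias_text.split():
        source, bar, identifier = alias.partition("|")
        if bar:
            # e.g. "sp|Q9KV30" -> source "sp", identifier "Q9KV30"
            grouped[source].append(identifier)
        else:
            # No vertical bar: keep the alias as given.
            grouped[None].append(alias)
    return dict(grouped)


aliases = group_aliases(
    "GeneID:2615094 NP_229982.1 VC0328 gi|15640355 gi|41019520 "
    "kegg|vch:VC0328 sp|Q9KV30 uni|Q9KV30"
)
print(aliases["gi"])
```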
Accessing the SEED via Web services
Methods, input and output parameters in the SEED Web services API
Parameters & Order
Get the pegs that may be coupled to this peg through abstract coupling. Input is a peg, output is list of [protein, score] for things that are coupled to this peg
Retrieve the set of pegs in order along the chromosome. Input is a comma separated list of pegs, and output is the pegs in order along the genome.
Get the FIG ID(s) (peg) for a given external identifier. Input is an identifier used by another database, output is a list of our identifiers. Note that an alias can refer to more than one protein since the mapping is done via protein sequence.
Get the aliases of a peg. These are the identifiers that other databases use. Input is a peg, output is an array of aliases
Retrieve the protein sequence for a given identifier. Input is an alias, output is a sequence
Get all the FIG protein families (FIGfams). No input needed, it just returns a list of all families
Get all the FIG protein families (FIGfams) with their assigned functions. No input needed, it just returns a list of all the families and their functions.
complete, restrictions, domain
Get a set of genomes. The inputs are a series of constraints - whether the sequence is complete, other restrictions, and a domain of life (Bacteria, Archaea, Eukarya, Viral, Environmental Genome). Output is a list of genome ids. An example use is with the parameters ("complete", undef, "Bacteria") that will return all complete bacterial genomes.
Get a list of all the subsystems and their classifications. No input needed, it just returns a list of all the subsystems and their classifications
Get the boundaries of a feature location. A feature can have multiple locations on a contig (e.g. split locations, introns, etc). This just returns an array of [contig, beginning, end]. You can pass it the output from feature_location directly
Get all the pegs in some FIGfams, their functions, and aliases. Input is a tab-separated list of pegs, returns a 3-column comma separated table [peg, Function, Aliases]
Get the protein sequences for a list of proteins. Input is a tab-separated list of peg, returns a 2-column comma separated table of [peg, sequence]
Get the clusters for a peg by bidirectional best hits. Input is a peg, output is two column table of [peg, cluster]
Get the clusters for a peg by similarity. Input is a peg, output is two column table of [peg, cluster]
Get a comma-separated list of all the contigs in a genome
Get the length of the DNA sequence in a contig in a genome. Input is a genome id and a contig name, return is the length of the contig
Get the pegs that are coupled to any given peg. Input is a peg, output is list of [protein, score] for things that are coupled to this peg
Get the DNA sequence for a region in a genome. Input is a genome ID and a location in the form contig_start_stop, output is the DNA sequence in fasta format.
Get the name for a given E.C. number. Input is an EC number, output is the name
Get the annotations for a peg from all other known sources. Input is a peg, output is two column table of [peg, other function]
Get the location of a peg on its contig. Input is a peg, output is list of locations on contigs. Usually this will be a single location, but sometimes it can either be more than one region on a contig, or even on multiple contigs. For convenience it is a comma joined list, often you will want to pass that to boundaries_of
Get the DNA sequence for a given protein identifier. Input is a peg, output is the DNA sequence in fasta format.
Get the DNA sequence for a set of protein identifiers. Input is a comma-joined list of pegs, output is the DNA sequence in fasta format.
Get the functional annotation of a given protein identifier. Input is a peg, output is a function
complete, restrictions, domain
Get a set of genomes. The inputs are a series of constraints - whether the sequence is complete, other restrictions, and a domain of life (Bacteria, Archaea, Eukarya, Viral, Environmental Genome). Output is a list of genome ids with the genus species appended. An example use is with the parameters ("complete", undef, "Bacteria") that will return all complete bacterial genomes.
Get the genome(s) that a given protein identifier refers to. Input is a peg, output is a single column table of genomes
Get the genus and species of a genome identifier. Input is a genome ID, output is the genus and species of the genome
get_corresponding_ids
Get the corresponding ids of a peg. These are the identifiers that other databases use. Input is a peg, output is an array of aliases
Retrieve the DNA sequence for a particular feature. Note that this will take a feature id (peg, rna, etc), and return the DNA sequence for that id. There is also a separate method to get the DNA sequence for an arbitrary location on a genome
Get the translation (protein sequence) of a peg. Input is a peg, output is the translation. (Note that this is a synonym of translation_of.)
Test whether an organism is Archaeal. Input is a genome identifier, and output is true or false (or 1 or 0)
Test whether an organism is Bacterial. Input is a genome identifier, and output is true or false (or 1 or 0)
Test whether an organism is Eukaryotic. Input is a genome identifier, and output is true or false (or 1 or 0)
Tries to put a protein sequence in a family. Input is a tab-separated id and sequence, delimited by new lines. The output is a comma-separated 2-column table [your sequence id, FamilyID] if the sequence is placed in a family.
Test whether an organism is a Prokaryote. Input is a genome identifier, and output is true or false (or 1 or 0)
Get all the pegs in some FIGfams. The input is a tab-separated list of family IDs, and the output is a two column table of [family id, peg]
Get all the protein identifiers associated with a genome. Input is a genome id, output is a list of pegs in that genome
Get the FIG IDs associated with the MD5 sum of a protein sequence. Input is the md5 checksum, output is an array of strings of FIG ids. This should be faster, and more complete, than using aliases or other ways to match protein sequences.
Get the FIG IDs associated with the MD5 sum of a protein sequence. Input is the md5 checksum, output is a comma separated list of FIG ids as a single string. This should be faster, and more complete, than using aliases or other ways to match protein sequences.
peg_id, n_pch_pins, n_sims, sim_cutoff, color_sim_cutoff, sort_by
Input is a FIG (peg) ID and ..., output is the pinned regions data
Get a tab-separated list of [subsystem name, functional role, peg, subsystem variant code for that genome] for any given reaction id and genome id. Maps the reaction id to peg, peg to genome, and genome to variant code
If this genome replaces another one (it is a more up-to-date version), what is the ID of the older genome?
Get all the RNA identifiers associated with a genome. Input is a genome ID, and output is a list (an array) of the RNAs in that genome
Search and grep through the database. Input is two patterns, first one is used in search_index, second used to grep the results to restrict to a smaller set. Output is an array of hashes with keys id, organism, otherIds, functionalAssignment, and annotator.
Search the database. Input is a pattern to search for, output is list of pegs and roles
peg, maxN, maxP
Retrieve the sims (precomputed BLAST hits) for a given protein sequence. Input is a peg, an optional maximum number of hits (default = 50), and an optional maximum E value (default = 1e-5). The output is a list of sims in modified tab separated (-m 8) format. Additional columns include length of query and database sequences, and method used.
Returns the taxonomy of a given genomeid
Get the translation (protein sequence) of a peg. Input is a peg, output is the protein sequence. (Note that this is a synonym of get_translation).
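The MD5-based lookups listed above take the checksum of a protein sequence as input. A minimal Python sketch of computing such a checksum locally follows. The normalization applied here (removing whitespace and upper-casing the residues) is an assumption, since the text does not state how the SEED normalizes sequences before hashing; it should be checked against the service before relying on the result. The function name `protein_md5` is hypothetical.

```python
import hashlib


def protein_md5(sequence: str) -> str:
    """Return the hex MD5 digest of a protein sequence.

    Assumption: whitespace is removed and residues are upper-cased
    before hashing; the SEED's own normalization may differ.
    """
    cleaned = "".join(sequence.split()).upper()
    return hashlib.md5(cleaned.encode("ascii")).hexdigest()
```

The resulting 32-character hexadecimal string is the kind of value that would be passed to the MD5 lookup methods in place of an alias.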
The examples discussed below all use Perl (http://www.perl.org/) and the SOAP::Lite Perl module available from http://search.cpan.org/. In the examples below we use the simple SOAP::Lite interface making HTTP calls via port 80. This will be sufficient for most API calls, and more details about the SOAP::Lite interface can be found online or in the O'Reilly book Programming Web Services with Perl. Python and Java code that works with the Web services interface is included in the online examples.
To initiate a connection using Perl and SOAP::Lite, the constructor SOAP::Lite->service is provided the URL for the publicly available WSDL file. The dedicated Web services server machine at http://ws.theseed.org/ is optimized for handling Web services calls rather than user-initiated calls (Code 1 in the Additional File 1). The constructor generates method stubs that can then be called as methods of the service. Most commonly used methods are described below.
Searching the SEED
To search the SEED, two different access methods are provided. The simple_search accepts a query string, and returns all data that matches the string. For example, searching for "VC0328" returns the text separated by tabs as shown in Code 2 in the Additional File 1.
The first item is the internal identifier, and the second the genome from which it came. The third item is a list of all other aliases for this peg. The alias list is constructed based on sequence identity. Fourth is the functional annotation of the protein, and the last item is the person that made the annotation - in this case a master (or trusted) annotator made the annotation.
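The five-field layout just described can be unpacked as in the following Python sketch. It assumes each hit arrives as one line of exactly five tab-separated fields; the names `SearchHit` and `parse_search_line` are hypothetical, and the sample line is assembled from values quoted in this article purely for illustration (it is not real service output, and the annotator value is a placeholder). The alias field is kept as a raw string because the separator used inside it is not specified here.

```python
from typing import NamedTuple


class SearchHit(NamedTuple):
    fid: str        # internal FIG identifier
    genome: str     # genome the feature came from
    aliases: str    # other identifiers, left unsplit
    function: str   # functional annotation
    annotator: str  # who made the annotation


def parse_search_line(line: str) -> SearchHit:
    """Split one tab-separated search result into named fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 5:
        raise ValueError(f"expected 5 tab-separated fields, got {len(fields)}")
    return SearchHit(*fields)


# Illustrative line only; not actual output from the service.
sample = "\t".join([
    "fig|243277.1.peg.4400",
    "Vibrio cholerae O1 biovar eltor str. N16961",
    "VC0328 sp|Q9KV30",
    "DNA-directed RNA polymerase beta subunit",
    "master",
])
hit = parse_search_line(sample)
print(hit.fid, hit.annotator)
```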
The second method provided for searching the SEED is search_and_grep. This method takes two arguments: the first is the term to search for, and the second is a regular expression that is applied to the search results, so that only results matching the expression are returned. This provides a server-side mechanism for reducing the output of the search. For example, a simple_search for dnaA returns 2,774 items, but a search_and_grep for "dnaA" and "Vibrio" reduces the list to 56 items (the grep is always case sensitive).
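To make the filtering semantics concrete, the Python sketch below applies a case-sensitive regular expression to a list of result lines on the client side. It is only an approximation of what the server does: the function name `grep_results` is hypothetical, Python's `re` dialect is assumed to be close enough to the server's pattern syntax for simple patterns, and the second sample line is invented to show the effect of case sensitivity. In practice the server-side call is preferable because it avoids transferring the unfiltered results.

```python
import re


def grep_results(lines, pattern):
    """Keep only the lines that contain a match for the pattern.

    Matching is case sensitive, mirroring the behaviour described
    for search_and_grep.
    """
    regex = re.compile(pattern)
    return [line for line in lines if regex.search(line)]


# Illustrative lines only; the second one is made up.
rows = [
    "fig|243277.1.peg.4400\tVibrio cholerae O1 biovar eltor str. N16961",
    "placeholder-id\tvibrio written in lower case",
]
print(len(grep_results(rows, "Vibrio")))
```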
Working with genomes
To retrieve all the genomes in the current instantiation of the SEED database, a call to genomes is made. This method takes three optional constraints. The first is a boolean: if true, only "complete" genomes are returned; if false, all genomes are returned. The second is a set of restrictions that can be applied on a genome-by-genome basis, and the third is a domain to return (Bacteria, Archaea, Eukarya, or Environmental Sample). The block of code shown in Code 3 in the Additional File 1 returns all complete Bacterial genomes in the SEED. The same code is shown in Java, Python, and Perl to demonstrate the portability of the Web services approach.
The returned data is an array of tuples of [genome ID, genome name] separated by a tab. Additionally, the genome name can be retrieved using the genus_species call, that accepts a genome ID as its sole argument.
For any genome ID, every protein encoding gene (peg) of the genome can be retrieved by using the call pegs_of. This simply returns a list of FIDs in each genome that can be parsed using the methods described below. As noted above, the pegs are typically in numerical order along the chromosome but that is not guaranteed as pegs may be added to fill in missing genes. The method adjacent takes a list of pegs and sorts them in order along the chromosome. Thus, the combined call shown in Code 4 in the Additional File 1 will return a list of ordered pegs. Of course, as shown below, the location of each peg can be retrieved and sorted locally by the user, if desired.
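Sorting pegs locally from their locations might look like the Python sketch below. It rests on two assumptions that should be verified against real output: that each location is written as contig_start_stop (the form given for genomic regions in the method list) and that multiple regions are joined with commas. The helper names are hypothetical and the locations in the usage example are invented. A feature with several regions is ordered by the leftmost coordinate on the contig of its first region.

```python
def parse_location(location_text):
    """Parse 'contig_start_stop[,contig_start_stop...]' into tuples."""
    regions = []
    for part in location_text.split(","):
        # rsplit keeps underscores that belong to the contig name.
        contig, start, stop = part.rsplit("_", 2)
        regions.append((contig, int(start), int(stop)))
    return regions


def location_sort_key(location_text):
    """Sort key: contig of the first region, then leftmost coordinate."""
    regions = parse_location(location_text)
    first_contig = regions[0][0]
    leftmost = min(
        min(start, stop)
        for contig, start, stop in regions
        if contig == first_contig
    )
    return (first_contig, leftmost)


def order_pegs(locations):
    """Return peg IDs ordered along the contig.

    `locations` maps each peg ID to its location string.
    """
    return sorted(locations, key=lambda peg: location_sort_key(locations[peg]))


# Invented locations, for illustration only.
example = {
    "fig|243277.1.peg.3": "contig_A_900_100",   # reverse strand
    "fig|243277.1.peg.1": "contig_A_50_400",
    "fig|243277.1.peg.2": "contig_A_500_700,contig_A_750_800",
}
print(order_pegs(example))
```

Taking the smaller of start and stop handles features on the reverse strand, where the start coordinate is larger than the stop coordinate.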
Working with genes and proteins
Many methods are available to retrieve the data underlying the SEED, and most work at the level of the protein. As noted above, both internal and external identifiers are maintained, but typically API requests are made with internal identifiers (FIDs), as shown here. Simple functional calls include the ability to retrieve the location of a FID on a contig, the DNA or protein sequence, and the annotation, as shown in Table 2.
For example, the block of code shown as Code 5 in the Additional File 1 will retrieve the location of the sequence (contig, start position and stop position), and the protein sequence of the peg from Vibrio cholerae. The protein sequence is in fasta format, suitable for feeding into other bioinformatics applications. Similarly, the fid2dna method returns the DNA sequence in fasta format.
An underlying resource in the SEED database is the precomputed coupling of proteins along and between genomes. Coupling is an evidence-based metric of the co-occurrence of any pair of proteins in unrelated genomes, and infers that proteins are involved in the same cellular process. Coupling evidence is one of the pieces of information SEED annotators use to infer function. Two methods are currently provided to return coupling data. First, coupled_to takes a given peg and returns a list of pegs that it is coupled to, along with a normalized score for that coupling. The score is the number of genomes in which similar coupling is retained in nearby pegs.
Pegs that are coupled to fig|243277.1.peg.4400 either directly through close association, or in an abstract manner
Abstract Coupling Score
Translation elongation factor Tu
Transcription antitermination protein NusG
LSU ribosomal protein L11p (L12e)
LSU ribosomal protein L1p (L10Ae)
LSU ribosomal protein L10p (P0)
LSU ribosomal protein L7/L12 (L23e)
DNA-directed RNA polymerase beta' subunit
SSU ribosomal protein S12p (S23e)
SSU ribosomal protein S7p (S5e)
Preprotein translocase subunit SecE
Translation elongation factor G
Similarities returned for fig|243277.1.peg.4400
Timing Web services
Web services provide a mechanism for computational access to the data housed in our databases. The API allows all users to access our systems, retrieve data, and develop tools for mining genomes and metagenomes essentially without restriction. The API provides a flexible interface that has evolved in response to common requests for our end users and will continue to morph in response to demand. The primary advantage of accessing our data via the API is that the data are constantly updated. Although stand-alone SEED installations are available, almost as soon as the installation is complete, it is out of date and needs updating. In contrast, the Web services access data that is mirrored nightly to ensure constant quality and timeliness.
The main drawback with using the Web services approach to access the data rather than via a local installation is the additional overhead associated with transferring the data over the internet. Accessing the data indirectly over a typical internet connection takes about ten times longer than having direct access to the data. However, as the computational processing time increases, those delays are mitigated. On the back end, the overhead is being mitigated with server-side controls to limit the amount of data transferred. For example, the search_and_grep method described here significantly reduces the data returned from database searches. On the front end, prefetching the data and maintaining local caches of limited parts of the data may prove an attractive alternative to continually retrieving large data sets.
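A local cache of the kind suggested here could be sketched in Python as follows. This is one possible design rather than anything provided by the SEED: the class name `TtlCache` is hypothetical, `fetch` stands for whatever function performs the remote call, and the one-day default lifetime is an assumption chosen to match the nightly mirroring mentioned earlier. The clock is injectable so the behaviour can be tested without waiting.

```python
import time


class TtlCache:
    """Cache results of a remote lookup and expire them after a time limit."""

    def __init__(self, fetch, ttl_seconds=24 * 3600, clock=time.monotonic):
        self._fetch = fetch          # function that performs the remote call
        self._ttl = ttl_seconds      # how long a cached value stays valid
        self._clock = clock
        self._store = {}             # key -> (time stored, value)

    def get(self, key):
        now = self._clock()
        entry = self._store.get(key)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]          # still fresh: no remote call
        value = self._fetch(key)     # missing or stale: fetch again
        self._store[key] = (now, value)
        return value
```

Wrapping a lookup such as the functional annotation of a peg in `TtlCache` means repeated requests for the same identifier within the lifetime cost one remote call instead of many.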
In this work to date, we have chosen to implement an RPC/Encoded style of Web service. There are two common Web services approaches: RPC/Encoded and Document/Literal. The general advantage of the former is that it is significantly easier to implement and has a more "natural" style. For example, BLAST search results are returned as tab separated text, just as if they had been computed locally. The disadvantage is that it is much harder for the programmer accessing the data, as they have to individualize each call and the way the data is processed. In contrast, the Document/Literal style uses XML for both the call and response. The XML returned is self-descriptive and self-validating, allowing more automated analysis of the data. Currently we only support the RPC/Encoded style of Web services. In part it was a design decision based on the Perl back end of the SEED API (RPC/Encoded support is conveniently and dynamically supplied by the Perl module POD::WSDL). In addition, this decision allowed us to provide immediate unfettered access to our data while we develop and deploy the more formal Document/Literal style of encoding. We anticipate future releases of the SEED Web services API will move towards Document/Literal even while we continue to support the RPC/Encoded style.
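To show roughly what an RPC/Encoded call looks like on the wire, the Python sketch below assembles a SOAP 1.1 request in which the method name is the body element and each argument carries an explicit type. It is a generic illustration, not the SEED's actual message format: the service namespace `urn:example-seed` and the parameter name `pattern` are hypothetical placeholders, and the real values come from the service's WSDL. SOAP toolkits such as SOAP::Lite normally generate this envelope for the user.

```python
from xml.sax.saxutils import escape

SOAP_ENV = "http://schemas.xmlsoap.org/soap/envelope/"
SOAP_ENC = "http://schemas.xmlsoap.org/soap/encoding/"
XSI = "http://www.w3.org/2001/XMLSchema-instance"
XSD = "http://www.w3.org/2001/XMLSchema"


def rpc_encoded_request(method, params, service_ns="urn:example-seed"):
    """Build an RPC/Encoded SOAP 1.1 envelope as a string.

    `params` is a list of (name, value) pairs; every value is sent as
    xsd:string. `service_ns` is a placeholder namespace.
    """
    args = "".join(
        f'<{name} xsi:type="xsd:string">{escape(str(value))}</{name}>'
        for name, value in params
    )
    return (
        f'<soapenv:Envelope xmlns:soapenv="{SOAP_ENV}" '
        f'xmlns:xsi="{XSI}" xmlns:xsd="{XSD}">'
        "<soapenv:Body>"
        f'<ns:{method} xmlns:ns="{service_ns}" '
        f'soapenv:encodingStyle="{SOAP_ENC}">'
        f"{args}"
        f"</ns:{method}>"
        "</soapenv:Body>"
        "</soapenv:Envelope>"
    )


print(rpc_encoded_request("simple_search", [("pattern", "VC0328")]))
```

The per-argument type annotation is what distinguishes the encoded style; in a Document/Literal service the body would instead be an XML document validated against a schema.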
The current SEED API does not limit access in any way. For example, there is no limit on how frequently calls may be made. However, too many repeated calls may be misconstrued as a denial of service (DoS) attack by the host, and therefore users are cautioned to throttle their requests appropriately.
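One simple way to throttle requests on the client side is to enforce a minimum interval between consecutive calls, as in the Python sketch below. The class name `Throttle` is hypothetical and the interval is left to the caller, since no recommended rate is stated here. The clock and sleep functions are injectable so that the logic can be tested without real delays.

```python
import time


class Throttle:
    """Enforce a minimum interval between consecutive calls."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self._min_interval = min_interval  # seconds between calls
        self._clock = clock
        self._sleep = sleep
        self._last = None                  # time of the previous call

    def wait(self):
        """Block until at least min_interval has passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self._min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Calling `wait()` immediately before each Web services request spaces the requests out; the first call never blocks.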
We have provided many code examples both in the Additional File 1 and online at http://ws.theseed.org/. The service is also included in the BioCatalogue http://www.biocatalogue.org/ and future services will also be included there. Users are encouraged to contact the authors to share code and to provide reusable code fragments.
The SEED family of databases and associated software (Fig. 1) is a comprehensive set of microbial genome annotation and analysis databases. Every microbial genome sequenced to date is stored in these databases, and the annotation servers provide a flexible framework for both complete genomes and metagenomes. Researchers are encouraged to try the programmatic access to the SEED as an alternative means of retrieving data.
Availability and requirements
Project name: SEED Web services API
Project home page: http://ws.theseed.org/
Operating system(s): Platform independent
Programming language: Language independent
Other requirements: SOAP
License: SEED Toolkit Public License
Any restrictions to use by non-academics: no limitations
FID: A FIG ID, an internal identifier in the format fig|xxxxxx.i.peg.yyyy
FIG: Fellowship for Interpretation of Genomes
MG-RAST: Server for metagenome annotations based in part on RAST technology
RAST: Server for Rapid Annotations using Subsystem Technology
SEED: The database and infrastructure for comparative genomics.
Part of this project has been funded with Federal funds from the National Institute of Allergy and Infectious Diseases, National Institutes of Health, Department of Health and Human Services, under Contract No. HHSN266200400042C.
We thank the beta-testers of this service especially Bahador Nosrat and Scott Kelley at San Diego State University.
- Overbeek R, Fonstein M, D'Souza M, Pusch GD, Maltsev N: The use of gene clusters to infer functional coupling. Proc Natl Acad Sci USA 1999, 96(6):2896–2901. 10.1073/pnas.96.6.2896
- Overbeek R, Begley T, Butler RM, Choudhuri JV, Chuang HY, Cohoon M, de Crecy-Lagard V, Diaz N, Disz T, Edwards R, et al.: The subsystems approach to genome annotation and its use in the project to annotate 1000 genomes. Nucleic Acids Res 2005, 33(17):5691–5702. 10.1093/nar/gki866
- Kanehisa M, Goto S, Kawashima S, Okuno Y, Hattori M: The KEGG resource for deciphering the genome. Nucleic Acids Res 2004, (32 Database):D277–280. 10.1093/nar/gkh063
- Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, et al.: Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet 2000, 25(1):25–29. 10.1038/75556
- Overbeek R, Disz T, Stevens R: The SEED: A peer-to-peer environment for genome annotation. Commun ACM 2004, 47(11):46–51. 10.1145/1029496.1029525
- Aziz RK, Bartels D, Best AA, DeJongh M, Disz T, Edwards RA, Formsma K, Gerdes S, Glass EM, Kubal M, et al.: The RAST Server: rapid annotations using subsystems technology. BMC Genomics 2008, 9: 75. 10.1186/1471-2164-9-75
- McNeil LK, Reich C, Aziz RK, Bartels D, Cohoon M, Disz T, Edwards RA, Gerdes S, Hwang K, Kubal M, et al.: The National Microbial Pathogen Database Resource (NMPDR): a genomics platform based on subsystem annotation. Nucleic Acids Res 2007, (35 Database):D347–353. 10.1093/nar/gkl947
- Meyer F, Paarmann D, D'Souza M, Olson R, Glass EM, Kubal M, Paczian T, Rodriguez A, Stevens R, Wilke A, et al.: The metagenomics RAST server - a public resource for the automatic phylogenetic and functional analysis of metagenomes. BMC Bioinformatics 2008, 9: 386. 10.1186/1471-2105-9-386
- Brooksbank C, Cameron G, Thornton J: The European Bioinformatics Institute's data resources. Nucleic Acids Res (38 Database):D17–25.
- Leinonen R, Akhtar R, Birney E, Bonfield J, Bower L, Corbett M, Cheng Y, Demiralp F, Faruque N, Goodgame N, et al.: Improvements to services at the European Nucleotide Archive. Nucleic Acids Res (38 Database):D39–45.
- Kanehisa M, Goto S, Furumichi M, Tanabe M, Hirakawa M: KEGG for representation and analysis of molecular networks involving diseases and drugs. Nucleic Acids Res (38 Database):D355–360.
- Kawashima S, Katayama T, Sato Y, Kanehisa M: KEGG API: A web service using SOAP/WSDL to access the KEGG system. Genome Informatics 2003, 14: 673–674.
- Geer LY, Marchler-Bauer A, Geer RC, Han L, He J, He S, Liu C, Shi W, Bryant SH: The NCBI BioSystems database. Nucleic Acids Res 2010, (38 Database):D492–496. 10.1093/nar/gkp858
- Sayers EW, Barrett T, Benson DA, Bolton E, Bryant SH, Canese K, Chetvernin V, Church DM, Dicuccio M, Federhen S, et al.: Database resources of the National Center for Biotechnology Information. Nucleic Acids Res 2010, (38 Database):D5–16. 10.1093/nar/gkp967
- Ray RJ, Kulchenko P: Programming Web Services with Perl. O'Reilly 2003.
- Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. J Mol Biol 1990, 215(3):403–410.
- Akhter S, Bailey B, Salamon P, Edwards R: Shannon's Uncertainty and Kullback-Leibler Divergence in Microbial Genome and Metagenome Sequences. 1st International conference on Bioinformatics and Computational Biology: 2009; New Orleans, LA 2009.
- Shannon CE: A mathematical theory of communication. Bell Syst Tech J 1948, 27(3):379–423.
- Which style of WSDL should I use? [https://www.ibm.com/developerworks/webservices/library/ws-whichwsdl/]
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.