search GenBank: interactive orchestration and ad-hoc choreography of Web services in the exploration of the biomedical resources of the National Center For Biotechnology Information
© Mrozek et al.; licensee BioMed Central Ltd. 2013
Received: 1 August 2012
Accepted: 22 February 2013
Published: 1 March 2013
Due to the growing number of biomedical entries in data repositories of the National Center for Biotechnology Information (NCBI), it is difficult to collect, manage and process all of these entries in one place by third-party software developers without significant investment in hardware and software infrastructure, its maintenance and administration. Web services allow development of software applications that integrate in one place the functionality and processing logic of distributed software components, without integrating the components themselves and without integrating the resources to which they have access. This is achieved by appropriate orchestration or choreography of available Web services and their shared functions. After the successful application of Web services in the business sector, this technology can now be used to build composite software tools that are oriented towards biomedical data processing.
We have developed a new tool for efficient and dynamic data exploration in GenBank and other NCBI databases. A dedicated search GenBank system makes use of NCBI Web services and a package of Entrez Programming Utilities (eUtils) in order to provide extended searching capabilities in NCBI data repositories. In search GenBank users can use one of the three exploration paths: simple data searching based on the specified user’s query, advanced data searching based on the specified user’s query, and advanced data exploration with the use of macros. search GenBank orchestrates calls of particular tools available through the NCBI Web service providing requested functionality, while users interactively browse selected records in search GenBank and traverse between NCBI databases using available links. On the other hand, by building macros in the advanced data exploration mode, users create choreographies of eUtils calls, which can lead to the automatic discovery of related data in the specified databases.
search GenBank extends standard capabilities of the NCBI Entrez search engine in querying biomedical databases. The possibility of creating and saving macros in search GenBank is a unique feature with great potential, which will grow further as the networks of relationships between data stored in particular databases become denser. search GenBank is available for public use at http://sgb.biotools.pl/.
Keywords: NCBI entrez; Entrez databases; Entrez search engine; Entrez programming utilities; Data exploration; Data searching; Data querying; Web services; Orchestration; Choreography
Not so long ago, the cooperation of medical science and informatics was not so common, but nowadays, collaboration of scientists from completely different fields of science is not unusual. Moreover, the cooperation of scientists from different domains initiated the formation of intermediate and interdisciplinary scientific fields; those which can combine knowledge and experience from apparently distant research areas. It appears that without the cooperation across boundaries resulting from differences between the various fields of science, the current image of scientific work would look quite different.
Several years ago, medicine and life sciences began to generate a huge amount of data, which had to be processed and stored in some way. Various research centers around the world began to create data repositories trying to solve the problem of collecting and processing large volumes of biological information. The solutions and techniques that focus on storing medical data are increasingly fine-tuned, because the amount of information is constantly growing. One of the major reasons for the creation of biological databases is the very large amount of genetic information, including nucleotide sequences. Since 1977, when the Sanger sequencing method was published, the problem of storing genetic information has been prevalent. GenBank[1, 2] is one of the most famous databases in the world that stores genetic information. GenBank and other biomedical databases are maintained by the National Center for Biotechnology Information (NCBI, http://www.ncbi.nlm.nih.gov). The NCBI not only hosts the databases, but also provides a uniform system for data retrieval and additional programming tools that allow software developers to create specialized applications for biomedical data integration.
Retrieving information from NCBI databases
The most commonly used system for searching and retrieving information from databases that are maintained by the NCBI is Entrez[3, 4]; the Entrez system is also a tool for indexing records in NCBI databases. The first version of the system was distributed on CD-ROM (1991). At that time, Entrez provided data on nucleotide sequences from the GenBank, amino acid sequences from the Protein database, which stored protein sequences corresponding to nucleotide sequences in the GenBank, and also scientific abstracts from the PubMed database[6, 7].
The system is constantly being developed and improved; the number of nodes supported by the Entrez is increasing all the time. The original system, which contained three nodes, i.e. GenBank (Nucleotide), Proteins and PubMed, has evolved in recent years, adding new nodes, including:
Taxonomy, organized around the names and phylogenetic relationships between organisms;
Structure, organized around the three-dimensional structures of proteins and nucleic acids;
Genome, representing completely sequenced organisms and those for which sequencing is in progress, together with links to genomic data available for these organisms;
PopSet, a set of DNA sequences that have been collected to analyze the evolutionary relatedness of a population;
OMIM, which is a database of all known diseases with a genetic basis;
SNP (dbSNP), the Single Nucleotide Polymorphism database - a public-domain archive for a broad collection of simple genetic polymorphisms.
Access to these resources can be achieved by a graphical user interface of the NCBI Entrez system or by using NCBI Web services.
A Web service is a software system designed to support interoperable machine-to-machine interactions over a network. Web services provide a way in which we can build distributed web-accessible programs that are key components for the implementation of different types of software applications in the Service-Oriented Architecture (SOA). Unlike traditional client/server applications, Web services do not provide the Graphical User Interface (GUI). They provide their programmatic interfaces across the network in order to share processes, data and operational logic. Descriptions of these interfaces are published by the Web services creators and can be discovered and then used by Web services consumers, i.e. software developers. Software application developers can enrich their programs by using functionality that is available through the particular Web service.
Web services are widely used in business software applications. Their popularity stems from the interoperability they provide between different software applications running on a variety of platforms and/or frameworks. This is achieved by the use of several technologies, including: XML (Extensible Markup Language), which allows a structure to be described and assigned to the transferred data; WSDL (Web Services Description Language), which is used to describe the functionality offered by a Web service; SOAP (Simple Object Access Protocol), which allows the exchange of messages and data between client applications and Web services and the invocation of operations shared by the Web service; and UDDI (Universal Description, Discovery and Integration), which is a mechanism to register and locate Web service applications.
Web service orchestration and choreography
The availability and popularity of Web services supported by the development of associated standards, protocols and technologies, like XML, SOAP, WSDL and UDDI, provided the opportunity for building complex processes that are based on the message flow between many distributed Web services. However, the construction of such systems requires the appropriate coordination of all components taking part in the composite process. Orchestration and choreography permit this task. Orchestration[14, 20] refers to a process of coordination regarding how component Web services will be invoked, what information they will receive and what information they will return. The interaction between the coordinating service and Web services is carried out using messages. A central middleware service, called an orchestration engine or orchestrator, coordinates the flow of messages during the process and models various communication paths. It maintains complete control over the running process and mediates the exchange of data. Choreography is similar to orchestration, but the main difference is that there is no central coordination between calls of particular Web services. Choreography is not described from the view of the single participant (orchestrator), but from the global perspective of the cooperating participants. The use of a particular coordination method depends on the situation and the complex process that is modeled.
There are several tools that allow the development of complex workflows and the coordination of information flow between Web services, including Taverna, SADI, Altova MapForce, and others. Taverna is a universal, open-source tool for designing and running scientific workflows. These workflows combine different Web services, local services and scripts that can be used to access various biological databases and to perform different analyses in molecular biology and bioinformatics. SADI is a framework for the discovery of, and interoperability between, distributed biological data and other bio-oriented analytical resources. SADI adds semantics to descriptions of interfaces of existing Web services in order to simplify and automate service design and deployment. Altova MapForce is a domain-independent, visual data mapping tool for advanced data integration projects. It allows the user to design the flow of data between various data sources, transform the data and generate the code of the transformation. Web services can be one of the data sources and MapForce allows the construction of complex pipelines combining these Web services. MapForce is a commercial tool. Microsoft SQL Server Integration Services (SSIS) is another commercial product allowing the integration and transformation of data from various data sources, including Web services. Workflows similar to those built in Taverna may contain many component tasks accessing Web services that can be invoked in the planned sequence.
In this article, we present a new tool for the efficient and dynamic exploration of data in GenBank and its allied databases. We have designed and developed a special search GenBank system and web portal, which enables searching for information in GenBank and also in other NCBI databases through the interactive orchestration of NCBI Web services. Moreover, by using macros, users transparently design their own choreography of the Web service utilities, extending the possibilities of standard data searching of the NCBI Entrez.
While exploring biomedical databases in the face of the growing number of bio-resources available worldwide, we have to answer one fundamental question: whether to integrate the available data or to use existing systems through their shared endpoints. Since the integration of biological data resources and tools requires investments in computer infrastructure and its maintenance, in our solution we adopted the latter approach. search GenBank uses Web services and a set of available utilities in order to explore biomedical databases of the NCBI. Users can follow one of three exploration paths:
simple searching based on the specified user’s query,
advanced searching based on the specified user’s query, and
advanced exploration with the use of macros.
Each of the above mentioned paths starts with the initial query, which is specified by a user. This query can be given as a single word or phrase, which usually takes place in the simple searching mode, or can have dedicated syntax specified by NCBI, which is typically used in the advanced searching mode. Advanced exploration with the use of macros allows not only the submission of a query to one particular database, but also traversing a predetermined path between various databases and discovering related information. From a technical point of view, we can treat simple and advanced searching modes as a special case of the advanced exploration with the use of macros, wherein we use a single database for the exploration.
Query syntax and construction, how phrases provided by users are automatically broken into terms and how they are mapped to appropriate fields that are searched by the NCBI search engine are described in the following subsections. These are internal features of the NCBI Entrez search engine. However, we describe them here because they are implicitly used while invoking NCBI Web services by search GenBank. We also present the architecture of the search GenBank system and the flow of information in various paths of data exploration. Before we go deeper into functional details of the system, we will consider the following scenario.
Motivating use case
This scenario shows one of possible exploration paths when diagnosing diseases based on the analysis of a patient’s DNA. Changes in the DNA sequence may be spontaneous or caused by chemical or physical factors. When they occur at a frequency of less than 1% of the population they are identified as a mutational change - genomic mutations. These changes in the nucleotide sequence can cause changes in the protein sequence, structure, and function, such as loss of function, or less frequently, the acquisition of a new function. Unfortunately, while recognizing the change in the DNA sequence, we can rarely clearly answer the question of how it affects the function of the protein. If a change in the DNA causes the emergence of a stop codon that ends the translation process, we can clearly state that there is a change of the protein function due to the truncated amino acid chain produced or complete loss of function if the cell degrades truncated protein. Any other change in the DNA can affect the structure and function of the protein, but the identification of the effect of the nucleotide change is primarily based on the evaluation of the patient’s phenotype or evaluation of how the presence of a disease correlates with the presence of mutation in members of the studied family. Such an analysis is typical when diagnosing many diseases which have their origin in DNA mutations.
The analysis presented in this scenario may generate various shorter or longer exploration paths, depending on the conclusions drawn from the investigation. search GenBank provides various exploration modes for such scenarios, and each of them begins with an initial query to a particular database.
Query syntax and construction
While searching the NCBI databases, users usually input key words or search phrases to the search box and submit them to the NCBI Entrez search engine. The strings entered into Entrez are converted to queries with the following format:
Term1[field1] Op Term2[field2] Op Term3[field3] Op …
where Term specifies the query phrase, field is the searched field (always enclosed in square brackets), and Op is one of the available logical operators: AND, OR or NOT (operators must be written in capital letters).
Example: breast cancer AND human[organism]
In the situation where the query consists only of a list of UIDs (unique identifiers of records in NCBI databases) or accession numbers, Entrez will return only those records to which given identifiers refer. Therefore, no additional query processing will occur.
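The query format described above can be illustrated with a short sketch. The helper build_query and its argument layout are hypothetical names introduced here for illustration; they are not part of search GenBank or of the NCBI tools. The sketch only joins terms, optional field tags, and Boolean operators into one string.

```python
from typing import List, Optional, Tuple

# Operators accepted by Entrez; they must be written in capital letters.
ALLOWED_OPS = {"AND", "OR", "NOT"}

# One clause: (operator joining it to the previous clause, term, optional field).
Clause = Tuple[Optional[str], str, Optional[str]]


def build_query(clauses: List[Clause]) -> str:
    """Join clauses into the 'Term1[field1] Op Term2[field2] ...' format."""
    parts = []
    for index, (op, term, field) in enumerate(clauses):
        if index > 0:
            if op is None or op.upper() not in ALLOWED_OPS:
                raise ValueError(f"clause {index} needs one of {sorted(ALLOWED_OPS)}")
            parts.append(op.upper())
        # A term without a field tag is left for Entrez to map automatically.
        parts.append(f"{term}[{field}]" if field else term)
    return " ".join(parts)


example = build_query([
    (None, "breast cancer", None),
    ("AND", "human", "organism"),
])
print(example)  # breast cancer AND human[organism]
```

Leaving the field empty for a term hands that term over to the automatic mapping performed by Entrez.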
Automatic mapping of phrases in queries submitted to Entrez
Taxonomic node - each phrase is limited to the [organism] field or [All Fields]. For example, for the mouse phrase, the system automatically maps the phrase to: “Mus musculus”[organism] OR mouse[All Fields].
Names of journals - the database is searched against the names of journals, e.g.: science → science[Journal].
Name of the author - the result of the query is narrowed to [Author] field. Not every phrase can be mapped to the name of the author field, because the correct name of the author can only be the word, followed by one or two letters. For example: Mrozek D → “Mrozek D”[Author].
If the system does not return results after the auto-mapping process, the rightmost phrase of the query is removed and the mapping process is repeated until the system returns results. If the system still returns no results, all query phrases are limited to the All Fields field and joined by the logical operator AND.
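The retry behaviour can be approximated in a few lines. This is a simplified reading of the process, not the actual Entrez implementation: the lookup table TOY_TABLE and the function auto_map are hypothetical, and the real translation tables are much larger. The sketch tries the longest leading phrase first, drops the rightmost word on each miss, and falls back to All Fields for words that match nothing.

```python
from typing import Dict, List

# Toy translation table (an assumption for illustration only).
TOY_TABLE: Dict[str, str] = {
    "mouse": '"Mus musculus"[organism] OR mouse[All Fields]',
    "science": "science[Journal]",
}


def auto_map(query: str, table: Dict[str, str]) -> str:
    """Map the longest known leading phrase, then continue with the rest."""
    words: List[str] = query.split()
    mapped: List[str] = []
    start = 0
    while start < len(words):
        # Try the longest candidate first; each miss drops the rightmost word.
        for end in range(len(words), start, -1):
            phrase = " ".join(words[start:end]).lower()
            if phrase in table:
                mapped.append(f"({table[phrase]})")
                start = end
                break
        else:
            # Nothing matched: search this single word in all fields.
            mapped.append(f"{words[start]}[All Fields]")
            start += 1
    return " AND ".join(mapped)


print(auto_map("mouse polymerase gene", TOY_TABLE))
```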
Examples of automatic mapping of phrases in queries submitted to Entrez
Query entered by the user → Query after automatic mapping process
breast cancer inhibitor → (“Clin Breast Cancer”[Journal] OR “Breast Cancer”[Journal] OR (“breast”[All Fields] AND “cancer”[All Fields]) OR “breast cancer”[All Fields]) AND inhibitor[All Fields]
cancer inhibitor breast → (“Cancer”[Organism] OR cancer[All Fields]) AND inhibitor[All Fields] AND breast[All Fields]
human rab5a → (“Homo sapiens”[Organism] OR human[All Fields]) AND rab5a[All Fields]
bos polymerase gene → (“Bos”[Organism] OR bos[All Fields]) AND polymerase[All Fields] AND gene[All Fields]
bos a polymerase gene → bos a[Author] AND polymerase[All Fields] AND gene[All Fields]
Entrez Programming Utilities
Entrez Programming Utilities (eUtils)[27-29] is a set of eight programs running on the NCBI server side. These utilities provide a stable interface to the NCBI Entrez and NCBI resources. We can use eUtils in two ways. In the first, our application sends a properly composed URL to the server that hosts the tools and receives the response from the selected tool in XML format. The second, more elegant way is to use Web services, which guarantee interoperability across platforms, applications, and programming languages, and rely on standardized protocols (SOAP, WSDL, and UDDI). On its website, the NCBI makes available links to the WSDL files with a description of the Web services that provide access to eUtils.
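The first, URL-based way of calling eUtils can be sketched as follows. The helper eutils_url is a hypothetical name, and the base address is the public E-utilities endpoint as we understand it today, which may differ from the address in use when the system was built. The sketch only composes the URL; it does not contact the server.

```python
from urllib.parse import urlencode

# Base address of the E-utilities (assumed); each tool has its own .fcgi endpoint.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def eutils_url(tool: str, **params: str) -> str:
    """Compose the URL for one eUtils call, e.g. tool='esearch'."""
    return f"{EUTILS_BASE}/{tool}.fcgi?{urlencode(params)}"


url = eutils_url(
    "esearch",
    db="nucleotide",
    term="human[organism] AND rab5a",
    retmax="20",
)
print(url)
```

The resulting address could then be fetched with any HTTP client, for instance urllib.request.urlopen, and the XML reply parsed by the caller.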
Description of available eUtils tools
EInfo - provides information about available databases or a specific database, such as: the number of indexed records for each searched field, the date of the last update, and available links to other NCBI databases.
EGQuery - responds to the query by returning the number of records that match the specified search phrases in any Entrez database.
ESearch - responds to the query by returning a list of unique identifiers (UIDs) of records that match the specified query.
ESummary - for the specified list of UIDs, returns summaries of records from a particular database.
EPost - accepts a list of UIDs and sends it to the History Server, returning the appropriate address in the form of parameters: WebEnv and query_key.
EFetch - responds to the list of UIDs by returning complete records from a specified database.
ELink - allows traversing between databases using related UIDs. Returns the list of UIDs of records from a destination database related to the UIDs of records from the source database.
ESpell - returns suggestions of the correct spelling for the query entered by the user.
Interactive orchestration of eUtils
Orchestration allows the user to combine various utilities in order to create a composite exploration process. search GenBank is a kind of interactive orchestrator that allows various exploration paths to be modeled. Users can not only submit queries to particular databases and receive results, but also browse related records in other databases. Working interactively with search GenBank, they translate their own logic of the analysis process into the flow of information from search GenBank to the NCBI Web service and back. In fact, it is search GenBank that translates the user’s navigation paths into a set of commands sent to, and responses received from, the NCBI Web services, and that coordinates the compound analysis process. This interactive orchestration is available in all three modes of data exploration, which are discussed in more detail in the Results and discussion section.
For example, the basic combination of the tools included in the eUtils package is:
ESearch → EFetch/ESummary
The list of UIDs can be explicitly entered on the id input of the next tool (Figure 4a) or can be implicitly passed to the next tool from the Entrez History Server using a pair of WebEnv-QueryKey parameters (Figure 4b). The first of these methods is very useful when users interactively choose records through the graphical user interface (GUI) and traverse between data sources, passing UIDs of these records on to the next program in the pipeline. The second method allows the list of UIDs to be stored outside search GenBank, at the Entrez History Server, which is a valuable feature when a full navigation path, i.e. a pipeline, is already prepared and predetermined (see choreography in the next section) or when dealing with huge amounts of data. In the latter case, the use of the Entrez History Server limits the amount of data transferred to and from search GenBank between successive invocations of Web services.
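The two ways of handing UIDs from ESearch to the next tool can be sketched as below. The function next_call_params is a hypothetical helper, and the XML is a shortened, made-up reply in the general shape of an ESearch result; QueryKey and WebEnv are assumed to be present only when ESearch was called with usehistory=y.

```python
import xml.etree.ElementTree as ET
from typing import Dict

# Shortened, illustrative ESearch reply; the UIDs and WebEnv value are made up.
SAMPLE_ESEARCH_XML = """<eSearchResult>
  <Count>2</Count>
  <QueryKey>1</QueryKey>
  <WebEnv>EXAMPLE_WEBENV</WebEnv>
  <IdList><Id>111</Id><Id>222</Id></IdList>
</eSearchResult>"""


def next_call_params(esearch_xml: str, db: str, use_history: bool) -> Dict[str, str]:
    """Build parameters for the tool that follows ESearch (ESummary or EFetch)."""
    root = ET.fromstring(esearch_xml)
    if use_history:
        # Implicit passing: the UIDs stay on the Entrez History Server.
        return {
            "db": db,
            "query_key": root.findtext("QueryKey", default=""),
            "WebEnv": root.findtext("WebEnv", default=""),
        }
    # Explicit passing: the UIDs travel in the id parameter.
    ids = [node.text for node in root.findall("IdList/Id") if node.text]
    return {"db": db, "id": ",".join(ids)}


print(next_call_params(SAMPLE_ESEARCH_XML, "nucleotide", use_history=False))
print(next_call_params(SAMPLE_ESEARCH_XML, "nucleotide", use_history=True))
```

With the History Server variant, the size of the request stays constant no matter how many records the initial query matched.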
Now, let us suppose that the user decides to see the full details of the presented records. In this situation, he/she interactively triggers an invocation of the EFetch tool (Figure 5) through the search GenBank web page. If the user wants to see related records in another database (e.g. Gene), he/she uses dedicated links, which triggers an invocation of the ELink tool followed by another invocation of the ESummary tool, with appropriate parameters passed to each of the tools.
In other words, if the objective of the user is to retrieve records from one database and then find associated records in another database (or the same database), the following combination of tools must be used to accomplish this task:
ESearch → ELink → EFetch/ESummary
In this pipeline, the ELink is responsible for generating a list of UIDs for records from a destination database, which are related to the records from the source database, against which the user’s query was executed. It is worth noting that the destination database can be the same as the source database, so that links can be followed between records of the same type, often called “neighbors”, in sequence and structure nodes.
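The role of ELink in this pipeline can be illustrated with a small sketch. The helpers elink_params and linked_uids are hypothetical, and the XML is a shortened, made-up reply in the general shape of an ELink result; the parameter names dbfrom, db and id follow the E-utilities documentation as we understand it.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

# Shortened, illustrative ELink reply; all UIDs are made up.
SAMPLE_ELINK_XML = """<eLinkResult>
  <LinkSet>
    <DbFrom>nuccore</DbFrom>
    <IdList><Id>111</Id></IdList>
    <LinkSetDb>
      <DbTo>gene</DbTo>
      <LinkName>nuccore_gene</LinkName>
      <Link><Id>9001</Id></Link>
      <Link><Id>9002</Id></Link>
    </LinkSetDb>
  </LinkSet>
</eLinkResult>"""


def elink_params(source_db: str, target_db: str, uids: List[str]) -> Dict[str, str]:
    """Parameters asking for records in target_db related to uids in source_db."""
    return {"dbfrom": source_db, "db": target_db, "id": ",".join(uids)}


def linked_uids(elink_xml: str) -> List[str]:
    """Collect the UIDs of the related records in the destination database."""
    root = ET.fromstring(elink_xml)
    return [node.text for node in root.findall("LinkSet/LinkSetDb/Link/Id") if node.text]


print(elink_params("nuccore", "gene", ["111"]))
print(linked_uids(SAMPLE_ELINK_XML))
```

Setting the source and destination to the same database in elink_params corresponds to looking for neighbors.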
Navigation paths are described in more detail in the Results and discussion section. However, we have to be aware that they involve interactive orchestration behind the scenes; that is the role of search GenBank as an orchestrator.
Ad-hoc choreography in advanced data exploration
what tools will be used during data exploration,
what resources and databases will be involved in the process,
what is the order of using resources and tools,
and, indirectly, what will be the message flow between components taking part in the process.
Similarly, if we are interested in a wider spectrum of relations, for example, our goal is to find the amino acid sequences associated with genes that are, in some way, related to nucleotide sequences from the population set of sequences belonging to the mouse, then the number of calls to the ELink tool is three:
ESearch → ELink → ELink → ELink → EFetch
For such a case, the ESearch returns a list of UIDs for all of the population sets of sequences for the mouse retrieved from the PopSet database. The first call of ELink finds a list of UIDs of nucleotide sequences from the Nucleotide database, which were included in the found population sets. The second call of ELink generates a list of genes from the Gene database, which are linked to these nucleotide sequences. The result of the last call of ELink is a list of UIDs of proteins from the Protein database that are associated with the previously obtained list of genes.
Similarly to the interactive orchestration in the simple and advanced searching modes, the destination database can be the same as the source database if we want to find neighbors in sequence and structure nodes.
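A predetermined exploration path of this kind can be written down as a plain list of tool calls. The sketch below is not the macro format used by search GenBank; plan_macro is a hypothetical helper, and the database identifiers (popset, nuccore, gene, protein) are the Entrez names we assume correspond to the PopSet, Nucleotide, Gene and Protein databases.

```python
from typing import Dict, List, Tuple

# One step of the plan: (tool name, parameters known before execution).
Step = Tuple[str, Dict[str, str]]


def plan_macro(start_db: str, term: str, link_dbs: List[str]) -> List[Step]:
    """Expand an initial query and a chain of databases into eUtils calls."""
    steps: List[Step] = [("ESearch", {"db": start_db, "term": term})]
    source = start_db
    for target in link_dbs:
        # Each hop needs one ELink call from the current database to the next.
        steps.append(("ELink", {"dbfrom": source, "db": target}))
        source = target
    # Complete records are fetched only from the last database in the chain.
    steps.append(("EFetch", {"db": source}))
    return steps


plan = plan_macro("popset", "mouse[organism]", ["nuccore", "gene", "protein"])
print(" -> ".join(tool for tool, _ in plan))
```

At run time, the UIDs produced by each step would be supplied to the next one, either explicitly or through the Entrez History Server.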
Results and discussion
search GenBank provides a web portal that allows users to search and retrieve information from NCBI databases. In the beginning, it was designed to display only nucleotide sequences of the GenBank database, hence the name of the portal. At that time, we cooperated with the Department of Internal Diseases, Diabetology and Nephrology, Medical University of Silesia, Zabrze, Poland, as we needed reference sequences for the comparison of DNA samples obtained from patients’ serum in order to find mutations while diagnosing different types of diabetes mellitus. Since then, we have redeveloped the entire system and the web portal and extended its functionality towards more sophisticated searching, which includes queries and macros and involves other NCBI databases. For historical reasons, however, the name of the portal remained unchanged.
Nucleotide - the main database of nucleotide sequences (including GenBank),
dbEST - the database of EST sequences (Expressed Sequence Tag),
dbGSS - the database of GSS sequences (Genome Survey Sequence),
Genome - representing completely sequenced organisms and those for which sequencing is in progress,
PopSet - the database of sequences from a single population study,
Taxonomy - the taxonomy database,
Gene - the database of known genes,
OMIM - the database of all known diseases with genetic components,
SNP - the database of single nucleotide polymorphisms,
PubMed - the database of citations and abstracts for biomedical publications,
PMC - the database of free full-text biomedical and life sciences journal literature,
Journals - the database of scientific journals,
Protein - the database of amino acid sequences.
The web portal allows the exploration of resources from the databases mentioned above in the same manner as the NCBI Entrez service (http://www.ncbi.nlm.nih.gov/guide/). Apart from the standard search method, in which a user manually enters a query into the search box, we also made available an advanced search module, which allows the user to define appropriate limits and constraints restricting the results of the query.
The search GenBank internet portal is also equipped with a module for building macros. Macros are used in order to automate searching other databases for records which are related to those that are the result of the initial query. This module is an innovative part of search GenBank and we did not find any equivalent software that would offer the same functions.
The application interface was designed and composed taking into account the expectations of modern users of Internet services and is compatible with the accepted principles of transparency in presenting information on web sites.
Registered users can save the queries they have entered and the macros they have built. Saved elements can then be reused without remembering the configuration of a macro or writing a complex query again.
The web portal is available at: http://sgb.biotools.pl. In the following sections we will present the functionality of the search GenBank system.
Quick and simple data searching
Result for query - shows the user’s query translated to the form accepted by the NCBI Entrez search engine, together with the number of records satisfying the query, and also gives the possibility to save the query (save query);
Other variants - shows the list of suggested alternative queries together with the expected number of results;
Did you mean? - is optional and points to possible errors in the spelling of the provided query.
It is also possible to find neighbors (related records of the same type) by using available links. For example, browsing a genetic record from the Nucleotide database we can find highly similar sequences (by BLAST score) to current records or genomic sequence records that have the current mRNA record as an annotated feature marking the exons of genes.
Advanced data searching
Terms extracted from the phrases entered by a user are not always mapped to the database fields, which the user would expect (see Figure 7, Results for query section). As a result, the list of returned records can often be too large or may contain inappropriate records. Advanced searching in the search GenBank portal allows for the precise composition of queries, using search fields that are specific for the selected database and additional limiting elements. Queries are constructed according to the rules presented in the section Query syntax and construction, typically combining many simple filtering criteria by the use of Boolean operators. For example:
“hnf1a”[GENE] AND “human”[ORGN] AND “MODY”[ALL]
1. Choose the database from the main query form (database drop-down list at the top of the page).
2. Create your own complex query:
a. Choose a logical operator combining appropriate phrases of the query.
b. Choose the search field.
c. Enter the search phrase.
d. Click the Add to search box button.
e. Repeat steps a through d, if necessary.
3. Press the Search button, which is located at the top of the web page, next to the drop-down list with the names of databases.
Moreover, the module of advanced searching also allows users to enter additional information in order to limit the number of query results. Constraints can be imposed on the range of publication dates or modification dates for records in the specified database. To use this feature, appropriate fields of the search form should be completed in the query builder (Figure 9, point 3).
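One way such date constraints could be passed on to ESearch is through its datetype, mindate and maxdate parameters. Whether search GenBank uses exactly these parameters is an assumption; the helper with_date_limits is hypothetical and only extends an existing parameter set.

```python
from typing import Dict

# Date kinds covered by the text: publication date and modification date.
SUPPORTED_DATE_TYPES = {"pdat", "mdat"}


def with_date_limits(params: Dict[str, str], date_type: str,
                     first: str, last: str) -> Dict[str, str]:
    """Return a copy of params restricted to a date range (YYYY/MM/DD)."""
    if date_type not in SUPPORTED_DATE_TYPES:
        raise ValueError(f"date_type must be one of {sorted(SUPPORTED_DATE_TYPES)}")
    limited = dict(params)  # leave the caller's dictionary unchanged
    limited.update({"datetype": date_type, "mindate": first, "maxdate": last})
    return limited


base = {"db": "pubmed", "term": "hnf1a[GENE] AND human[ORGN]"}
print(with_date_limits(base, "pdat", "2010/01/01", "2012/12/31"))
```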
Results of the constructed query are presented in the same way as results of the simple searches. The clipboard options and links are also available, and work in the same way for the results of queries made using the advanced search form.
In order to construct a macro, it is necessary to specify the name of the initial database in which to search records that match the entered query. Then, from a shared list, the user chooses additional links that allow traversing to other databases or the same database, when looking for neighbors. After the macro is completed, it can be executed and saved in the dictionary of macros, provided that you are logged in as a registered user.
It should be noted that macros do not show records from the initial and intermediate databases and do not require reviewing and selecting any records. Users obtain a list of records from the last of the linked databases. In theory, the number of links entered into a macro is unlimited, but we should take into account the fact that not all of the records in the NCBI databases have annotations linking related items in other databases. However, over time, the number of links between the NCBI databases will grow. Therefore, we can be optimistic that, in the future, macros will prove to be a great alternative to laborious, manual exploration between databases.
Problem: Find all genes for amino acid sequences, corresponding to a protein called: topoisomerase
Found: 20 records
Problem: Find nucleotide sequences for mouse, and then all articles available in PubMed that are related to these nucleotide sequences
Found: 785 records
Problem: Find all possible records from the PopSet database corresponding to breast cancer, then search for the related nucleotide sequences. Link the found nucleotide sequences with protein sequences.
Found: 1081 records
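The three problems above could be written down as macro definitions along the following lines. This is only a sketch under stated assumptions: the `Macro` class and `path_of` function are hypothetical, the query strings are guesses at how each problem might be phrased, and the Entrez-style link names (`protein_gene`, `nuccore_pubmed`, `popset_nuccore`, `nuccore_protein`) are assumed rather than taken from the original macros. The sketch only checks that the links of each macro chain from one database to the next; it does not query NCBI.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Macro:
    start_db: str                 # database searched with the initial query
    query: str                    # initial query (assumed wording)
    link_names: Tuple[str, ...]   # link names of the form "<from>_<to>" (assumed)


def path_of(macro: Macro) -> List[str]:
    """Return the databases visited, checking that each link starts where the last ended.

    Simplification: link names with an extra subset suffix are not handled.
    """
    path = [macro.start_db]
    for name in macro.link_names:
        prefix = path[-1] + "_"
        if not name.startswith(prefix):
            raise ValueError(f"link {name!r} does not start from {path[-1]!r}")
        path.append(name[len(prefix):])
    return path


EXAMPLES = [
    Macro("protein", "topoisomerase", ("protein_gene",)),
    Macro("nuccore", "mouse[Organism]", ("nuccore_pubmed",)),
    Macro("popset", "breast cancer", ("popset_nuccore", "nuccore_protein")),
]

for macro in EXAMPLES:
    print(" -> ".join(path_of(macro)))
```

Validating the chain before execution catches a macro whose second link does not start from the database reached by the first one.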
The construction of search GenBank and the capabilities the system provides confirm that Web services play an important role in the integration and exploration of various biological resources. They also show how complex workflows over Web services can be coordinated through orchestration and choreography.
Depending on the exploration mode, search GenBank serves as an interactive orchestrator or as a choreographer executing various exploration paths. Other tools that orchestrate Web services, such as Taverna and SADI, as well as the commercial MapForce and SQL Server Integration Services, are general-purpose and make it possible to explore a broader range of resources and Web services. SADI goes further by adding semantics to the discovery of distributed data resources. However, these tools have different target users, for example specialists in bioinformatics and data flow architects, whereas search GenBank is intended for biochemists, biologists and medical doctors, among others. Users of search GenBank do not have to deal with implementation details, Web service interfaces, domain-specific query languages, or how to connect the inputs and outputs of particular tools. These details are hidden behind the web GUI that our system provides. In search GenBank, exploration paths can be constructed dynamically, and users are able to traverse between various data sources while always remaining free to take another step. On the other hand, when working with macros, users construct predefined traversal paths that are similar to workflows in Taverna and SQL Server Integration Services. Macros offer more limited possibilities, but they are built in a simpler manner, through a friendly graphical user interface suited to these users.
In the future, we plan to develop the system further by extending its capabilities for accessing biomedical resources available worldwide. We want search GenBank to provide integrated access to distributed tools and biological data from various scientific repositories, hiding the implementation details of particular functionality behind a friendly graphical interface.
search GenBank provides an internet portal that allows simple and advanced data searching in GenBank and other databases maintained by the National Center for Biotechnology Information. Furthermore, the possibility of creating macros allows for cross-database exploration of related data. This is a unique feature of the search GenBank system. Currently, the strength of this solution may not be fully utilized, because the existing relationships between records of different databases are not very complex. However, we believe that the potential of automating the discovery of useful information in biomedical databases will grow with the increasing density of relationships between data stored in particular databases.
The search GenBank system has been designed for people involved in the analysis of biological data, including biochemists, molecular biologists, medical doctors, staff of genetic laboratories and molecular pathologists. Registered and logged-in users of the system can save queries and macros in dedicated dictionaries so that they can reuse them when conducting similar studies in the future.
search GenBank complements the capabilities of the NCBI Entrez portal. It concentrates largely on genetic data, based on the assumption that genetic data are currently the most frequently used data in life sciences. However, it also allows data searching and data exploration in other NCBI databases.
Availability and requirements
Project name: search GenBank
Project home page: http://zti.polsl.pl/dmrozek/science/sgb/sgb.htm
Operating systems: Platform independent
Programming language: PHP
Other requirements: Apache 2.2.10 or higher, MySQL 5.1.30 or higher, PHP 5.2.6 or higher with the following extensions: com_dotnet, ctype, session, filter, ftp, hash, iconv, json, odbc, pcre, date, libxml, standard, tokenizer, zlib, SimpleXML, dom, SPL, wddx, xml, xmlreader, xmlwriter, apache2handler, curl, gd, mbstring, mysql, mysqli, PDO, pdo_mysql, soap, SQLite
License: free for academics
Any restrictions to use by non-academics: license needed
The scientific research presented in the paper was partially supported by the Ministry of Science and Higher Education, Poland, in the years 2008-2011 (Grant No. N N516 265835) and by the European Union from the European Social Fund (grant agreement number: UDA-POKL.04.01.01-00-106/09).
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.