SPANG: a SPARQL client supporting generation and reuse of queries for distributed RDF databases
© The Author(s). 2017
Received: 21 July 2016
Accepted: 6 February 2017
Published: 8 February 2017
Toward improved interoperability of distributed biological databases, an increasing number of datasets have been published in the standardized Resource Description Framework (RDF). Although the powerful SPARQL Protocol and RDF Query Language (SPARQL) provides a basis for exploiting RDF databases, writing SPARQL code is burdensome for users including bioinformaticians. Thus, an easy-to-use interface is necessary.
We developed SPANG, a SPARQL client that has unique features for querying RDF datasets. SPANG dynamically generates typical SPARQL queries according to specified arguments. It can also call SPARQL template libraries constructed in a local system or published on the Web. Further, it enables combinatorial execution of multiple queries, each with a distinct target database. These features facilitate easy and effective access to RDF datasets and integrative analysis of distributed data.
SPANG helps users to exploit RDF datasets by generation and reuse of SPARQL queries through a simple interface. This client will enhance integrative exploitation of biological RDF datasets distributed across the Web. This software package is freely available at http://purl.org/net/spang.
Keywords: Semantic Web, SPARQL, RDF, Database integration, Unix command
Because of advances in biotechnologies, various types of biological data have drastically increased in the past decade. Because of the volume, heterogeneity, and continual growth of biological data, it has become increasingly difficult for individual researchers to manage an entire dataset in a single repository. In this context, Semantic Web technology has attracted attention as a promising approach to knowledge management. In the Semantic Web, all information is described in the Resource Description Framework (RDF), in which every piece of information takes the form of a triple consisting of a subject, predicate, and object, and each resource is represented by a Uniform Resource Identifier (URI). RDF works as a general framework for knowledge representation, and the URI assures valid integration of data collected from different sources. Furthermore, knowledge extraction from RDF can be implemented using a powerful query language called the SPARQL Protocol and RDF Query Language (SPARQL). SPARQL specifications include federated query functionality, by which distributed databases can be queried in an integrative manner. Thus, Semantic Web technology provides a basis for flexible integration of the increasing amount of heterogeneous data. In fact, many biological databases have already adopted the Semantic Web [6–9].
Despite the well-designed basis of Semantic Web technologies, several obstacles still prevent users, including bioinformaticians, from utilizing RDF databases. The main hurdle for most users is writing SPARQL, which often involves cumbersome coding tasks. For example, SPARQL permits inclusion of subqueries for distinct endpoints in a federated query; however, writing such a nested query is complicated and can be a technical obstacle for most users. Several approaches for supporting SPARQL coding currently exist. Examples include SPARQL editors with useful functionalities such as URI autocompletion, and graphical support for step-by-step construction of SPARQL queries [11, 12]. Despite these approaches, constructing executable SPARQL code, even for a simple query, remains a time-consuming task; thus, a mechanism that saves time in preparing SPARQL code is necessary to maximize the use of available RDF datasets. As an alternative approach to this issue, a wiki-based portal for sharing SPARQL queries was constructed, which can bypass the burdensome coding task. Although the queries registered on this service can be executed on the portal site, a mechanism for reusing these queries in other environments would maximize the usefulness of the accumulated queries.
Here, we developed SPANG, a client that supports querying by generation and reuse of SPARQL code through a simple interface. Taking advantage of the common “triple” form of RDF data, SPANG generates typical queries without the need for SPARQL coding. Even for complicated queries, SPANG can construct runtime queries using predefined templates. Regarding the federated query, SPANG realizes a similar functionality by combining multiple queries through a Unix pipe. SPANG, with its unique features, minimizes the burden of coding SPARQL, thereby enhancing integrative exploitation of distributed databases.
Shortcut mode, in which users need only specify command-line options to generate a simple query. Specific command-line options, including -S SUBJECT, -P PREDICATE, -O OBJECT, -L LIMIT, and other modifiers, are interpreted as shortcuts for generating typical SPARQL queries (see Additional file 1).
Template mode, in which users can generate a query using a SPARQL template and parameters. The template can be either a local file or a remote file published on the Web. The specified parameters replace the placeholders included in the template to generate a runtime query.
Although each spang process submits a query to a single specified database, it can be combined with other Unix processes through a Unix pipe. Notably, multiple spang commands, each with a distinct target database, can be combined through a Unix pipe by transferring variable bindings between queries, thereby realizing federated use of multiple databases.
The SPANG package is implemented in Perl. Specifically, the spang command accesses remote SPARQL endpoints using the Perl LWP module. To lower the initial hurdle of querying with SPARQL, the SPANG package provides predefined configurations, including i) nicknames for SPARQL endpoints, ii) frequently used prefix declarations for URIs, and iii) SPARQL template libraries. Furthermore, users can extend the configurations by preparing user-defined configuration files.
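The role of the predefined configurations can be illustrated with a minimal Python sketch (not SPANG's Perl implementation). It assumes a hypothetical nickname table and prefix table, resolves an endpoint nickname, prepends only the prefix declarations that the query uses, and builds a SPARQL protocol GET URL carrying the query in the `query` parameter. The table contents, the function names, and the choice of a GET request are assumptions made for illustration; no request is sent.

```python
import re
from urllib.parse import urlencode

# Hypothetical configuration tables (assumptions for illustration only;
# they are not the contents of SPANG's shipped configuration files).
ENDPOINTS = {"uniprot": "https://sparql.uniprot.org/sparql"}
PREFIXES = {
    "uniprot": "http://purl.uniprot.org/uniprot/",
    "up": "http://purl.uniprot.org/core/",
}


def prefix_declarations(query, prefixes=PREFIXES):
    """Return PREFIX lines for every known prefix that occurs in the query."""
    lines = []
    for name in sorted(prefixes):
        # Match "name:" only when it is not the tail of a longer token.
        if re.search(rf"(?<![\w-]){re.escape(name)}:", query):
            lines.append(f"PREFIX {name}: <{prefixes[name]}>")
    return "\n".join(lines)


def build_request_url(nickname, query):
    """Resolve an endpoint nickname and encode the query as a GET URL."""
    endpoint = ENDPOINTS[nickname]
    header = prefix_declarations(query)
    full_query = f"{header}\n{query}" if header else query
    params = urlencode({"query": full_query})
    return endpoint + "?" + params
```

Calling `build_request_url("uniprot", "SELECT ?o WHERE { uniprot:P02649 up:organism ?o }")` returns a URL that an HTTP client could fetch; keeping the nickname and prefix tables as plain data mirrors the idea that users can extend them with their own configuration files.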
Simple queries using SPARQL shortcuts
spang uniprot -S uniprot:P02649 -a
spang uniprot -S uniprot:P02649 -P up:organism
spang uniprot -S uniprot:P02649 -P up:organism/up:scientificName
which retrieves the scientific name of the organism. Thus, the shortcut mode is typically used to retrieve resources that are associated with a specific subject via arbitrary predicates. More generally, the shortcut mode can generate SPARQL code containing a certain triple pattern (see Additional file 1). Adding the -q option to the command line outputs the generated SPARQL query without executing it, thereby allowing inspection of the internal operation. For the full list of available command-line options, simply type the command spang.
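The idea behind the shortcut mode can be sketched in a few lines of Python. The hypothetical function below maps subject, predicate, object, and limit arguments (analogous to -S, -P, -O, and -L) onto a single triple pattern, turning every unspecified position into a variable. The exact query text that SPANG emits is not reproduced in this paper, so the output format here is an assumption; the -q option shows the real query.

```python
def shortcut_query(subject=None, predicate=None, obj=None, limit=None):
    """Build a SELECT query with one triple pattern.

    Unspecified positions become the variables ?s, ?p, and ?o.
    Hypothetical helper; it only illustrates the mapping from
    command-line style arguments to a triple pattern.
    """
    s = subject or "?s"
    p = predicate or "?p"
    o = obj or "?o"
    # Project only the positions that were left as variables.
    variables = [term for term in (s, p, o) if term.startswith("?")]
    select = " ".join(variables) if variables else "*"
    query = f"SELECT {select}\nWHERE {{\n  {s} {p} {o} .\n}}"
    if limit is not None:
        query += f"\nLIMIT {int(limit)}"
    return query
```

For instance, `shortcut_query(subject="uniprot:P02649", predicate="up:organism/up:scientificName")` yields a query whose only pattern is `uniprot:P02649 up:organism/up:scientificName ?o .`, where the slash is a SPARQL 1.1 property path that follows the two predicates in sequence.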
Using SPARQL templates with parameters
spang uniprot uniprot_annot P02649
spang uniprot regex_class '^apolipoprotein'
where regex_class is a SPARQL template that searches for classes matching a given regular expression (see Additional file 2 for the SPARQL code). Although this query is submitted to the UniProt database in the example command line, the template can also be used to search other databases (see the practical use case of SPANG given below).
spang mbgd mbgdl:get_ortholog K9Z723
where mbgd is the MBGD SPARQL endpoint and mbgdl: is a prefix for abbreviating the URI of the template get_ortholog in the MBGD SPARQL library (see Additional file 2 for the code). The template can be specified in the full URI or in abbreviated form using the predefined prefix declarations. This example query searches the MBGD database for the orthologs of the specified protein K9Z723 (Photosystem II lipoprotein Psb27).
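The replacement of placeholders by parameters can be sketched as follows. This is a minimal Python illustration that uses the standard `string.Template` class; the template text, the `$pattern` placeholder syntax, and the helper names are hypothetical and are not the contents of Additional file 2 or SPANG's own placeholder notation.

```python
from string import Template

# Hypothetical template in the spirit of a class search by regular
# expression; the actual regex_class template may differ.
CLASS_SEARCH = Template(
    "SELECT ?class ?label\n"
    "WHERE {\n"
    "  ?class a owl:Class ;\n"
    "         rdfs:label ?label .\n"
    '  FILTER regex(?label, "$pattern", "i")\n'
    "}\n"
)


def escape_literal(value):
    """Escape backslashes and double quotes for a SPARQL string literal."""
    return value.replace("\\", "\\\\").replace('"', '\\"')


def fill_template(template, **params):
    """Substitute escaped parameters; a missing parameter raises KeyError."""
    escaped = {name: escape_literal(value) for name, value in params.items()}
    return template.substitute(escaped)
```

`fill_template(CLASS_SEARCH, pattern="^apolipoprotein")` produces a runtime query whose filter reads `regex(?label, "^apolipoprotein", "i")`. Escaping the parameter before substitution keeps a stray quote in user input from breaking the query.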
Combinatorial execution of multiple queries
spang mbgd mbgdl:get_ortholog K9Z723 | spang uniprot -S 1 -P rdfs:label
spang mbgd mbgdl:get_ortholog K9Z723 | spang uniprot uniprot_xref PDB
where uniprot_xref is a SPARQL template (see Additional file 2 for the code), which retrieves cross-references from the UniProt IDs given in the standard input to the database specified as the parameter (in this example, PDB). This example command line searches for entries in the Protein Data Bank (PDB) among orthologs of K9Z723.
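One way to picture the transfer of variable bindings through a pipe is the following Python sketch. It assumes that the upstream command prints tab-separated rows whose first column holds prefixed names, and that the downstream query receives them through a SPARQL `VALUES` clause. Both the input format and the use of `VALUES` are assumptions made for illustration, not a description of SPANG's internals.

```python
def read_first_column(stream):
    """Collect the first tab-separated field of each non-empty line."""
    values = []
    for line in stream:
        line = line.rstrip("\n")
        if line:
            values.append(line.split("\t")[0])
    return values


def query_with_bindings(values, predicate):
    """Bind upstream results to ?s with VALUES and follow one predicate."""
    terms = " ".join(values)
    return (
        "SELECT ?s ?o\n"
        "WHERE {\n"
        f"  VALUES ?s {{ {terms} }}\n"
        f"  ?s {predicate} ?o .\n"
        "}"
    )
```

In a pipeline, the first function would be given `sys.stdin` and the resulting query sent to the second database, for example `query_with_bindings(read_first_column(sys.stdin), "rdfs:label")`. Each query then stays small and targets one endpoint, which is the modularity that the pipe-based combination relies on.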
Practical use case of SPANG
spang atlas regex_class '^alzheimer'
spang atlas -S efo:EFO_0000249 -a
spang atlas diff_expr EFO_0000249 > result
cut -f1 result | spang uniprot filter_by_go GO_0045202 -a
The result includes the protein Q9Y2J0 (Rabphilin-3A; RPH3A). Recently, it was experimentally shown that reduction of rabphilin-3A in Alzheimer’s disease correlates with dementia severity and amyloid beta accumulation. Thus, stepwise execution of SPANG commands is a useful approach for RDF data integration and knowledge discovery.
All examples of SPANG commands used in this paper are summarized in a table, where they are compared with the corresponding plain SPARQL queries (Additional file 3). It shows that the burden of querying with SPARQL can be reduced by using SPANG commands.
In this paper, we presented SPANG, a SPARQL querying client that has several unique features. First, SPANG provides a shortcut mode that can generate a simple query containing a certain triple pattern. This mode aids querying with SPARQL and helps beginners start exploring RDF datasets. It is also useful for experienced users of SPARQL: useful information can often be obtained by retrieving adjacent nodes in RDF graphs, and submitting such simple queries efficiently is crucial in data mining. Second, for more complicated queries, SPANG provides a template mode, by which existing SPARQL code can be reused among users. This mode enhances the usage of SPARQL through development of SPARQL template libraries that represent reusable query patterns. The template libraries constructed by experienced users can help other users to efficiently utilize RDF databases. Third, the queries in either shortcut or template mode can be combined in the Unix command line to realize a more complex query. This modular structure of queries has two merits: it reduces the complexity of each SPARQL query, leading to easier implementation and debugging; and it extends the potential application of each query through combination with other queries or Unix commands.
The predefined SPARQL templates included in the SPANG package are available to help users query some biological RDF databases. However, the range of queries included in the package is limited to rather common ones. The potential use of SPANG can be further extended by database users or database providers through development of SPARQL template libraries. Although a service for sharing SPARQL queries exists, it is difficult to execute them directly for instant reuse by users. In SPANG, users can directly call SPARQL templates across the Web. Thus, if an RDF database provider, who knows best how the database should be used, publishes SPARQL template libraries, database usage can be considerably enhanced. This study suggests the possibility of an open framework for sharing queries in a reusable form. Future work may include the standardized use of the query templates, which will further facilitate the sharing of useful queries. Sharing not only data but also queries (i.e., means of interpreting data) on the Semantic Web platform will help the biological research community collaborate in knowledge integration and discovery.
SPANG enables easy generation of typical queries, thereby reducing the burden of writing SPARQL. SPANG also provides a framework for reusing and sharing arbitrary queries across the Web. Moreover, it enables users to execute complex queries by combining existing query templates. SPANG, with these unique features, facilitates integrative exploitation of published RDF datasets and supports knowledge discovery.
EFO: Experimental Factor Ontology
MBGD: Microbial Genome Database for Comparative Analysis
PDB: Protein Data Bank
RDF: Resource Description Framework
SPARQL: SPARQL Protocol and RDF Query Language
URI: Uniform Resource Identifier
Computational environments were supported by the Data Integration and Analysis Facility, National Institute for Basic Biology. We thank the advisers to the Tool Prototype for Integrated Database Analysis, the Database Integration Coordination Program of the National Bioscience Database Center, Japan Science Technology Agency.
This work was supported by the Database Integration Coordination Program of the National Bioscience Database Center, Japan Science Technology Agency (to I.U.) and the Tool Prototype for Integrated Database Analysis (to H.C.).
Availability of data and materials
Project name: SPANG
Project home page: http://purl.org/net/spang
Operating systems: Linux, Mac OS X, Unix
Programming language: Perl
HC performed the study and drafted the manuscript. IU participated in its design, and helped to draft the manuscript. All authors read and approved the final manuscript.
The authors declare that they have no competing interests.
Consent for publication
Ethics approval and consent to participate
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
1. Berners-Lee T, Hendler J, Lassila O. The Semantic Web. Sci Am. 2001;284:28–37.
2. Antezana E, Kuiper M, Mironov V. Biological knowledge management: the emerging role of the Semantic Web technologies. Brief Bioinform. 2009;10(4):392–407.
3. RDF 1.1 Concepts and Abstract Syntax. http://www.w3.org/TR/rdf11-concepts/. Accessed 7 Feb 2017.
4. SPARQL 1.1 Query Language. http://www.w3.org/TR/sparql11-query/. Accessed 7 Feb 2017.
5. SPARQL 1.1 Federated Query. http://www.w3.org/TR/sparql11-federated-query/. Accessed 7 Feb 2017.
6. Belleau F, Nolin MA, Tourigny N, Rigault P, Morissette J. Bio2RDF: towards a mashup to build bioinformatics knowledge systems. J Biomed Inform. 2008;41(5):706–16.
7. Jupp S, Malone J, Bolleman J, Brandizi M, Davies M, Garcia L, Gaulton A, Gehant S, Laibe C, Redaschi N, et al. The EBI RDF platform: linked open data for the life sciences. Bioinformatics. 2014;30(9):1338–9.
8. Katayama T, Wilkinson MD, Aoki-Kinoshita KF, Kawashima S, Yamamoto Y, Yamaguchi A, Okamoto S, Kawano S, Kim JD, Wang Y, et al. BioHackathon series in 2011 and 2012: penetration of ontology and linked data in life science domains. J Biomed Semantics. 2014;5(1):5.
9. Chiba H, Nishide H, Uchiyama I. Construction of an ortholog database using the semantic web technology for integrative analysis of genomic data. PLoS One. 2015;10(4):e0122802.
10. Rietveld L, Hoekstra R. YASGUI: not just another SPARQL client. In: The Semantic Web: ESWC 2013 Satellite Events. 2013. p. 78–86.
11. Schweiger D, Trajanoski Z, Pabinger S. SPARQLGraph: a web-based platform for graphically querying biological Semantic Web databases. BMC Bioinformatics. 2014;15:279.
12. Yamaguchi A, Kozaki K, Lenz K, Wu H, Kobayashi N. An intelligent SPARQL query builder for exploration of various life-science databases. In: The 3rd International Conference on Intelligent Exploration of Semantic Data (IESD). 2014.
13. Garcia Godoy MJ, Lopez-Camacho E, Navas-Delgado I, Aldana-Montes JF. Sharing and executing linked data queries in a collaborative environment. Bioinformatics. 2013;29(13):1663–70.
14. Uniprot Consortium. Activities at the Universal Protein Resource (UniProt). Nucleic Acids Res. 2014;42(Database issue):D191–8.
15. Uchiyama I, Mihara M, Nishide H, Chiba H. MBGD update 2015: microbial genome database for flexible ortholog analysis utilizing a diverse set of genomic data. Nucleic Acids Res. 2015;43(Database issue):D270–6.
16. Berman H, Henrick K, Nakamura H. Announcing the worldwide Protein Data Bank. Nat Struct Biol. 2003;10(12):980.
17. Kapushesky M, Emam I, Holloway E, Kurnosov P, Zorin A, Malone J, Rustici G, Williams E, Parkinson H, Brazma A. Gene expression atlas at the European bioinformatics institute. Nucleic Acids Res. 2010;38(Database issue):D690–8.
18. Malone J, Holloway E, Adamusiak T, Kapushesky M, Zheng J, Kolesnikov N, Zhukova A, Brazma A, Parkinson H. Modeling sample variables with an Experimental Factor Ontology. Bioinformatics. 2010;26(8):1112–8.
19. Bronner IF, Bochdanovits Z, Rizzu P, Kamphorst W, Ravid R, van Swieten JC, Heutink P. Comprehensive mRNA expression profiling distinguishes tauopathies and identifies shared molecular pathways. PLoS One. 2009;4(8):e6826.
20. Blake JA, Dolan M, Drabkin H, Hill DP, Li N, Sitnikov D, Bridges S, Burgess S, Buza T, McCarthy F, et al. Gene Ontology annotations and resources. Nucleic Acids Res. 2013;41(Database issue):D530–5.
21. Tan MG, Lee C, Lee JH, Francis PT, Williams RJ, Ramirez MJ, Chen CP, Wong PT, Lai MK. Decreased rabphilin 3A immunoreactivity in Alzheimer's disease is associated with Aβ burden. Neurochem Int. 2014;64:29–36.