Simple queries using SPARQL shortcuts
SPANG can generate and execute simple queries by specifying a set of SPARQL shortcuts and additional options. An example of such queries is,
where the first argument is the target SPARQL endpoint and the ensuing arguments are SPARQL shortcuts and an option. The uniprot in the first argument is a predefined nickname for the UniProt SPARQL endpoint [14]. The SPARQL endpoint can be specified either as a URL or, for simplicity, by a nickname. The uniprot: in the third argument is a prefix for URIs of UniProt entries. This example command line searches the UniProt database for statements that have the specified entry ID as a subject (Fig. 1). The -a option transforms the URIs in the search result into abbreviated forms using predefined prefix declarations. For example, the URI <http://www.w3.org/2000/01/rdf-schema#label> is transformed into rdfs:label. By default, the result is written to the standard output as tab-separated values, a format suitable for processing by line-oriented Unix programs. In addition, the subject, predicate, and object can be specified in combination, as in the following:
where the predicate up:organism is specified to confine the results to organism information. Instead of specifying a predicate, a property path can be used as follows:
which retrieves the scientific name of the organism. Thus, the shortcut mode is typically used to retrieve resources that are associated with a specific subject via arbitrary predicates. More generally, the shortcut mode can generate SPARQL code containing a single specified triple pattern (see Additional file 1). Adding the -q option to the command line outputs the generated SPARQL query without executing it, thereby allowing inspection of the internal operation. For the full list of available command-line options, simply type the command spang.
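The idea behind the shortcut mode can be illustrated with a minimal Python sketch. It is not SPANG's implementation: the function names build_query and abbreviate and the small prefix table are assumptions made for illustration, and the entry ID P02649 is borrowed from the template example in this paper. The sketch turns a subject, predicate, and object into a query with one triple pattern, leaving unspecified positions as variables, and shows the kind of URI abbreviation that the -a option performs.

```python
import re

# Assumed prefix table for illustration; SPANG ships its own predefined declarations.
PREFIXES = {
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "up": "http://purl.uniprot.org/core/",
    "uniprot": "http://purl.uniprot.org/uniprot/",
}


def build_query(subject=None, predicate=None, obj=None):
    """Build a SELECT query with one triple pattern.

    Positions that are not specified become the variables ?s, ?p, and ?o.
    The predicate may also be a property path such as "up:a/up:b".
    """
    terms = [subject or "?s", predicate or "?p", obj or "?o"]
    variables = [t for t in terms if t.startswith("?")]

    # Declare only the known prefixes that actually occur in the pattern.
    used = []
    for term in terms:
        for name in re.findall(r"([A-Za-z][\w-]*):", term):
            if name in PREFIXES and name not in used:
                used.append(name)

    lines = [f"PREFIX {name}: <{PREFIXES[name]}>" for name in used]
    lines.append("SELECT " + (" ".join(variables) if variables else "*"))
    lines.append("WHERE {")
    lines.append("  " + " ".join(terms) + " .")
    lines.append("}")
    return "\n".join(lines)


def abbreviate(uri):
    """Rewrite <full URI> as prefix:local when a known namespace matches."""
    bare = uri.strip("<>")
    for name, namespace in PREFIXES.items():
        if bare.startswith(namespace) and len(bare) > len(namespace):
            return f"{name}:{bare[len(namespace):]}"
    return uri  # unknown namespace: leave the URI unchanged


if __name__ == "__main__":
    print(build_query(subject="uniprot:P02649",
                      predicate="up:organism/up:scientificName"))
    print(abbreviate("<http://www.w3.org/2000/01/rdf-schema#label>"))
```

Printing the generated text instead of submitting it corresponds roughly to what the -q option is described as doing.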
Using SPARQL templates with parameters
Although the SPARQL shortcuts are useful for generating simple queries, they do not cover complicated queries that combine multiple triple patterns. Thus, SPANG provides a mechanism to generate arbitrary query patterns using SPARQL templates. An example is,
where the first argument is the target SPARQL endpoint, the second argument is the name of the SPARQL template, and the ensuing argument is the parameter of the template. The specified parameter replaces the placeholder (represented as $1) included in the template before execution (Fig. 2). uniprot_annot is the name of a SPARQL template included in the predefined SPARQL library and P02649 is a parameter. This example query retrieves annotations for the protein P02649 from the UniProt database.
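The placeholder substitution described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not SPANG's code: the function fill_template is hypothetical, and the template text below is a made-up stand-in rather than the actual uniprot_annot template from the library.

```python
import re

# Hypothetical template for illustration; NOT the real uniprot_annot definition.
TEMPLATE = """\
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX up: <http://purl.uniprot.org/core/>
PREFIX uniprot: <http://purl.uniprot.org/uniprot/>
SELECT ?annotation_type ?comment
WHERE {
  uniprot:$1 up:annotation ?annotation .
  ?annotation a ?annotation_type ;
              rdfs:comment ?comment .
}
"""


def fill_template(template, params):
    """Replace positional placeholders $1, $2, ... with the given parameters."""

    def substitute(match):
        index = int(match.group(1))
        if index < 1 or index > len(params):
            raise ValueError(f"no value supplied for placeholder ${index}")
        return params[index - 1]

    # \d+ is greedy, so $10 is read as placeholder 10 rather than $1 followed by 0.
    return re.sub(r"\$(\d+)", substitute, template)


if __name__ == "__main__":
    print(fill_template(TEMPLATE, ["P02649"]))
```

Raising an error for a missing parameter is a design choice of this sketch; a real tool might instead fall back to a default value declared in the template.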
Whereas the templates usually assume specific target databases, some templates are generally applicable to any SPARQL endpoint; for example,
where regex_class is a SPARQL template that searches for classes matching a given regular expression (see Additional file 2 for the SPARQL code). Although this query is submitted to the UniProt database in the example command line, the template can also be used to search other databases (see the practical use case of SPANG given below).
Available SPARQL templates are not limited to the local library. When SPARQL libraries are published on the Web, users can call the templates by means of URIs across the Web. We have prepared a SPARQL template library for the Microbial Genome Database (MBGD) [15], which is available at http://mbgd.genome.ad.jp/sparql/library/. This library can be utilized in a command line such as
where mbgd is the MBGD SPARQL endpoint [9] and mbgdl: is a prefix for abbreviating the URI of the template get_ortholog in the MBGD SPARQL library (see Additional file 2 for the code). The template can be specified either as a full URI or in abbreviated form using the predefined prefix declarations. This example query searches the MBGD database for the orthologs of the specified protein K9Z723 (Photosystem II lipoprotein Psb27).
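How a template reference might be resolved can be sketched as follows. The function resolve_template is hypothetical, and the mapping of the mbgdl: prefix to the library URL given above is an assumption inferred from the text, not a confirmed detail of SPANG. The sketch only computes the URI; fetching the template over HTTP is left out.

```python
# Assumed mapping: the text gives the library URL and the prefix name, but
# their exact pairing inside SPANG is not confirmed here.
TEMPLATE_PREFIXES = {
    "mbgdl": "http://mbgd.genome.ad.jp/sparql/library/",
}


def resolve_template(ref, prefixes=TEMPLATE_PREFIXES):
    """Return the full URI of a remote template, or None for a local name.

    A full URI is returned unchanged, an abbreviated form such as
    "mbgdl:get_ortholog" is expanded, and anything else is treated as the
    name of a template in the local library.
    """
    if ref.startswith(("http://", "https://")):
        return ref
    prefix, sep, local = ref.partition(":")
    if sep and prefix in prefixes:
        return prefixes[prefix] + local
    return None


if __name__ == "__main__":
    print(resolve_template("mbgdl:get_ortholog"))
    print(resolve_template("uniprot_annot"))  # local template name
```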
Combinatorial execution of multiple queries
In federated use of multiple databases, SPANG can connect queries for distinct target databases through a Unix pipe. Combining a spang command in shortcut mode and another one in template mode is also possible. An example of such a combination is,
where the first spang command is the same as the one presented in the previous subsection to search the MBGD database for orthologs of the protein K9Z723; the obtained list of proteins is used in the second command to search the UniProt database for annotations of those proteins (Fig. 3). The option -S 1 specifies that the values in the first column of the standard input are used as subjects. This combinatorial query enables integrative use of two databases distributed across the Web. Note that the output of the first command can also be used in a different query by altering the second command; for example,
where uniprot_xref is a SPARQL template (see Additional file 2 for the code), which retrieves cross-references from the UniProt IDs given in the standard input to the database specified as the parameter (in this example, PDB). This example command line searches for entries in the Protein Data Bank (PDB) [16] among orthologs of K9Z723.
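One plausible way to feed piped results into a second query is to collect one column of the tab-separated input and bind it with a SPARQL VALUES clause. The Python sketch below illustrates that idea only; whether SPANG uses VALUES internally is an assumption, the function names are hypothetical, and the two input rows are illustrative stand-ins for the output of a first command.

```python
import sys


def column_values(tsv_lines, column=1):
    """Collect the values of one column (1-based) from tab-separated lines."""
    values = []
    for line in tsv_lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        fields = line.split("\t")
        if column <= len(fields):
            values.append(fields[column - 1])
    return values


def values_clause(variable, terms):
    """Format a SPARQL VALUES clause that binds the variable to each term."""
    return f"VALUES {variable} {{ " + " ".join(terms) + " }"


def query_for_subjects(terms):
    """Build a query retrieving all statements about the given subjects."""
    return "\n".join([
        "PREFIX uniprot: <http://purl.uniprot.org/uniprot/>",
        "SELECT ?s ?p ?o",
        "WHERE {",
        "  " + values_clause("?s", terms),
        "  ?s ?p ?o .",
        "}",
    ])


if __name__ == "__main__":
    # Read piped input when available; otherwise use illustrative rows.
    if sys.stdin.isatty():
        lines = ["uniprot:K9Z723\texample\n", "uniprot:P02649\texample\n"]
    else:
        lines = sys.stdin
    print(query_for_subjects(column_values(lines, column=1)))
```

Binding all subjects in one query keeps the number of requests to the endpoint low, although very long input lists would probably need to be split into batches.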
Practical use case of SPANG
A series of queries that represents a practical use case of SPANG is described below. Suppose that we are examining Alzheimer’s disease by exploring genes associated with it. An important task would be to search for differentially expressed genes in Alzheimer’s disease patients. Differential gene expression data are available from the Gene Expression Atlas [17] constructed on the basis of a variety of samples that are curated and annotated with the Experimental Factor Ontology (EFO) [18]. Given that we do not know specific resource IDs in advance, we would begin the search with a specific keyword. The following query is available to search for relevant resources using a regular expression:
where atlas represents the SPARQL endpoint for the Gene Expression Atlas [7]. This example query gives us a term, EFO_0000249 (Alzheimer’s disease), that is defined in the EFO. The following command line can be used to obtain detailed information about the term:
which retrieves statements that have efo:EFO_0000249 as a subject. Figure 4 illustrates the following stepwise execution of SPANG. The command line shown below retrieves differentially expressed genes in samples of Alzheimer’s disease and saves the result as a file:
where diff_expr is a SPARQL template to search for differentially expressed genes specifying a condition of samples (in this example, Alzheimer’s disease). The result includes microarray probes showing signals of differential gene expression, cross-references from these probes to UniProt IDs, and the PubMed entries describing these experiments. In this particular example, the result is derived from a specific microarray experiment [19]. The obtained result can be further processed by other commands; the next command line extracts the first column (protein IDs) and filters them by Gene Ontology annotation [20] to select those related to “synapse” (GO_0045202):
The result includes the protein Q9Y2J0 (Rabphilin-3A; RPH3A). Recently, it was experimentally shown that reduction of rabphilin-3A in Alzheimer’s disease correlates with dementia severity and amyloid beta accumulation [21]. Thus, stepwise execution of SPANG commands is a useful approach for RDF data integration and knowledge discovery.
All examples of SPANG commands used in this paper are summarized in a table, where they are compared with the corresponding plain SPARQL queries (Additional file 3). The comparison shows that the burden of querying with SPARQL can be reduced by using SPANG commands.