Overview
Our work is motivated by the need for integrated query access to distributed services that compute information about lipids. Here we describe services that, starting with a SMILES representation of a lipid molecule, annotate the molecule with its functional groups and classify it according to these annotations (Figure 1). We illustrate, in conjunction with auxiliary SADI services, the integration of inferred class information with (i) information about proteins that interact with lipids belonging to the inferred lipid types and (ii) literature references relevant to the class of the lipid under investigation. To assess the quality of our framework, we document the performance of our classification service on a small subset of lipids, namely the eicosanoids, found in the LIPID MAPS database. Finally, for each analyzed LIPID MAPS eicosanoid molecule, we contrast the semantic definition of lipid classes in our ontology with the class description in the LIPID MAPS database.
Integration of Lipid Resources with SADI and SHARE
The first and most important step in constructing our prototype pipeline is the development and integration of SADI-based annotation and classification semantic web services. The functional group annotation service [20] consumes a molecule whose structure is specified as a SMILES string (Listing 1) and annotates it with the set of unique occurrences of functional groups found in the molecule through subgraph isomorphism detection (Listing 2). In the input, a molecule is related to its SMILES descriptor through the 'has attribute' relation (SIO_000008) as defined by the Semanticscience Integrated Ontology (SIO) [21]. Here and elsewhere in this work, the sio prefix refers to this ontology (see Methods). The descriptor itself is an instance of the type CHEMINF_000018 (also found in SIO) and is linked to its string value through the 'has value' datatype property (SIO_000300). In the output of the functional group annotation service, a molecule is related to a given functional group through the 'has proper part' predicate (SIO_000053).
Listing 1. A fragment of N3 RDF input to the functional group annotation service.
@prefix ss: <http://semanticscience.org/>.
@prefix sio: <http://semanticscience.org/resource/>.
ss:LMFA03010001 sio:SIO_000008 ss:LMFA03010001smiles.
ss:LMFA03010001smiles
rdf:type sio:CHEMINF_000018;
sio:SIO_000300 "CCCCC[C@H](O)/C=C/[C@H]1[C@H](O)C[C@H](O)[C@@H]1CC(=O)CCCCC(=O)O".
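Input of this shape can be generated programmatically. The following is a minimal sketch, not the authors' code: the function name build_annotation_input is hypothetical, and the rdf prefix declaration (which the fragment in Listing 1 omits) is added so that the output is self-contained. Because SMILES strings may contain backslashes to mark double-bond geometry, the sketch escapes them before writing the string literal.

```python
SS = "http://semanticscience.org/"
SIO = "http://semanticscience.org/resource/"
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


def build_annotation_input(lipid_id: str, smiles: str) -> str:
    """Return an N3 document shaped like Listing 1 for a single lipid.

    Hypothetical helper: the name and layout are assumptions made for
    illustration, not part of the published services.
    """
    # Escape backslashes and double quotes so the SMILES string remains
    # a valid N3 string literal.
    literal = smiles.replace("\\", "\\\\").replace('"', '\\"')
    lines = [
        f"@prefix ss: <{SS}>.",
        f"@prefix sio: <{SIO}>.",
        f"@prefix rdf: <{RDF}>.",
        # 'has attribute': molecule -> SMILES descriptor
        f"ss:{lipid_id} sio:SIO_000008 ss:{lipid_id}smiles.",
        f"ss:{lipid_id}smiles",
        # the descriptor is typed as CHEMINF_000018
        "    rdf:type sio:CHEMINF_000018;",
        # 'has value': descriptor -> SMILES string
        f'    sio:SIO_000300 "{literal}".',
    ]
    return "\n".join(lines)
```

Calling build_annotation_input("LMFA03010001", smiles) with the SMILES string of Listing 1 reproduces the triples of that listing, preceded by the three prefix declarations.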
Listing 2. A fragment of N3 RDF output of the functional group annotation service.
<http://semanticscience.org/LMFA03010001>
sio:SIO_000053 [a lipro:Propyl];
sio:SIO_000053 [a lipro:Alcohol];
. . . . .
sio:SIO_000053 [a lipro:Hydroxy_Compound].
In turn, the lipid classification service [22] consumes a molecule annotated with functional groups and annotates it with its LiPrO classification through the predicate lipidHasLiProClass, which is a subproperty of rdf:type and is defined in the supporting service ontology [23], abbreviated as ont:
<http://semanticscience.org/LMFA03010001> ont:lipidHasLiProClass lipro:LC_Prostaglandin.
While one could write a script to sequentially call the two services, this can also be automatically carried out by the SHARE client with an appropriate SPARQL query (Listing 3). Please note that in the interest of clear presentation, the queries presented here are truncated. The full, functioning queries, along with detailed instructions for executing them, are available [24].
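For comparison, the scripted alternative mentioned above can be sketched as follows. This is an illustration under stated assumptions rather than a tested client: the two endpoint URLs are placeholders (the real service locations are given in [20] and [22]), the content type text/rdf+n3 is an assumption, and we assume that the output of the annotation service can be passed to the classification service unchanged. The HTTP call is passed in as a parameter so that the chaining logic can be exercised without network access.

```python
import urllib.request
from typing import Callable

# Placeholder endpoints (assumptions); see [20] and [22] for the real services.
ANNOTATION_URL = "http://example.org/sadi/annotateFunctionalGroups"
CLASSIFICATION_URL = "http://example.org/sadi/classifyLipid"


def http_post(url: str, body: str, content_type: str = "text/rdf+n3") -> str:
    """POST an RDF document to a service and return the response body."""
    request = urllib.request.Request(
        url,
        data=body.encode("utf-8"),
        headers={"Content-Type": content_type, "Accept": content_type},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")


def classify_lipid(smiles_rdf: str,
                   post: Callable[[str, str], str] = http_post) -> str:
    """Call the annotation service, then feed its output to the classifier."""
    annotated = post(ANNOTATION_URL, smiles_rdf)
    return post(CLASSIFICATION_URL, annotated)
```

Even in this two-step case, the script hard-codes the order of the calls and the location of each service, which is the bookkeeping that the declarative query delegates to SHARE.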
Listing 3. A fragment of SPARQL query for classification of a lipid using the SHARE client. Full query is available [24].
[L1] SELECT DISTINCT ?liProClass ?LMClass ?LMClassId
[L2] FROM <http://unbsj.biordf.net/lipids/service-data/LMFA03010001.rdf>
[L3] WHERE
[L4] {
[L5] <http://semanticscience.org/LMFA03010001> lco:lipidHasLiProClass ?liProClass.
[L6] <http://semanticscience.org/LMFA03010001> lmo:lipidHasLMClass ?LMClass.
[L7] # 'has attribute'
[L8] ?LMClass sio:SIO_000008 ?LMClassIdRes.
[L9] # 'identifier'
[L10] ?LMClassIdRes rdf:type sio:SIO_000115.
[L11] # 'has value'
[L12] ?LMClassIdRes sio:SIO_000300 ?LMClassId
[L13] }
To execute the query, SHARE first obtains the SMILES information from the RDF file identified in the "FROM" clause (whose contents are shown in Listing 1). It then calls the annotation service with the SMILES string found in the input and obtains the functional group annotation. Having detected the compatibility of the annotation service output and the classification service input type, the client then calls the classifier service by providing the functional group annotations and obtains the identifiers of the LiPrO classes that match the given lipid. In addition, it discovers the LIPID MAPS classes that would apply to the lipid being classified (lines [L6]-[L12]). This is achieved by calling a third service [25] that simply consumes the identified LiPrO types and maps them to the corresponding LIPID MAPS types. There is absolutely no programming involved, nor any computer-aided workflow composition: the query is declarative and can be created by a user with no programming experience.
This query returned the molecule classified into two LiPrO classes, LC_Prostaglandin and LC_Isoprostane. The definitions of these classes match the structure of the original molecule in terms of participating functional groups and conform to the informal definitions of the corresponding LIPID MAPS types FA0301 and FA0311. Although rigorous query performance evaluation is not among the goals of our exploratory study, we note, for illustrative purposes, that the execution of the query as indicated took less than one minute. Most of this time was spent by the SHARE client automatically constructing the workflow, whereas each of the service calls took less than one second.
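The chaining behaviour described above can be pictured with a toy planner. The sketch below is a deliberately simplified model and not SHARE's actual algorithm: each service is reduced to the predicates it requires on its input and the single predicate it attaches, the registry contents are assumptions that mirror the three services used by Listing 3, and the planner simply works backwards from the predicate requested in the query. It assumes an acyclic registry.

```python
from typing import List, NamedTuple, Set


class Service(NamedTuple):
    name: str
    requires: Set[str]  # predicates the input must already carry
    provides: str       # predicate the service attaches to its input


# Toy registry (an assumption for illustration, not the SADI registry).
REGISTRY = [
    Service("annotate functional groups",
            {"sio:SIO_000008"}, "sio:SIO_000053"),
    Service("classify lipid",
            {"sio:SIO_000053"}, "lco:lipidHasLiProClass"),
    Service("map LiPrO class to LIPID MAPS class",
            {"lco:lipidHasLiProClass"}, "lmo:lipidHasLMClass"),
]


def plan(wanted: str, known: Set[str],
         registry: List[Service] = REGISTRY) -> List[str]:
    """Return the service names, in call order, needed to obtain `wanted`."""
    if wanted in known:
        return []  # already present in the input data, no call needed
    for service in registry:
        if service.provides == wanted:
            steps: List[str] = []
            # Recursively satisfy every prerequisite of this service first.
            for needed in sorted(service.requires):
                for step in plan(needed, known, registry):
                    if step not in steps:
                        steps.append(step)
            return steps + [service.name]
    raise LookupError(f"no registered service provides {wanted}")
```

With only the 'has attribute' predicate available in the input file, plan("lmo:lipidHasLMClass", {"sio:SIO_000008"}) yields the annotation, classification, and mapping services in that order, matching the sequence of calls described for Listing 3.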
Integration of Lipid Class Information with Reference Literature
Given lipid class information, it is now possible to identify the biological significance of the lipid under investigation by automatically mining the relevant literature for the roles of lipids belonging to this class. In a related effort, we have been exploring the use of text mining for the extraction of information from scientific literature as it pertains to lipids and related biological entities, such as proteins. We created a SADI service to carry out text mining (see Methods), making it possible to identify lipids mentioned in text corpora composed of the latest scientific literature. This service consumes the document location (specified as a URL) and annotates it with corresponding LiPrO lipid classes using SIO's 'is referred to by' relation, which links entities to sources of information about them. To support our experiments, text mining has been performed for a small set of published work in lipid-related research and the results stored in an RDF triple store, which will later be extended to accommodate a more comprehensive body of literature. An additional service, the Lipid Literature DB service, provides the LiPrO annotation for a given publication by retrieving the text mining results from this triple store [26]. This service accepts a LiPrO lipid class as input and can identify the instances of this class and its subclasses within published documents. By combining all these services, it becomes possible not only to classify a given lipid, but also to find information about its potential biological activity within published literature. For example, starting from a simple SMILES description of an eicosanoid, we can identify it as eicosapentaenoic acid, an essential fatty acid, and find the relevant literature (Listing 4).
Listing 4. A truncated literature retrieval query. Full query is available [24].
[L1] SELECT DISTINCT ?liProClass ?document ?LMClass ?LMClassId
[L2] FROM <http://unbsj.biordf.net/lipids/service-data/LMFA03010001.rdf>
[L3] WHERE
[L4] {
[L5] <http://semanticscience.org/LMFA03010001> lco:lipidHasLiProClass ?liProClass.
[L6] <http://semanticscience.org/LMFA03010001> lmo:lipidHasLMClass ?LMClass.
[L7] # 'has attribute'
[L8] ?LMClass sio:SIO_000008 ?LMClassIdRes.
[L9] # 'identifier'
[L10] ?LMClassIdRes rdf:type sio:SIO_000115.
[L11] # 'has value'
[L12] ?LMClassIdRes sio:SIO_000300 ?LMClassId.
[L13] # 'is referred to by'
[L14] ?liProClass sio:SIO_000212 ?document.
[L15] }
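The lookup over a class and its subclasses performed by the Lipid Literature DB service can be illustrated with a small in-memory model. This is a sketch under loud assumptions: the real service queries an RDF triple store, whereas here the class hierarchy and the text mining results are plain dictionaries, and all function names and any data passed to them are hypothetical.

```python
from typing import Dict, List, Set


def with_subclasses(cls: str, parents_of: Dict[str, Set[str]]) -> Set[str]:
    """Return `cls` together with all of its direct and indirect subclasses.

    `parents_of` maps each class to the set of its direct superclasses.
    """
    found = {cls}
    frontier = [cls]
    while frontier:
        parent = frontier.pop()
        for child, parents in parents_of.items():
            if parent in parents and child not in found:
                found.add(child)
                frontier.append(child)
    return found


def documents_for(cls: str,
                  parents_of: Dict[str, Set[str]],
                  mentions: Dict[str, Set[str]]) -> List[str]:
    """Return documents that mention `cls` or any of its subclasses.

    `mentions` maps a lipid class to the documents in which text mining
    found it, i.e. the 'is referred to by' links.
    """
    classes = with_subclasses(cls, parents_of)
    return sorted({doc for c in classes for doc in mentions.get(c, set())})
```

In this model, a query for a general class also returns documents annotated only with one of its more specific subclasses, which is the behaviour that line [L14] of Listing 4 relies on.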
Identification of Related Proteins Based on Lipid Type
Our final use case considers linking the lipid under investigation to potentially related proteins in metabolic or signalling pathways. This is achieved by formulating a SHARE client query that automatically discovers, invokes, and integrates the output of five very different SADI-compliant services. For protein interaction information, we rely on the LIPID MAPS Proteome DB (LMPD) [27], which links lipids to proteins that are involved in lipid metabolism or interactions.
We created a service that simply uses LMPD to provide UniProt entries for proteins related to a specified lipid category [28]. This service links a lipid within an identified high-level LIPID MAPS category (e.g. fatty acids) to its protein interaction partner through SIO's 'is related to' relation. However, this service is not strictly semantically interoperable with our classifier service. While the classifier service produces the closest matching LiPrO lipid class and another service maps LiPrO type annotations to LIPID MAPS type annotations [24], the higher-level lipid classes required by the protein retrieval service are not explicitly stated. To address this, we have created another LIPID MAPS-based service that takes a URI denoting an arbitrary class in the LIPID MAPS nomenclature, for example, a lower-level class like Prostaglandins (FA0301), and computes the higher-level LIPID MAPS category that contains this class, which for Prostaglandins is Fatty Acyls (FA) [29]. A simple SPARQL query prompts SHARE to integrate this higher-level class information with inferred lipid classification information (Listing 7).
Listing 7. A truncated SPARQL query to identify proteins related to the lipid under investigation. Full queries are available [24].
[L1] SELECT DISTINCT ?liProClass ?LMClass ?LMClassId ?LMCategory ?UniProtRecord
[L2] FROM <http://unbsj.biordf.net/lipids/service-data/LMFA03010001.rdf>
[L3] WHERE
[L4] {
[L5] <http://semanticscience.org/LMFA03010001> lco:lipidHasLiProClass ?liProClass.
[L6] <http://semanticscience.org/LMFA03010001> lmo:lipidHasLMClass ?LMClass.
[L7] # 'has attribute'
[L8] ?LMClass sio:SIO_000008 ?LMClassIdRes.
[L9] # 'identifier'
[L10] ?LMClassIdRes rdf:type sio:SIO_000115.
[L11] # 'has value'
[L12] ?LMClassIdRes sio:SIO_000300 ?LMClassId.
[L13] ?LMClass rdfs:subClassOf ?LMCategory.
[L14] # 'is related to'
[L15] ?LMCategory sio:SIO_000001 ?protein.
[L16] # 'is about'
[L19] ?UniProtRecord sio:SIO_000332 ?protein }
Here, lines up to [L12] are our base case: they bind the URI of a low-level LIPID MAPS class to the variable ?LMClass. Line [L13] maps this class to the corresponding higher-level LIPID MAPS category, and line [L15] retrieves the proteins related to that category by calling the corresponding service [28]. The auxiliary constraint in line [L19] extracts the URL of the corresponding UniProt record from the protein description returned by this service. A test run with SHARE terminates in less than a minute and returns a list of 603 proteins corresponding to fatty acyls (FA), the common LIPID MAPS category of the prostaglandins and isoprostanes, into which the lipid was classified. For this input, we received, among others, the URLs for the UniProt records P97363, a palmitoyltransferase, and P55249, a lipoxygenase, both of which are known to be involved in fatty acid metabolism. Thus, we have been able to integrate five different services to take our prototype pipeline from a simple SMILES specification of a lipid to its biological roles found in literature and its potential protein partners.
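The step from a low-level class to its category, performed in line [L13], follows the structure of LIPID MAPS codes, in which the two-letter category code prefixes every class code (FA0301 belongs to FA). A minimal sketch of that relationship is given below; it operates on code strings, whereas the actual service [29] consumes and returns class URIs, and the function name is hypothetical.

```python
import re

# The eight LIPID MAPS categories and their two-letter codes.
CATEGORIES = {
    "FA": "Fatty Acyls",
    "GL": "Glycerolipids",
    "GP": "Glycerophospholipids",
    "SP": "Sphingolipids",
    "ST": "Sterol Lipids",
    "PR": "Prenol Lipids",
    "SL": "Saccharolipids",
    "PK": "Polyketides",
}


def lipid_maps_category(code: str) -> str:
    """Return the category code for a class code or a record identifier.

    Accepts class codes such as "FA0301" and record identifiers such as
    "LMFA03010001" (hypothetical helper, string-based by assumption).
    """
    match = re.fullmatch(r"(?:LM)?([A-Z]{2})\d*", code)
    if match is None or match.group(1) not in CATEGORIES:
        raise ValueError(f"not a recognised LIPID MAPS code: {code!r}")
    return match.group(1)
```

Both FA0301 (Prostaglandins) and FA0311 (Isoprostanes) resolve to FA, which is why the two LiPrO classes inferred for LMFA03010001 lead to the same list of proteins.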
Evaluation of Lipid Classification Performance
The accuracy of lipid classification that we are capable of achieving defines to a great extent the applicability of our entire prototype lipidomics infrastructure. Within a randomly selected but representative sample of 150 eicosanoid lipids, our combination of annotation and classification services assigned a class consistent with the LIPID MAPS asserted classes in 96% of lipid entities considered. This figure includes about 8% of all lipids that were assigned to a more general class in the classification hierarchy than the target class, which is still a correct classification. Furthermore, approximately 4% of lipids received a classification that was entirely incorrect, and 7% of all lipids were classified into a LIPID MAPS-consistent and a LIPID MAPS-inconsistent class simultaneously. For instance, the LIPID MAPS record LMFA03010001 was correctly classified as a lipro:LC_Prostaglandin (the corresponding LIPID MAPS class is FA0301), but it was also classified as a lipro:LC_Isoprostane (LIPID MAPS class FA0311), which is incorrect. This is because additional non-structural information, namely the synthetic route, is required to differentiate the two classes.
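The outcome categories used in this evaluation can be made explicit with a small scoring function. The sketch below is one possible reading of the categories described above, not the procedure actually used to produce the reported figures: the function name, the outcome labels, and the representation of the hierarchy as a mapping from a class to its more general classes are all assumptions.

```python
from typing import Dict, Set


def outcome(predicted: Set[str], asserted: str,
            ancestors: Dict[str, Set[str]]) -> str:
    """Label one classification result against the asserted class.

    `ancestors` maps a class to all of its more general classes.
    """
    more_general = ancestors.get(asserted, set())
    # A prediction is consistent if it is the asserted class itself or
    # one of its more general classes.
    consistent = {p for p in predicted if p == asserted or p in more_general}
    inconsistent = predicted - consistent
    if not consistent:
        return "incorrect"
    if inconsistent:
        return "consistent and inconsistent"
    if asserted in consistent:
        return "exact"
    return "more general"
```

Under this reading, the result for LMFA03010001 (asserted FA0301, predicted FA0301 and FA0311) is labelled "consistent and inconsistent", and it counts towards both the 96% and the 7% figures.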
Misclassification was not always a reflection of insufficient class definitions in the supporting ontology. Despite manual curation of the LIPID MAPS database, it appears that the database is not currently entirely error-free. For instance, compounds LMFA03050020 and LMFA03050021 were classified as epoxyeicosatrienoic acids by our automated classification framework and as hydroxy/hydroperoxyeicosatrienoic acids in the LIPID MAPS database. In the Lipid Eicosanoid Ontology (LEO), the hydroxy/hydroperoxyeicosatrienoic acid class (LIPID MAPS FA0305) is specified using the following equivalent class expression.
'has proper part' exactly 1 Primary_Acyl_Chain
and 'has proper part' some Carboxylic_Acid
and 'has proper part' some (Alcohol or Hydroperoxide or Peroxide)
and 'has proper part' exactly 3 Alkenyl_Group
Similarly, the class of epoxyeicosatrienoic acids, equivalent to the LIPID MAPS class of the same name (FA0308), is specified in LEO with the following expression.
'has proper part' exactly 1 Primary_Acyl_Chain
and 'has proper part' some Epoxy
and 'has proper part' some Carboxylic_Acid
and 'has proper part' exactly 3 Alkenyl_Group
Using our functional group annotation service, both of these acids were found to contain (among other non-diagnostic groups) a combination of three alkenyl functional groups, a carboxylic acid functional group, and one epoxide on the main chain, without the alcohol, hydroperoxide, or peroxide functional groups necessary to classify them into FA0305. This suggests that these two fatty acids should, in fact, be reassigned into a different class in the LIPID MAPS database, namely into the epoxyeicosatrienoic acid class, FA0308. Thus, explicit class definitions in our lipid ontology have allowed us to efficiently and automatically identify and correct an error in the expert-curated database.
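The reasoning above can be replayed as a simple check over functional group counts. The sketch below is an approximation for illustration only: it evaluates the two class expressions by closed-world counting, whereas the classification service relies on OWL reasoning, and the count of one primary acyl chain for the two compounds is an assumption not stated explicitly in the text. The function names are hypothetical.

```python
from typing import Mapping


def is_fa0305(groups: Mapping[str, int]) -> bool:
    """Hydroxy/hydroperoxyeicosatrienoic acids, following the LEO expression."""
    oxygenated = (groups.get("Alcohol", 0)
                  + groups.get("Hydroperoxide", 0)
                  + groups.get("Peroxide", 0))
    return (groups.get("Primary_Acyl_Chain", 0) == 1
            and groups.get("Carboxylic_Acid", 0) >= 1
            and oxygenated >= 1
            and groups.get("Alkenyl_Group", 0) == 3)


def is_fa0308(groups: Mapping[str, int]) -> bool:
    """Epoxyeicosatrienoic acids, following the LEO expression."""
    return (groups.get("Primary_Acyl_Chain", 0) == 1
            and groups.get("Epoxy", 0) >= 1
            and groups.get("Carboxylic_Acid", 0) >= 1
            and groups.get("Alkenyl_Group", 0) == 3)


# Diagnostic groups reported for LMFA03050020 and LMFA03050021; the
# Primary_Acyl_Chain count is an assumption made for this sketch.
annotation = {
    "Primary_Acyl_Chain": 1,
    "Carboxylic_Acid": 1,
    "Epoxy": 1,
    "Alkenyl_Group": 3,
}
```

For this annotation, is_fa0308 returns True and is_fa0305 returns False, since no alcohol, hydroperoxide, or peroxide group is present; this mirrors the reassignment suggested above.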