A unified framework for managing provenance information in translational research
© Sahoo et al; licensee BioMed Central Ltd. 2011
Received: 3 May 2011
Accepted: 29 November 2011
Published: 29 November 2011
A critical aspect of the NIH Translational Research roadmap, which seeks to accelerate the delivery of "bench-side" discoveries to patient's "bedside," is the management of the provenance metadata that keeps track of the origin and history of data resources as they traverse the path from the bench to the bedside and back. A comprehensive provenance framework is essential for researchers to verify the quality of data, reproduce scientific results published in peer-reviewed literature, validate scientific process, and associate trust value with data and results. Traditional approaches to provenance management have focused on only partial sections of the translational research life cycle and they do not incorporate "domain semantics", which is essential to support domain-specific querying and analysis by scientists.
We identify a common set of challenges in managing provenance information across the pre-publication and post-publication phases of data in the translational research lifecycle. We define the semantic provenance framework (SPF), underpinned by the Provenir upper-level provenance ontology, to address these challenges in the four stages of provenance metadata:
(a) Provenance collection - during data generation
(b) Provenance representation - to support interoperability, reasoning, and incorporate domain semantics
(c) Provenance storage and propagation - to allow efficient storage and seamless propagation of provenance as the data is transferred across applications
(d) Provenance query - to support queries with increasing complexity over large data size and also support knowledge discovery applications
We apply the SPF to two exemplar translational research projects, namely the Semantic Problem Solving Environment for Trypanosoma cruzi (T.cruzi SPSE) and the Biomedical Knowledge Repository (BKR) project, to demonstrate its effectiveness.
The SPF provides a unified framework to effectively manage provenance of translational research data during pre and post-publication phases. This framework is underpinned by an upper-level provenance ontology called Provenir that is extended to create domain-specific provenance ontologies to facilitate provenance interoperability, seamless propagation of provenance, automated querying, and analysis.
During the pre-publication phases (Figure 1), provenance is collected to describe the experiment design, such as details about the biological or technical replication (RNA extracts or cDNA clones) in microarray experiments, the type of parasite used to create an avirulent strain, or the demographic information used in a clinical trial . Similarly, provenance information about the experiment platform (e.g. type of instruments used) and the tools used to process or analyze data (algorithms, statistical software) is also collected .
In the post-publication phase (Figure 1), data mining and knowledge discovery applications use provenance associated with the data extracted from peer-reviewed literature (e.g. PubMed), public data repositories (e.g. Entrez Gene), and Web resources (e.g. the European Bioinformatics Institute Web services) to guide analysis algorithms and interpretation of results . Specifically, the provenance information in post-publication phase is used to constrain extraction processes to reputable sources (e.g. journals with a high impact factor), clustering datasets according to their source, and ranking results based on the timestamp or authorship information .
Figure 1 illustrates that the provenance metadata follows similar lifecycle phases in both pre- and post-publication stages, but each stage has distinct requirements. We introduce two exemplar translational research projects, each corresponding to a specific stage, to describe the challenges that need to be addressed for creating an effective provenance management system.
The Semantic Problem Solving Environment for T.cruzi project (pre-publication)
T.cruzi is the principal causative agent of the human Chagas disease and affects approximately 18 million people, predominantly in Latin America. About 40 percent of these affected persons are predicted to eventually suffer from Chagas disease, which is the leading cause of heart disease and sudden death in middle-aged adults in the region. Research in T.cruzi has reached a critical juncture with the publication of its genome in 2005  and can potentially improve human health significantly. But, mirroring the challenges in other translational research projects, current efforts to identify vaccine candidates in T.cruzi and development of diagnostic techniques for identification of best antigens, depend on analysis of vast amounts of information from diverse sources. To address this challenge, the Semantic Problem Solving Environment (SPSE) for T.cruzi project has created an ontology-driven integration environment for multi-modal local and public data along with the provenance metadata to answer biological queries at multiple levels of granularity .
Traditionally, bench science has used manual techniques or ad-hoc software tools to collect and store provenance information (discussed further in the Discussion and Related Work section). This approach has several drawbacks, including the difficulty in ensuring adequate collection of provenance, creation of "silos" due to limited or no support for provenance interoperability across projects. Further, the use of high-throughput data generation technologies, such as sequencing, microarrays, mass spectrometry (ms), and nuclear magnetic resonance (NMR) are introducing additional challenges for the traditional approaches to provenance management. A new approach for provenance management is also required to support the increasing trend of publishing experiment results (e.g. microarray data) to community data repositories (e.g. European Bioinformatics Institute Arrayexpress for gene expression data  and NCBI GenBank ).
In the next section, we describe the BKR project corresponding to the post-publication stage.
The Biomedical Knowledge Repository project (Post-publication)
In addition to data, BKR project also includes provenance describing the source of an extracted RDF triple, temporal information (publication date for an article), version of a data repository, and confidence value associated with the extracted information (indicated by a text mining tool). For example, the provenance of the RDF statement "lipoprotein→affects→inflammatory_cells", the source article with PubMed identifier PMID: 17209178, is also stored in the BKR project (courier new font is used to represent RDF and OWL statements). The provenance information is used to support the services offered by BKR namely, (a) Enhanced information retrieval service that allows search based on named relationship between two terms, (b) Multi-document summarization, (c) Question answering, and (d) Knowledge discovery service.
Challenges to provenance management in translational research
Collecting provenance information in high throughput environments that is also adequate to support complex queries,
Representing the provenance information using a model that supports interoperability across projects, is expressive enough to capture the complexities of a specific domain (domain semantics), and allows use of reasoning software for automated provenance analysis over large datasets,
Efficiently storing and ensuring seamless propagation of provenance as the data is transferred across the translational research lifecycle,
A dedicated query infrastructure that allows composition of provenance queries with minimal user effort, addresses the requirements specific to provenance queries (e.g., support for transitive closure), and a highly scalable implementation to support complex user queries over large volumes of data.
We introduce a unified provenance management framework called semantic provenance framework based on the Provenir upper-level provenance ontology for use in both the pre- and post-publication phases of translational research.
We introduce a dedicated ontology-driven provenance collection infrastructure called Ontology-based Annotation Tool (OntoANT) that makes it easier for biomedical researchers to create and maintain web forms for use with bench experiments.
We illustrate the advantage of storing provenance metadata and data as a single RDF graph with significant impact on propagation of provenance.
We present the architectural details of a provenance query engine that can be deployed over multiple RDF databases and supports a set of dedicated provenance query operators.
In the next section, we describe SPF based on the notion of semantic provenance to address the provenance management challenges.
In contrast to traditional database and workflow provenance, semantic provenance incorporates domain-specific terminology represented using a logic-based formal model, which facilitates domain scientists to intuitively query provenance and also automated processing of provenance metadata . The semantic provenance framework (SPF) uses the Provenir upper-level provenance ontology as the core formal model coupled with Semantic Web technologies, including RDF , the Web Ontology Language (OWL) , and the SPARQL query language  for implementing provenance systems.
The Provenir ontology is domain-upper ontology that can be extended, using the standard rdfs: subClassOf and rdfs: subPropertyOf properties, for creating new domain-specific provenance ontologies. This approach of creating a suite of domain-specific ontologies by extending an upper-level ontology (instead of an unwieldy monolithic provenance ontology) facilitates provenance interoperability by ensuring consistent modeling and uniform use of terms  and is a scalable solution. This approach is also consistent with existing ontology engineering practices based on the Suggested Upper Merged Ontology (SUMO) , Basic Formal Ontology (BFO) , and the Descriptive Ontology for Linguistic and Cognitive Engineering (DOLCE) ). The Provenir ontology is modeled using the description logic profile of OWL (OWL-DL) .
In the following sections, we discuss the use of the Provenir ontology to implement the SPF for managing the four stages of the provenance metadata in the two translational research exemplar projects.
The first phase of the provenance life cycle begins with the collection of provenance information as data is generated or modified in a project. The challenges in this phase include, (a) minimizing the disruption to existing research environment, (b) automating the collection procedure to scale with high-throughput data generation protocols while minimizing the workload for researchers, and (c) creating a flexible infrastructure that can be easily modified in response to changing user requirements. In the following sections, we describe the provenance collection infrastructure created for the T.cruzi SPSE and the BKR projects.
Collecting provenance in the T.cruzi SPSE
Dynamically generate web forms for use in research projects to capture provenance information,
Allow automatic conversion of the data captured in the web forms to RDF, and
Use the built-in automatic validation of the web forms to ensure data quality and consistency with respect to the reference domain-specific provenance ontology (e.g. PEO)
Once a valid provenance pattern is created, the Form Manager is invoked to automatically generate and deploy the web form. The Form Generator component of the Form Manager automatically generates the web form as a set of paired entities, namely, a field name and the corresponding text box (or a drop down list in case of a "nominal" class). Each field name in a web form corresponds to the provenance ontology class, while the value in the text box represents the instance of the ontology class (Figure 8). For example, the field "Priority" (Figure 8) corresponds to the priority class in PEO and the values in the drop-down menu (High, Medium, and Low) correspond to the instance values. The Form Processing Engine component of the Form Manager allows users to modify the automatically generated form. The third component of the Form Manager is the Form Validator, which ensures that the data values entered in the web forms are consistent with the provenance ontology. For example, the Form Validator validates that for a user input value for a web form field is consistent with the ontology class definition or with the property range or domain constraints.
The RDF Manager component of OntoANT defines a set of Application Programming Interfaces (API) that can be used by other OntoANT components to access, construct queries, and generate as well as validate RDF triples. OntoANT is currently being used in the T.cruzi SPSE project to deploy web forms (OntoANT is accessible at: http://knoesis.wright.edu/OntoANT/design.jsp).
Provenance extraction in BKR
BKR collects the provenance information at two levels. At the first level the provenance information associated with an RDF triple is collected, such as the source of the triple (journal article, data repository), the date of the original publication, and the author list for the source article. At the second level, BKR records the provenance information associated with the extraction process, for example the confidence value associated with the extraction technique (in case of text processing tools). The provenance collection process in BKR is integrated with the RDF generation process, which is described in the next section on provenance representation. Provenance representation is a central issue in provenance management and has direct impact on the storage, querying, and analysis of provenance information in translational research.
Earlier, we had described the Provenir ontology that forms the core model of the SPF. In this section, we demonstrate that though the requirements for provenance representation in the pre-publication phase differ from the post-publication phase, the Provenir ontology can be extended to model provenance in both the T.cruzi SPSE (pre-publication) and BKR (post-publication) projects.
Parasite Experiment ontology: Modeling provenance in the T.cruzi SPSE project
PEO models the experiment protocols by specializing the Provenir ontology classes and properties. In addition, PEO re-uses classes and relationships from existing biomedical ontologies, including the Sequence ontology , the National Cancer Institute (NCI) thesaurus, Gene ontology , the W3C OWL Time ontology , and the Ontology for Parasite Lifecycle (OPL) . This facilitates interoperability of data modeled using PEO with data that conform to other existing ontologies . For example, we imported and seamlessly integrated functional gene annotations, using GO terms, from the TritrypDB  and KEGG with existing internal experiment data for a specific list of genes found in T.cruzi and related parasites. Hence, PEO creates a unified schema for both the domain-specific provenance information and data that can be extended (often re-using existing ontology classes) to adapt to evolving needs of bench scientists in the T.cruzi SPSE project.
Provenance Context: Representing provenance in the BKR project
A practical challenge for implementing the PaCE approach in the BKR is to formulate an appropriate provenance context-based Uniform Resource Identifier (URIp) scheme that also conforms to best practices of creating URIs for the Semantic Web, including support for use of HTTP protocol . The design principle of URIp is to incorporate a "provenance context string" as the identifying reference of an entity and is a variation of the "reference by description" approach that uses a set of description to identify an entity . The syntax for URIp consists of the <base URI>, the <provenance context string>, and the <entity name>. For example, the URIp for the entity lipoprotein is http://mor.nlm.nih.gov/bkr/PUBMED_17209178/lipoprotein where the PUBMED_17209178 provenance context string identifies the source of a specific instance of lipoprotein.
This approach to create URIs for RDF entities also enables BKR (and other Semantic Web applications using the PaCE approach) to group together entities with the same provenance context. For example,
are entities extracted from the same journal article. The multiple contextualized URIs representing a common type of entity, for example "lipoprotein", can be asserted to be instances of a common ontology class by using the rdf: type property. In the next section, we address the issues in provenance storage and propagation stage of the provenance lifecycle.
Provenance Storage and Propagation
The current high-throughput data generation techniques, including gene sequencing and DNA microarray, have created very large datasets in biomedical applications [7, 8]. Though the capture and storage of provenance associated with the above datasets leads to an exponential increase in the total size of the datasets , provenance plays an important role in optimizing the access and query of the datasets [35, 14]. There are two approaches to store provenance, namely (a) provenance is stored together with the dataset, and (b) provenance is stored separately from the data (and combined on demand).
The SPF uses the first approach by storing both the data and provenance together in a single RDF graph. The primary motivation for selecting the first approach is to allow applications to flexibly categorize an information entity as either data or provenance metadata according to evolving user requirements. For example, the temperature of a gene knockout experiment (in the T.cruzi SPSE project) is provenance information, which can be used to query for results generated using similar temperature conditions. In contrast, the body temperature of a patient in clinical research scenario is a data value and not provenance information. Hence, this application-driven distinction between provenance metadata and data is a critical motivation for storing provenance and data together in the SPF. In addition, storing provenance together with the data makes it easier for application to also ensure that updates to data are seamlessly applied to the associated provenance. Ensuring synchronization between the data and separately stored provenance is challenging especially in a high-throughput data generation scenarios and the provenance information may become inconsistent with the data.
An essential requirement for provenance storage is ensuring the propagation of provenance as the data traverses the translational research life cycle, for example the provenance of gene expression profiling experiment results is used in a downstream application such as biological pathway research. The integrated approach for provenance storage allows seamless propagation of provenance information with the data. In contrast, it is often difficult to transfer provenance separately from the data across projects, institutions, or applications. Further, many applications often query the provenance metadata to identify relevant datasets to be imported and analyzed further, for example identifying a relevant patient cohort for clinical research requires identifying qualifying health care providers, the geographical location of the patients, and related provenance information. Hence, if the provenance associated with a patient health record is stored separately and cannot be easily propagated and accessed by the clinical researcher then it adversely affects translational research projects.
Though storing provenance and data together has many advantages, one of the challenges that needs to be addressed is the large size of the resulting datasets. Cloud-based storage solutions, such as Simple Storage Service (S3) from Amazon and Azure blob from Microsoft have been proposed to effectively address these issues .
Provenance storage in the T.cruzi SPSE
Details of the RDF instance base in the T.cruzi SPE project
Number of Experiment Runs
Total RDF Triples
Provenance-specific RDF Triples (% of total triples)
1. Proteome analysis
3. Gene Knockout
4. Strain Creation
Provenance storage in the BKR project
Number of provenance-aware RDF triples generated using the PaCE and RDF reification vocabulary
RDF Reification vocabulary
Total Number of RDF triples
Provenance-specific RDF triples
Exhaustive approach (E_PaCE): Capturing the provenance of the S, P, and O elements of the RDF triple increased the total size of the BKR dataset to 113.1 million RDF triples
Minimal approach (M_PaCE): 48.3 million additional RDF triples (total 71.6 million RDF triples) were created using this approach
Intermediate approach (I_PaCE): A total of 94.7 million RDF triples were created using the I_PaCE approach
Table 2 also clearly illustrates the decrease in the number of provenance-specific RDF triples as compared to the RDF reification vocabulary approach. The open source Virtuoso RDF store version 06.00.3123 was used to store the BKR datasets. Similar to the T.cruzi SPSE project, the provenance metadata associated with the BKR data is seamlessly propagated along with the data since both are represented in a single RDF graph.
Provenance Query and Analysis
The provenance literature discusses a variety of queries that are often executed using generic or project-specific query mechanisms that are difficult to re-use. Provenance queries in workflow systems focus on execution of computational process and their input/output values. Provenance queries in relational databases trace the history of a tuple or data entity . In contrast, scientists formulate provenance queries using domain-specific terminology and follow the course of an experiment protocol [16, 37]. In addition, provenance queries over scientific data often exhibit "high expression complexity"  reflecting the real world complexity of the scientific domain .
Provenance query operator input and output value
1. provenance ( )
Data entity (instance of Provenir data_collection class)
Provenance of data entity
_context ( )
Provenance of data entity (instances of Provenir data, agent, and process classes)
Data entity(s) (satisfying the provenance constraints)
_compare ( )
Provenance of two data entities (RDF files)
True (if provenance of two data entities are equivalent), otherwise False
_merge ( )
Two sets of provenance information (RDF files)
Merged provenance information
provenance ( ) query operator - to retrieve provenance information for a given dataset,
provenance_context ( ) query operator - to retrieve datasets that satisfy constraints on provenance information,
provenance_compare ( ) query operator - given two datasets, this query operator determines if they were generated under equivalent conditions by comparing the associated provenance information, and
provenance_merge ( ) query operator - to merge provenance information from different stages of an experiment protocol. In the T.cruzi SPSE project, provenance information from two consecutive phases, namely gene knockout and strain creation phases, can be merged using this query operator.
The query operators are defined in terms of a "search pattern template" composed of Provenir ontology classes and properties (the query operators are defined using formal notation in ). The query operators use the standard RDFS entailment rules  to expand the query pattern and can be executed against the instance base of any (Provenir-ontology based) domain-specific provenance ontology. The formal definition of these query operators is described in . In addition, the query operators can be extended to create new query operators and can be implemented in either SQL or SPARQL.
The SPF was implemented as a scalable provenance query engine that can be deployed over any RDF database that supports standard RDFS entailment rules .
Provenance Query Engine
1. A Query Composer
The query composer maps the provenance query operators to SPARQL syntax according to semantics of the query operators.
2. A Function to Compute Transitive Closure over RDF
SPARQL query language does not support transitive closure for an RDF <node, edge> combination. Hence, we have implemented a function to efficiently compute transitive closure using the SPARQL ASK function. The output of this function together with the output of the query composer is used to compose the complete query pattern.
3. Query Optimizer using Materialized Provenance Views
Using a new class of materialized views based on the Provenir ontology schema called Materialized Provenance Views (MPV) a query optimizer has been implemented that enables the query engine to scale with very large RDF data sets.
The query operators are implemented taking into account the distinct characteristics of provenance queries as well as existing provenance systems. For example, provenance information represents the complete history of an entity and is defined by the exhaustive set of dependencies among data, process, and agent. However, in real world scenarios the provenance information available can be incomplete due to application-specific or cost-based limitations. Hence, a straightforward mapping of provenance query operators to SPARQL as a Basic Graph Pattern (BGP) is not desirable, since the BGP-based query expression pattern may not return a result in the presence of incomplete provenance information . Hence, the OPTIONAL function in SPARQL can be used to specify query expression patterns that can succeed with partial instantiation, yielding maximal "best match" result graph. Another challenge in implementation of the query engine was that unlike many graph database query languages such as Lorel or GraphLog, , SPARQL does not provide an explicit function for transitive closure to answer reachability queries (http://www.w3.org/2001/sw/DataAccess/issues#accessingCollections). Reachability queries involving computation of transitive closure is an important characteristic of provenance queries to retrieve the history of an entity beginning with its creation. In case of the provenance query engine, the query composer computes the transitive closure over the <process, preceded_by> combination to retrieve all individuals of the process class linked to the input value by the preceded_by property.
Transitive Closure Module
The evaluation of the provenance query engine followed the standard approach in database systems  and was performed for both "expression complexity" - SPARQL query patterns with varying levels of complexity, and "data complexity" - varying sizes of RDF datasets. The SPARQL query complexity was measured using the total number of variables, triples, use of OPTIONAL function, and levels of nesting in the query pattern . The most complex query pattern had 73 variables, 206 triples, and 7 levels of nesting using the OPTIONAL function. Further, to evaluate the data complexity, five different sized datasets were used ranging from 32,000 RDF triples to 308 million RDF triples. We found that a straightforward implementation of the query engine was not able to scale with both increasing expression and data complexity . Hence, the provenance query engine uses a novel materialization strategy based on the Provenir ontology schema, called materialized provenance views (MPV) . The use of MPV improved the performance of the query engine by an average of 99.93% for increasingly complex SPARQL query patterns and by an average of 98.95% for increasingly large RDF datasets, thereby validating the scalability of the query engine. We now describe a few example provenance queries in context of the T.cruzi SPSE project that leverage the SPF query infrastructure.
Provenance queries in the T.cruzi SPSE
Retrieving the history of experiment results to ensure quality and reproducibility of data. In addition, the provenance information is used to describe the experiment conditions of results published in literature
Keeping track of experiment resources during an ongoing project or auditing of resources used in a completed project. This helps project managers to monitor status of projects and ensure optimal use of lab resources
We consider the following two example provenance queries representing the above two categories of usage:
Query 1: Find the drug and its concentration that was used during drug selection process to create "cloned_sample66."
Query 2: What is the status of knockout plasmid construction step to create pTrex? Query 1 illustrates the retrieval of provenance information associated with a cloned sample, where the type of drug and concentration of the drug are important for researchers to understand the characteristics of the cloned sample. Similarly, Query 2 describes a provenance query used for project management, where the lead researcher or project manager can keep track of the project status. Both these example queries are answered using the provenance () query operator, which takes as input "cloned_sample66" and "pTrex" as input values respectively.
As described earlier, the provenance () query operator, implemented in the provenance query engine, automatically generates a SPARQL query pattern using the PEO schema as reference. This query pattern is executed against the T.cruzi SPSE RDF instance base and the retrieved results are represented as a RDF graph (which can be used by any Semantic Web visualization tool, for example Exhibit  or in the Cuebee query interface ). Similar to our earlier work , the results of the above queries were manually validated by domain researchers in the Tarleton research group. In the next section, we describe provenance queries used in the BKR project.
Provenance query in the BKR Project
The provenance queries in the BKR project are used for identifying the source of an extracted RDF triple, retrieving temporal information (for example, the date of publication of a source article), version information for a database, and the confidence value associated with a triple (indicated by a text mining tool). The provenance information is essential in the BKR project to ensure the quality of data and associate trust value with the RDF triple. We discuss the following two example provenance queries used in the BKR project:
Query 1: Find all documents asserting the triple "IL-13 → inhibits → COX-2"
Query 2: Find all triples of the form "IL-13 → inhibits → gene" where value of gene is not known apriori. The results are filtered based on a set of provenance constraints such that results are only from (a) journals with impact factor > 5, (b) journal published after the year 2007, (c) RDF triples with confidence value > 8.
Query 1 is used by the enhanced information retrieval service in the BKR project, which supports user query based on not only keyword or concepts, but also relations . Hence, results from Query 1 are used to create a basic index, similar to traditional search engines, listing all documents from which a given biomedical assertion is extracted . In contrast to Query 1, Query 2 is used by the Question Answering service of the BKR project to define provenance-based quality constraints to retrieve results from reputable journals that have been published recently and a high confidence value is associated with the extracted RDF triple. Both the provenance queries are expressed in SPARQL and executed against the BKR instance base. In our earlier work, we have discussed the improved performance of provenance queries using the PaCE approach in comparison to the RDF reification vocabulary .
In both the T.cruzi SPSE and BKR project, the SPF provides users with an easy to use, expressive, and scalable provenance query infrastructure that can scale with increasing size of data and complexity of the queries [16, 43].
We first discuss related work in provenance representation in context of the Provenir ontology. Next, we discuss the work in database provenance and workflow provenance with respect to provenance query/analysis and compare it with the functionality of the provenance query operators defined in SPF.
Multiple provenance representation models have been proposed, with the Open Provenance Model (OPM)  and the proof markup language (PML)  being the two prominent projects. As part of the W3C Provenance Incubator Group, we have defined a lightweight mapping between the OPM and other provenance models including the Provenir ontology, which demonstrates that all three of them model similar classes, but only the Provenir ontology has a comprehensive set of named relationships linking the provenance classes . Specifically, OPM (core specification) models only "causal relations" linking provenance entities , which makes it difficult for OPM to model partonomy, containment, and non-causal participatory provenance properties needed in many translational research applications. Provenance representations, in the context of relational databases, extend the relational data model with annotations , provenance and uncertainty , and semirings of polynomials . Provenir ontology can be extended to model the provenance of tuple(s) in relational databases, which relies on mappings defined between description logic to relational algebra .
Database provenance or data provenance, often termed as "fine-grained" provenance, has been extensively studied in the database community. Early work includes the use of annotations to associate "data source" and "intermediate source" with data (polygen model) in a federated database environment to resolve conflicts , and use of "attribution" for data extracted from Web pages . More recent work has defined database provenance in terms of "Why provenance," "Where provenance,"  and "How provenance" . "Why provenance", introduced in , describes the reasons for the presence of a value in the result (of a query in a relational database context) and "Where provenance" describes the source location of a value . A restricted view of the "Where provenance" identifies each piece of input data that contributes to a given element of the result set returned by each database query. We use the syntactic definition of "Why provenance"  that defines a "proof" for a data entity. The proof consists of a query, representing a set of constraints, over a data source with "witness" values that result in a particular data output. The semantics of the provenance () query operator closely relates to both "Where provenance" and "Why provenance" .
To address the limitation of "Why provenance" that includes "...set of all contributing input tuples" leading to ambiguous provenance,  introduced semiring-based "How provenance." The provenance () query operator over a "weighted" provenance model, which reflects the individual contribution of each component (for example process loops or repeated use of single source data), is comparable to "How provenance."
The Trio project  considers three aspects of lineage information of a given tuple, namely, how was a tuple in the database derived along with a time value (when) and the data sources used. A subset of queries in Trio, "lineage queries", discussed in , can be mapped both as provenance () and as provenance_context () query operators depending on the input value.
The rapid adoption of scientific workflows to automate scientific processes has catalyzed a large body of work in recording provenance information for the generated results. Simmhan et al.  survey different approaches for collection, representation, and management of workflow provenance. Recent work has also recognized the need for inclusion of domain semantics in the form of domain-specific provenance metadata  along with workflow provenance . The semantics of these projects can be mapped to the provenance () query operator.
In our previous work [14, 15], we have separately addressed some of the issues in pre- and post-publications phases of translational research applications. Here we expand on the challenges in creating a unified framework for provenance management, with a focus on a dedicated infrastructure for effective provenance collection, a flexible provenance model, and a scalable query implementation that can be adopted across translational research projects.
What does it take to build an effective provenance management system for translational research today? It is clear from the work discussed in this paper that creation of a practical and usable provenance management system is not a trivial task. Though provenance represents critical information for research projects, the high threshold in terms of resources required deters widespread adoption of a systematic and comprehensive provenance infrastructure. In addition, the lack of provenance-specific standards makes it difficult for developers to implement interoperable provenance systems across projects, applications, and different phases of the translational research lifecycle. This current state of provenance systems forces researchers to create ad-hoc systems that cannot be re-used, extended, or adapted to changing project requirements.
Hence, we have deliberately aligned the implementation of the SPF components with existing W3C Semantic Web standards, including RDF, OWL, and SPARQL. Though, these standards are not tailored for the specific requirements of provenance systems, we demonstrated that they can be extended and adopted to address some of the challenges. For example, a component of the provenance query engine uses SPARQL ASK function to compute transitive closure over RDF graphs, since SPARQL does not have explicit support for computing transitive closure. Despite some advantages of using existing Semantic Web standards, provenance management in context of translational research is still in an early phase.
How are things likely to improve in the future? The W3C provenance incubator group (Provenance XG)  has collected an extensive set of use cases and requirements for effective provenance management. This work has led to the creation of the W3C Provenance Working Group, which has been mandated to define a language for exchanging provenance information across applications . In addition, the working group will also define a mechanism for querying and accessing the provenance information along with a set of best practices that can be used to guide implementation of provenance systems . We are members of the working group and we plan to make the SPF compatible with the standards that will be proposed by the working group.
We described a unified framework based on the upper-level Provenir provenance ontology for managing provenance information during generation of data from bench experiments and their subsequent use (post-publication) by data mining and knowledge discovery applications. In the process, we identified that both the pre and post-publication phases of translational research have a common set of stages associated with the provenance metadata that can be managed by the SPF. Using two exemplar projects, corresponding to the two translational research phases, we described how the SPF could handle provenance collection, representation, storage/propagation, and query/analysis.
As part of our future work, we will implement a "lifting mechanism" between contexts to allow easier transformation of RDF triple between different PaCE-based applications. In addition, we aim to specialize the existing provenance query operators to interface with distributed SPARQL end-points, which have been proposed for provenance access and querying by the W3C Provenance Working Group.
This research was supported in part by the NIH RO1 Grant# 1R01HL087795-01A1 and the Intramural Research Program of the National Institutes of Health (NIH), National Library of Medicine (NLM).
- Mehra RSK, Blackwell T, Ancoli Israel A, Dam T, Stefanick M, Redline S: Prevalence and Correlates of Sleep-Disordered Breathing in Older Men: the MrOS Sleep Study. J Am Gerontol Soc 2007, 55(9):1356–1364.View ArticleGoogle Scholar
- Sahoo SS, Thomas C, Sheth A, York WS, Tartir S: Knowledge modeling and its application in life sciences: a tale of two ontologies. Proceedings of the 15th international Conference on World Wide Web WWW '06 May 23 - 26 2006; Edinburgh, Scotland 2006, 317–326.Google Scholar
- Bodenreider O: Provenance information in biomedical knowledge repositories - A use case. In Proceedings of the First International Workshop on the role of Semantic Web in Provenance Management (SWPM 2009). Volume 526. Edited by: Freire J, Missier P, Sahoo SS. Washington D.C, USA: CEUR; 2009.Google Scholar
- Weatherly B, Atwood J, Minning T, Cavola C, Tarleton R, Orlando R: A heuristic method for assigning a false-discovery rate for protein identifications from Mascot database search results. Mol Cell Proteomics 2005, 4(6):762–772. 10.1074/mcp.M400215-MCP200View ArticlePubMedGoogle Scholar
- Semantics and Services enabled Problem Solving Environment for Tcruzi[http://www.knoesis.org/research/semsci/application_domain/sem_life_sci/tcruzi_pse/]
- Martin DLWD, Laucella SA, Cabinian MA, Crim MT, Sullivan S, Heiges M, Craven SH, Rosenberg CS, Collins MH, Sette A, Postan M, Tarleton RL: CD8+ T-Cell responses to Trypanosoma cruzi are highly focused on strain-variant trans-sialidase epitopes. PLoS Pathog 2006., 2(8):Google Scholar
- Parkinson HSU, Kolesnikov N, Abeygunawardena N, Burdett T, Dylag M, Emam I, Farne A, Hastings E, Holloway E, Kurbatova N, Lukk M, Malone J, Mani R, Pilicheva E, Rustici G, Sharma A, Williams E, Adamusiak T, Brandizi M, Sklyar N, Brazma A: ArrayExpress update--an archive of microarray and high-throughput sequencing-based functional genomics experiments. Nucleic Acids Res 2011, (39 Database):1002–1004.Google Scholar
- Wheeler DL, Barrett T, Benson DA, Bryant SH, Canese K, Chetvernin V, Church DM, DiCuccio M, Edgar R, Federhen S, et al.: Database resources of the National Center for Biotechnology Information. Nucleic Acids Res 2007, (35 Database):D5–12.Google Scholar
- Bodenreider O, Rindflesch TC: Advanced library services: Developing a biomedical knowledge repository to support advanced information management applications. Bethesda, Maryland: Lister Hill National Center for Biomedical Communications, National Library of Medicine; 2006.Google Scholar
- Kanehisa M, Goto S, Kawashima S, Okuno Y, Hattori M: The KEGG resources for deciphering the genome. Nucleic Acids Res 2004, 32: D277-D280. 10.1093/nar/gkh063PubMed CentralView ArticlePubMedGoogle Scholar
- Manola F, Miller E, (Eds): RDF Primer. W3C Recommendation 2004. [http://www.w3.org/TR/rdf-primer/]Google Scholar
- Hayes P: RDF Semantics. W3C Recommendation 2004. [http://www.w3.org/TR/rdf-mt/#defentail]Google Scholar
- Klyne G, Carroll JJ: Resource Description Framework (RDF): Concepts and Abstract Syntax. W3C Recommendation 2004.Google Scholar
- Sahoo SS, Weatherly DB, Muttharaju R, Anantharam P, Sheth A, Tarleton RL: Ontology-driven Provenance Management in eScience: An Application in Parasite Research. In The 8th International Conference on Ontologies, DataBases, and Applications of Semantics (ODBASE 09): 2009; Vilamoura, Algarve-Portugal. Springer Verlag; 2009:992–1009.Google Scholar
- Sahoo SS, Bodenreider O, Hitzler P, Sheth A, Thirunarayan K: Provenance Context Entity (PaCE): Scalable Provenance Tracking for Scientific RDF Data. In Knoesis Center, Technical Report. Wright State University; 2010.Google Scholar
- Sahoo SS: Semantic Provenance: Modeling, Querying, and Application in Scientific Discovery. Wright State University; 2010.Google Scholar
- Hitzler P, Krötzsch #o;M, Parsia B, Patel-Schneider PF, Rudolph S: OWL 2 Web Ontology Language Primer. W3C Recommendation. 2009.Google Scholar
- Prud'ommeaux E, Seaborne A: SPARQL Query Language for RDF. W3C Recommendation 2008. [http://www.w3.org/TR/rdf-sparql-query]Google Scholar
- Basic Formal Ontology (BFO)[http://www.ifomis.org/bfo/]
- Smith B, Ceusters W, Klagges B, Kohler J, Kumar A, Lomax J, Mungall C, Neuhaus F, Rector AL, Rosse C: Relations in biomedical ontologies. Genome Biol 2005, 6(5):R46. 10.1186/gb-2005-6-5-r46PubMed CentralView ArticlePubMedGoogle Scholar
- Sirin E, Parsia B, Cuenca Grau B, Kalyanpur A, Katz Y: Pellet: A practical OWL-DL reasoner. Journal of Web Semantics 2007., 2(5):Google Scholar
- Brickley D, Guha RV: RDF Schema. W3C Recommendation 2004. [http://www.w3.org/TR/rdf-schema/]Google Scholar
- Oberle D, Ankolekar A, Hitzler P, Cimiano P, Schmidt C, Weiten M, Loos B, Porzel R, Zorn H-P, Micelli V, Sintek M, Kiesel M, Mougouie B, Vembu S, Baumann S, Romanelli M, Buitelaar P, Engel R, Sonntag D, Reithinger N, Burkhardt F, Zhou J: DOLCE ergo SUMO: On Foundational and Domain Models in SWIntO (SmartWeb Integrated Ontology). Journal of Web Semantics: Science, Services and Agents on the World Wide Web 2007.Google Scholar
- Niles I, Pease A: Towards a Standard Upper Ontology. 2nd International Conference on Formal Ontology in Information Systems (FOIS-2001): October 17–19 2001; Ogunquit, Maine 2001.Google Scholar
- Gangemi A, Guarino N, Masolo C, Oltramari A, Schneider L: Sweetening Ontologies with DOLCE. In 13th International Conference on Knowledge Engineering and Knowledge Management Ontologies and the Semantic Web: 2002; Siguenza, Spain. Springer Verlag; 2002:166–181.View ArticleGoogle Scholar
- Bizer C, Cyganiak R: D2RQ -- Lessons Learned. W3C Workshop on RDF Access to Relational Databases Cambridge, USAGoogle Scholar
- Eilbeck KLSE, Mungall CJ, Yandell M, Stein L, Durbin R, Ashburner M: The Sequence Ontology: A tool for the unification of genome annotations. Genome Biology 2005., 6(5):Google Scholar
- Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, Harris MA, Hill DP, Issel-Tarver L, Kasarskis A, Lewis S, Matese JC, Richardson JE, Ringwald M, Rubin GM, Sherlock G: Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet 2000, 25(1):25–29. 10.1038/75556PubMed CentralView ArticlePubMedGoogle Scholar
- Hobbs JR, Pan F: Time Ontology in OWL. W3C Working Draft 2006.Google Scholar
- Cross V, Stroe C, Hu X, Silwal P, Panahiazar M, Cruz IF, Parikh P, Sheth A: Aligning the Parasite Experiment Ontology and the Ontology for Biomedical Investigations Using AgreementMaker. In International Conference on Biomedical Ontologies (ICBO). Buffalo NY; 2011:125–133.Google Scholar
- Aurrecoechea C, Heiges M, Wang H, Wang Z, Fischer S, Rhodes P, Miller J, Kraemer E, Stoeckert CJ, Roos DS Jr, Kissinger JC: ApiDB: integrated resources for the apicomplexan bioinformatics resource center. Nucleic Acids Research 2007, 35(D):427–430. 10.1093/nar/gkl880View ArticleGoogle Scholar
- Bodenreider O: The Unified Medical Language System (UMLS): integrating biomedical terminology. Nucleic Acids Res 2004, (32 Database):267–270.Google Scholar
- Ayers A, Völkel M: Cool URIs for the Semantic Web. In Woking Draft Edited by: Sauermann L, Cyganiak, R.: W3C. 2008.Google Scholar
- Chapman AP, Jagadish HV, Ramanan P: Efficient provenance storage. In ACM SIGMOD international Conference on Management of Data: June 09 - 12, 2008 2008; Vancouver, Canada. ACM, New York, NY; 2008:993–1006.Google Scholar
- Muniswamy-Reddy KKSM: Provenance as First-Class Cloud Data. 3rd ACM SIGOPS International Workshop on Large Scale Distributed Systems and Middleware (LADIS'09) 2009.Google Scholar
- Widom J: Trio: A System for Integrated Management of Data, Accuracy, and Lineage. Second Biennial Conference on Innovative Data Systems Research (CIDR '05): January 2005; Pacific Grove, California 2005.Google Scholar
- Sahoo SS, Sheth A, Henson C: Semantic Provenance for eScience: Managing the Deluge of Scientific Data. IEEE Internet Computing 2008, 12(4):46–54.View ArticleGoogle Scholar
- Vardi M: The Complexity of Relational Query Languages. 14th Ann ACM Symp Theory of Computing (STOC '82): 1982 1982, 137–146.Google Scholar
- Angles R, Gutierrez C: Survey of graph database models. ACM Comput Surv 2008, 40(1):1–39.View ArticleGoogle Scholar
- Taylor CF, Paton NW, Garwood KL, Kirby PD, Stead DA, Yin Z, Deutsch EW, Selway L, Walker J, Riba-Garcia I, Mohammed S, Deery MJ, Dunkley T, Aebersold R, Kell DB, Lilley KS, Roepstorff P, Yates JR, Brass A, Brown AJ, Cash P, Gaskell SJ, Hubbard SJ, Oliver SG: A systematic approach to modeling, capturing, and disseminating proteomics experimental data. Nat Biotechnol 2003, 21(3):247–254. 10.1038/nbt0303-247View ArticlePubMedGoogle Scholar
- Pérez J, Arenas M, Gutiérrez C: Semantics and Complexity of SPARQL. Int'l Semantic Web Conf (ISWC '06): 2006; Athens, GA 2006, 30–43.View ArticleGoogle Scholar
- Exhibit: Publishing Framework for Data-Rich Interactive Web Pages[http://www.simile-widgets.org/exhibit/]
- Asiaee AH, Doshi P, Minning T, Sahoo SS, Parikh P, Sheth A, Tarleton RL: From Questions to Effective Answers: On the Utility of Knowledge-Driven Querying Systems for Life Sciences Data. Knoesis Center Technical Report 2011.Google Scholar
- Open Provenance Model[http://openprovenance.org/]
- McGuinness DL, Pinheiro da Silva P: Explaining Answers from the Semantic Web: The Inference Web Approach. Journal of Web Semantics 2004, 1(4):397–413. 10.1016/j.websem.2004.06.002View ArticleGoogle Scholar
- W3C Provenance Incubator Group Wiki[http://www.w3.org/2005/Incubator/prov/wiki/Main_Page]
- Chiticariu L, Vijayvargiya G: DBNotes: a post-it system for relational databases based on provenance. In ACM SIGMOD international Conference on Management of Data: 2005; Baltimore, Maryland. ACM, New York, NY; 2005:942–944.View ArticleGoogle Scholar
- Green TJ, Tannen V: Provenance Semirings. ACMSIGMOD-SIGACTSIGART Symposium on Principles of database systems (PODS): 2007 2007, 675–686.Google Scholar
- Borgida A: Description Logics in Data Management. IEEE Transactions on Knowledge and Data Engineering 1995, 7(5):671–682. 10.1109/69.469829View ArticleGoogle Scholar
- Wang YR, Madnick SE: A Polygen Model for Heterogeneous Database Systems: The Source Tagging Perspective. 16th VLDB Conference 1990 1990, 519–538.Google Scholar
- Lee T, Bressan S: Multimodal Integration of Disparate Information Sources with Attribution. Entity Relationship Workshop on Information Retrieval and Conceptual Modeling 1997.Google Scholar
- Buneman P, Khanna S, Tan WC: Why and Where: A Characterization of Data Provenance. 8th International Conference on Database Theory: 2001 2001, 316–330.View ArticleGoogle Scholar
- Cui Y, Widom J: Practical Lineage Tracing in Data Warehouses. 16th ICDE: 2000; San Diego, California: IEEE Computer Society 2000.Google Scholar
- Simmhan YL, Plale AB, Gannon AD: A survey of data provenance in e-science. SIGMOD Rec 2005, 34(3):31–36. 10.1145/1084805.1084812View ArticleGoogle Scholar
- Zhao J, Sahoo SS, Missier P, Sheth A, Goble C: Extending semantic provenance into the web of data. IEEE Internet Computing 2011, 15(1):40–48.View ArticleGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.