Ontology is the science of what is, of the kinds and structures of objects, properties, events, processes and relations in every area of reality [12]. As applied in the biomedical domain, ontology plays a key role in providing consensus-based controlled vocabularies serving the consistent annotation of biological and medical data and information, most conspicuously within the framework of the Gene Ontology (GO) [7] and now of its sister ontologies within the Open Biomedical Ontologies (OBO) Foundry (http://obofoundry.org). We believe that an approach to the analysis of saliva in terms of a controlled structured vocabulary and a common set of measurement data elements developed along OBO Foundry lines can provide a cost-effective way to support the coordinated screening of large populations, yielding data that can be aggregated for statistical purposes, for example in the context of meta-analysis.
Currently, ontologies support data integration primarily through data annotation (or 'tagging'), including the annotation of data reported in the peer-reviewed scientific literature [13]. While the value of such data annotation has been demonstrated in molecular and model organism biology and in the analysis of gene expression data [14–16], the potential of ontology-based annotation in the clinical domain has been largely unrealized due to limitations in current ontology development practices, including:
1. Most ontologies employ only a few well-defined relations, primarily the is_a (e.g. heart is_a organ) and part_of (e.g. aortic valve part_of heart) relations, and they only relate terms within a single taxonomy [17, 18]. This results in an inability to capture higher levels of biological complexity.
2. Most ontologies and terminology artifacts lack a sound logical underpinning and rest on mixed modes of classification and inadequate formal definitions, resulting in an inability to support sophisticated computation [19–24].
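To make the first limitation concrete, the following is a minimal sketch, not drawn from any OBO tool, of how is_a and part_of assertions can be stored as triples and queried transitively. The two triples are the examples given above; the function names `build_index` and `reachable` are illustrative assumptions.

```python
from collections import defaultdict

# Toy triples taken from the two examples in the text.
TRIPLES = [
    ("heart", "is_a", "organ"),
    ("aortic valve", "part_of", "heart"),
]


def build_index(triples):
    """Index the triples by (subject, relation) for quick lookup."""
    index = defaultdict(set)
    for subject, relation, obj in triples:
        index[(subject, relation)].add(obj)
    return index


def reachable(index, term, relation):
    """Return every term reachable from `term` by following one relation transitively."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for target in index.get((current, relation), ()):
            if target not in seen:
                seen.add(target)
                stack.append(target)
    return seen


INDEX = build_index(TRIPLES)
print(reachable(INDEX, "heart", "is_a"))
print(reachable(INDEX, "aortic valve", "part_of"))
print(reachable(INDEX, "aortic valve", "is_a"))
```

In this sketch each relation is traversed in isolation, so a question that combines the two (for instance, whether the aortic valve is part of some organ) cannot be answered without additional rules for composing relations. Formally defined relations of the kind discussed in this section are intended to supply such rules.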
To address these and related shortfalls, the OBO Foundry was created in 2006 by a group of developers of OBO ontologies on the basis of an evolving set of principles designed to foster best practice in ontology development. The first list of ontologies satisfying OBO Foundry peer review was released in April 2010. OBO ontologies are designed to represent in an interoperable fashion the biomedical reality from which data are sampled. Their development within the framework of a common top-level ontology (Basic Formal Ontology [25]) and their consistent employment of a common set of relations [10] allow Foundry ontologies to be used together as interoperable modules within an evolving larger network. The relations themselves are formalized in such a way as to ensure support for sophisticated computation both within and across ontologies [10].
We capitalize on and contribute to the OBO Foundry initiative in this work. One important element therein is the distinction between reference and application ontologies [26]. The former correspond in medicine to the basic biomedical sciences such as anatomy and physiology. The latter correspond to the clinical specialisms and sub-specialisms, for example pediatric surgery or radiation oncology. Just as the clinical specialisms draw on the methods, theories and terminologies of the basic biomedical sciences for a variety of purposes, including the education of clinicians and the formulation of clinical research hypotheses, so, within the OBO Foundry framework, application ontologies draw on reference ontologies to serve as feeders of lexically simpler terms (such as 'protein' or 'disease') to be used in the construction of the more specialized compound terms of which the application ontologies are composed.
SALO is in this sense an application ontology. It draws primarily on four reference ontologies - Protein Ontology (PRO), Gene Ontology (GO), Chemical Entities of Biological Interest (CHEBI) and Ontology for Biomedical Investigations (OBI) - which are described in more detail below. All terms in SALO, other than those created anew because they relate specifically to the SALO domain, will as far as possible be derived from these sources. Thus, for example, all protein terms in SALO will be taken from the Protein Ontology. Where PRO does not have the needed terms, requests for their inclusion in PRO will be submitted to the PRO tracker [8].
The Protein Ontology (PRO)
The Protein Ontology Consortium, led by researchers affiliated with the Universal Protein Knowledgebase (UniProt, http://www.pir.uniprot.org/), developed the PRO framework [8] with two axes of classification, based, respectively, on the protein structural units of domains, and on full-length protein sequences and their modifications. This second axis represents the various protein entities (such as splice variants, cleavage products) that can derive from a single gene.
Because proteins themselves are combinations of domains with additional sequence, the two axes of classification are related via the has_part relation. We are collaborating with PRO's developers in the curation of those sections of PRO relating to those proteins which are of primary interest to the saliva domain. We will also participate in PRO dissemination activities in order to expand the community of users of both SALO and SKB.
The Gene Ontology (GO)
The Gene Ontology (GO) project is a collaborative effort to develop and use ontologies to support biologically meaningful annotation of genes and their products in a wide variety of organisms. Major model organism databases and other bioinformatics resource centers contribute to the project [27]. The GO provides a systematic language for the consistent description of attributes of genes and gene products in three key biological domains that are shared by all organisms: molecular function, biological process and cellular component. GO's value derives in large part from the fact that it has been utilized for the systematic annotation by trained biologist-curators of experimental results pertaining to multiple species of organisms and communicated in the peer-reviewed scientific literature. Some 50,000 journal articles have been annotated in this way, and their content has thereby been made accessible to computer-aided discovery. We will collaborate with GO's developers in the curation of those sections of GO relating to gene products of primary interest to the saliva domain.
Chemical Entities of Biological Interest (CHEBI)
Chemical Entities of Biological Interest (CHEBI) is a freely available ontology of molecular entities focused on 'small' chemical compounds. The molecular entities in question are either natural products or synthetic products used to intervene in the processes of living organisms. Genome-encoded macromolecules (nucleic acids, proteins and peptides derived from proteins by cleavage) are not as a rule included. In addition to molecular entities, CHEBI contains what are called 'groups' (parts of molecular entities) and classes of entities. CHEBI includes an ontological classification whereby the relationships between molecular entities or classes of entities and their is_a parents and children are specified. CHEBI is available online at http://www.ebi.ac.uk/chebi/ [9]. We will collaborate with CHEBI's developers in the curation of those sections of CHEBI relating to the chemical compounds of primary interest in the saliva domain.
Ontology for Biomedical Investigations (OBI)
The Ontology for Biomedical Investigations (OBI) addresses the need for controlled vocabularies to support integration of experimental data, a need originally identified in the transcriptomics domain by the Microarray Gene Expression Data Society (MGED), which developed the MGED Ontology as an annotation resource for microarray data. In response to the recognition of convergent needs in areas such as protein and metabolite characterization, this effort was broadened to become what was initially known as FuGO (Functional Genomics Investigation Ontology) - the ontology associated with the FUGE (Functional Genomics Experiment) data model [28]. The coverage of FuGO was then further expanded in 2006 to include clinical trials and epidemiological studies, biomedical imaging and a variety of further experimentation domains, to become what is today OBI, an ontology designed to serve the coordinated representation of designs, protocols, instrumentation, materials, processes, data and types of analysis in all areas of biological and biomedical investigation. Twenty-five groups are now involved in building OBI, deriving from all areas of omics research, and the Foundry discipline, including the BFO (Basic Formal Ontology) top-level framework, has proven essential to its distributed development [6]. OBI is used in our work as a source for ontological representation of biomarkers and related terms pertaining to sample collection and to diagnostic and experimental uses of saliva, as well as to associated protocols, instrumentation, statistical methods, and so forth.
SNOMED CT
Since SALO is designed for use in support of clinical research and treatment, it is important that it be aligned as closely as possible with the SNOMED® Systematized Nomenclature of Medicine, which is designed to provide the terminology needed to code the entire medical record. The current version of SNOMED is SNOMED CT (for 'Clinical Terms'), which is maintained by the International Health Terminology Standards Development Organization (IHTSDO) in Copenhagen. At its simplest, SNOMED CT is a controlled vocabulary of expressions used in healthcare reporting, as for example in an electronic health record. 'Controlled' means that the content of the terminology is designed to provide a well-managed non-redundant set of codes and associated expressions to ensure consistency of clinical coding. Quality assurance procedures are in place, which are designed to ensure that the terminology is structurally sound, biomedically accurate and consistent with current practice.
Unfortunately, while SNOMED is built around an evolving core vocabulary that is largely the work of the College of American Pathologists (CAP), it has been combined at different times, and in various ways, with other terminological resources, deriving mainly from the UK. The result is that, even after considerable efforts on the part of the new IHTSDO organization, SNOMED remains a terminology resource that is marked by multiple redundancies and associated inconsistencies of coding [29]. It is for this reason that we did not utilize SNOMED content in constructing SALO, but rather are working to ensure alignment between SNOMED CT and the results of our work on the Saliva Ontology by incorporating SNOMED CT terms where needed. At the same time, we will submit all clinically relevant new saliva-related terminology content created within the SALO framework to the IHTSDO Content Committee with a recommendation for inclusion in future versions of SNOMED CT.
Saliva and Ontology
No dedicated ontology has thus far been defined in direct relation to oral biological fluids [30], and the treatment of saliva in ontology and terminology resources has thus been insufficient for purposes of saliva research. SNOMED CT returns 39 records for the search term 'saliva', including 'saliva (substance)', 'normal saliva (finding)', and 'saliva-induced contact dermatitis (disorder)'. Saliva (substance) is asserted in the SNOMED CT concept hierarchy to be a digestive system fluid, which is in turn a body fluid. In the Foundational Model of Anatomy ontology (FMA), saliva is a subordinate of portion of secreted substance; no definition is provided [17]. Given the intention of the IHTSDO to align the SNOMED treatment of anatomy (including bodily substances) with that of the FMA, and given our existing collaboration with both the IHTSDO editorial community and the FMA's developers, we will work with both communities to create, through SALO, a more detailed representation of the ontology of this bodily fluid that is optimized to meet the needs of both the clinical diagnostic community and the cross-disciplinary community of omics researchers.
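The two placements of saliva described above can be compared side by side with a small sketch. It assumes, purely for illustration, that each resource is reduced to a single-parent lookup table containing only the links reported in the text. The dictionary and function names are hypothetical, and the real SNOMED CT and FMA hierarchies are far larger and allow multiple parents.

```python
# Parent links as reported in the text; every other term is omitted.
SNOMED_CT = {
    "saliva (substance)": "digestive system fluid",
    "digestive system fluid": "body fluid",
}
FMA = {
    "saliva": "portion of secreted substance",
}


def ancestors(hierarchy, term):
    """Walk the single-parent links upward and return the chain of ancestors."""
    chain = []
    while term in hierarchy:
        term = hierarchy[term]
        chain.append(term)
    return chain


print(ancestors(SNOMED_CT, "saliva (substance)"))
print(ancestors(FMA, "saliva"))
```

Even at this level of simplification the two chains share no common ancestor label, which illustrates why an explicit alignment between the resources is needed before data annotated with one can be combined with data annotated with the other.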
Results similar to those obtained from the analysis of SNOMED apply also to other terminology resources. The Cyc ontology (which contains hundreds of thousands of terms in all domains) [31] describes Saliva as 'A Type of: bodily secretion and liquid', and asserts merely that it is an instance of tangible stuff type.
In WordNet, saliva is defined as a clear liquid secreted into the mouth by the salivary glands and mucous glands of the mouth; it is asserted that saliva moistens the mouth and starts the digestion of starches [32].