Semantic representation of monogenean haptoral Bar image annotation
BMC Bioinformatics volume 14, Article number: 48 (2013)
Digitised monogenean images are usually stored in file system directories in an unstructured manner. In this paper we propose a semantic representation of these images in the form of a Monogenean Haptoral Bar Image (MHBI) ontology, in which the images are annotated with taxonomic classification, diagnostic hard part and image properties. The data we used are mainly of monogenean species found in fish, so we built a simple Fish ontology to demonstrate how the host (fish) ontology can be linked to the MHBI ontology. This will enable linking of information from the monogenean ontology to the host species found in the fish ontology without changing the underlying schema for either of the ontologies.
In this paper, we utilized the Taxonomic Data Working Group Life Sciences Identifier (TDWG LSID) vocabulary to represent our data and defined a new vocabulary specific to annotating monogenean haptoral bar images, in order to develop the MHBI ontology and the merged MHBI-Fish ontologies. These ontologies were successfully evaluated using five criteria: clarity, coherence, extendibility, ontology commitment and encoding bias.
In this paper, we show that unstructured data can be represented in a structured form using semantics. In the process, we have come up with a new vocabulary for annotating the monogenean images with textual information. The proposed monogenean image ontology will form the basis of a monogenean knowledge base to assist researchers in retrieving information for their analysis.
Over the years, we have been collating information on monogeneans found in Malaysian waters, digitizing the images and storing them as unstructured data. These images, which were extracted from journal publications, are meaningless without their textual annotations. Contemporary approaches to organizing image data and the corresponding textual descriptions use either relational database technologies or XML technologies. For example, Biota, InsideWood and MonoDb used relational database technology, while the Open Microscopy Environment (OME) Data Model and XML File, knowledge-based grid services for high-throughput biological imaging, and Plazi utilised XML technology. Annotations of images in a relational database are confined by the number of columns used for the descriptions of the images. The number of characters allowed in a cell of a database table is also fixed. Any new inclusion into an existing relational model with fixed tables and a fixed set of fields may require a new schema to be developed and existing queries to be revised. Migration to a new schema and revision of queries can be very cumbersome and time consuming. Large numbers of images stored in a database take up a lot of space and create a huge database file, affecting retrieval time. Storing images outside the database file in a directory and linking them via identifiers in a database column is a possible solution, but here again any new inclusion of data will require a change in identifiers [7, 8]. XML is normally used to describe and structure data. Annotations of images in XML are not linked and hence the relationships between objects are not expressed.
Semantics is needed to organize data by focusing on the meaning of objects and expressing the relationships between them. It provides the necessary vocabularies to link different data entities to their properties.
In this paper, we intend to semantically annotate the haptoral bars of the monogenean species in a structured manner with their textual information or descriptions (see Figure 1) for retrieval purposes.
Data used in this paper are images of the monogenean haptoral bars along with textual information which consists of: (1) taxonomic classification and (2) description of properties of an image found in publications (see Figure 1). The data is analysed and structured into main concepts. Defining these concepts using a standard structured vocabulary is necessary to make sure the meaning of the data is clear and explicit, thus facilitating data sharing and maximizing reusability in a wide variety of contexts.
The Taxonomic Data Working Group (TDWG) strongly suggests the deployment of Life Science Identifiers (LSID) as the preferred Globally Unique Identifier technology, and a transition to RDF-encoded metadata as defined by a set of simple vocabularies. The TDWG LSID vocabulary has been widely used in biodiversity and offers a wide coverage of concepts that are suitable for annotating the taxonomic information of an organism. The nomenclature used in this research is from the TDWG LSID vocabulary and, where necessary, appropriate vocabularies specific to the monogeneans are formed (see Additional file 1). Specific vocabularies (for example DiagnosticPartTerms) are needed because monogeneans are parasitic platyhelminths and are distinguished based on both soft reproductive anatomical features and the shapes and sizes of sclerotised hard parts such as the haptoral bar, anchor, hook and male and female copulatory organs.
Seven concepts are described from the monogenean data used in this paper. Specimen, TaxonName, PublicationCitation, KindofSpecimenTerm, TaxonRankTerms and PublicationTypeTerms are defined using the TDWG LSID controlled vocabulary, whereas DiagnosticPartTerms is a new concept. The Specimen concept represents the illustrated images of the haptoral bars of the monogeneans. TaxonName represents a single scientific name. PublicationCitation represents a reference to the publication of the monogenean species. KindofSpecimenTerm represents the specimen terms such as illustration, digital object and still image. TaxonRankTerms represents the taxon rank terms for taxonomic classification. PublicationTypeTerms represents the type of publication, for example an article in a journal or a book. DiagnosticPartTerms represents the names of the monogenean hard parts.
Defining properties and relationships
There are two types of properties in the semantic representation: object properties and datatype properties. Object properties are relationships between two individuals (they link an individual to an individual), whereas datatype properties describe relationships between an individual and data values. The properties defined for the seven concepts are listed here and their descriptions are available in Additional file 1.
Properties for specimen concept
Four object properties are defined under the Specimen concept (kindOfSpecimen, isHaptorBar, isCitedIn and typeForName), together with three datatype properties (specimenId, imgDir and imgDescription).
Properties for taxon name concept
Eight object properties are defined under the TaxonName concept (rank, isBelong, part, hasSpecies, hasGenus, hasFamily, hasOrder and isHostedIn), together with four datatype properties (nameComplete, authorship, year and locality).
Properties for publication citation concept
Two object properties are defined under the PublicationCitation concept (pubType and lists), together with five datatype properties (author, year, title, parentPublicationString and number).
Properties for diagnostic part terms, kind of specimen terms, taxon rank terms, publication type terms concepts
One datatype property, called definedTerm, is defined for the DiagnosticPartTerms, KindofSpecimenTerms, TaxonRankTerms and PublicationTypeTerms concepts. This property is given a generic name because it is shared by, and binds together, multiple concepts.
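The property inventory above can be captured in a small data structure. The following Python sketch is illustrative only: the `Prop` record, the `_props` helper, the `PROPERTIES` table and the placeholder concept name "Terms" are assumptions made for this example and are not taken from the MHBI ontology files. Only the property names come from the text. Counting definedTerm once, although it is shared by four term concepts, gives a total that agrees with the 27 properties reported in this paper.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Prop:
    name: str
    concept: str  # concept the property is defined under
    kind: str     # "object" or "datatype"


def _props(concept, kind, names):
    """Build Prop records from a whitespace-separated list of names."""
    return [Prop(n, concept, kind) for n in names.split()]


PROPERTIES = (
    _props("Specimen", "object",
           "kindOfSpecimen isHaptorBar isCitedIn typeForName")
    + _props("Specimen", "datatype", "specimenId imgDir imgDescription")
    + _props("TaxonName", "object",
             "rank isBelong part hasSpecies hasGenus hasFamily hasOrder isHostedIn")
    + _props("TaxonName", "datatype", "nameComplete authorship year locality")
    + _props("PublicationCitation", "object", "pubType lists")
    + _props("PublicationCitation", "datatype",
             "author year title parentPublicationString number")
    # definedTerm is shared by the four term concepts; "Terms" is a placeholder.
    + _props("Terms", "datatype", "definedTerm")
)

counts = Counter(p.kind for p in PROPERTIES)
print(len(PROPERTIES), counts["object"], counts["datatype"])
```

Keeping the object/datatype distinction as a field makes it easy to see which properties link individuals and which hold literal values.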
Semantic representation of data using the Web ontology language
Seven concepts, 27 properties, and the relationships between them represent the conceptualization of the data used in this paper. This conceptual framework needs to be converted into a machine-readable formal specification in order to reason about the identified concepts and eventually describe the data. This formal specification of a shared conceptualization is called an ontology.
OWL is an ontology language for the Semantic Web, developed by the World Wide Web Consortium (W3C) Web Ontology Working Group. OWL was primarily designed to represent information about categories of objects and how objects are interrelated, the sort of information that is often found in an ontology. OWL can also represent information about the objects themselves, the sort of information that is often thought of as data. OWL facilitates greater machine interpretability of Web content than that supported by the underlying XML, RDF, and RDF Schema representations by providing additional vocabulary along with a formal semantics. In this paper we utilise ontologies in OWL format to represent a shared structured vocabulary that describes the monogenean image data through the concepts, properties and relationships discussed above. Figure 2 depicts the whole ontology in a graph format.
The data described by the concepts is annotated in the form of instances. While there are no fixed rules for naming instances, the names should reflect the data they represent. For example, for the Specimen concept, the record (instance) of each haptoral bar image is given a unique label that includes its taxon name, the diagnostic part depicted by the image and its sequence number in the directory (as shown in Table 1). There are 159 instances of the Specimen concept, which represent all the haptoral bars in the monogenean images (see Table 2).
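A labelling scheme of this kind can be sketched as a small helper. The exact format used here is an assumption inferred from the single instance label quoted in the coherence discussion of this paper (bif-baungi-vb-i1). The function name, the part code "vb" and the "i" sequence prefix are illustrative and should be checked against Table 1.

```python
def make_specimen_label(genus_abbrev: str, species: str,
                        part_code: str, seq: int) -> str:
    """Join taxon name, diagnostic part code and sequence number into a label.

    The hyphenated, lower-case layout is an assumed convention.
    """
    if seq < 1:
        raise ValueError("sequence number must be positive")
    parts = [genus_abbrev, species, part_code, f"i{seq}"]
    return "-".join(p.strip().lower() for p in parts)


label = make_specimen_label("Bif", "baungi", "vb", 1)
print(label)
```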
Linking data from other ontologies
Since monogenean species are parasites on fish, frogs and turtles, linking the monogenean data to their host data will provide more information about the monogeneans. In this paper, the data we used are mainly of monogenean species found in fish, so we decided to build a simple Fish ontology with a TaxonName concept to demonstrate how the host ontology can be linked to the MHBI ontology. The two ontologies are merged by redefining the datatype property isHostedIn of the TaxonName concept in the MHBI ontology as an object property that points to the TaxonName concept in the Fish ontology, as shown in the graph model (Figure 3).
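The effect of treating isHostedIn as an object property can be illustrated with plain subject-predicate-object triples. This is a minimal sketch and not the authors' OWL encoding: the prefixes `mhbi:` and `fish:`, the host individual `fish:HostSpeciesA` and the helper `hosts_of` are hypothetical placeholders. It shows the two triple sets being combined by set union, with the link carried by one cross-ontology triple, so that neither ontology's own statements are altered.

```python
# Triples are (subject, predicate, object) tuples; identifiers are placeholders.
mhbi = {
    ("mhbi:BifBaungi", "rdf:type", "mhbi:TaxonName"),
    ("mhbi:BifBaungi", "mhbi:isBelong", "mhbi:Bifurcohaptor"),
}
fish = {
    ("fish:HostSpeciesA", "rdf:type", "fish:TaxonName"),  # hypothetical host
}
# As an object property, isHostedIn points at an individual in the Fish
# ontology instead of holding a literal string.
links = {("mhbi:BifBaungi", "mhbi:isHostedIn", "fish:HostSpeciesA")}

merged = mhbi | fish | links


def hosts_of(taxon, triples):
    """Return hosts of `taxon` that are typed as fish:TaxonName."""
    typed = {s for s, p, o in triples
             if p == "rdf:type" and o == "fish:TaxonName"}
    return sorted(o for s, p, o in triples
                  if s == taxon and p == "mhbi:isHostedIn" and o in typed)


print(hosts_of("mhbi:BifBaungi", merged))
```

Without the Fish triples the link has no typed target, which is why the helper returns nothing when only `mhbi | links` is queried.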
We may consider the ontology evaluation process either from the technical point of view (quality of the designed ontology) or from the practical point of view (usability of the designed ontology). To evaluate the quality of the designed ontologies, we adopted five criteria suggested by Gruber against which these ontologies are evaluated. This methodology was successfully used previously to evaluate the Protein Ontology. The five criteria are clarity, coherence, extendibility, ontology commitment and encoding bias. A discussion of how these criteria are applied to the concepts and properties in the MHBI ontology is presented in the Results section.
We introduce some level of formality into this discussion by adopting criteria suggested by Gruber against which the ontology needs to be evaluated.
Definitions within an ontology need to be stated in such a way that the number of possible interpretations of a concept is restricted. This contributes to the effectiveness of communication between agents. In the design of the MHBI ontology, we required that for each concept c with property p, the pair (c, p) is unique. This rule was enforced during the design of the MHBI ontology, so the uniqueness of the definition of concepts is guaranteed (see Figure 2). The clarity of the MHBI ontology was also checked by running the 8 tests listed below and making sure all of them return true:
No Cardinality Restriction on Transitive Properties
No Classes or Properties in Enumerations
No Import of System Ontologies
No Meta-Classes
No Properties with Class as Range
No Sub Classes of RDF Classes
No Super or Sub Properties of Annotation Properties
Transitive Properties cannot be Functional
Example results for Test 1 and Test 8 are shown in Figure 4. Biological data evolves over time, so a new data type may need to be inserted into the ontology at any time. Thus we have not assigned any cardinality restriction to transitive properties. In addition, a transitive property cannot be functional because it may relate an individual to more than one instance. The example is explained further under Coherence Test 11.
As for Test 2, Figure 2 clearly shows that there are no classes or properties in enumerations. For Test 3, as illustrated in Figure 5, even though we followed the TDWG LSID standard for the vocabulary, we created our own ontology based on our requirements study and therefore did not import any other system ontologies. For Test 7, we used only the built-in annotation properties in Protégé, and there are no super or sub properties of annotation properties, as shown in Figure 5.
For Test 4, Test 5 and Test 6, as illustrated in Figure 6, the MHBI ontology contains no meta-classes, no properties with a class as range and no subclasses of RDF classes.
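The two clarity checks that concern transitive properties (Test 1 and Test 8) are simple enough to sketch. The code below is an illustration under assumptions: the `PropertyDecl` record and its fields are invented for this example, and the flags set on isBelong and typeForName follow the description in the text. It is not the test suite that was run in Protégé.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PropertyDecl:
    name: str
    transitive: bool = False
    functional: bool = False
    max_cardinality: Optional[int] = None  # None means no restriction


def clarity_violations(props):
    """Return (property, message) pairs for clarity Tests 1 and 8."""
    problems = []
    for p in props:
        if p.transitive and p.max_cardinality is not None:
            problems.append(
                (p.name, "cardinality restriction on transitive property"))
        if p.transitive and p.functional:
            problems.append(
                (p.name, "transitive property declared functional"))
    return problems


declared = [
    PropertyDecl("isBelong", transitive=True),
    PropertyDecl("typeForName", functional=True),
]
print(clarity_violations(declared))  # an empty list means both tests pass
```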
The definitions of concepts given in the ontology should be consistent. Only inferences consistent with existing definitions should be allowed. The formal part of the MHBI Ontology is checked by running the 12 consistency tests listed below and ensuring that, for these tests, all return true:
Domain of a Property should not be empty
Domain of a Property should not contain redundant Classes
Range of a Property should not contain redundant Classes
Domain of a Sub Property can only narrow Super Property
Range of a Sub Property can only narrow Super Property
Inverse of Functional must be Inverse Functional
Inverse of Inverse Functional must be Functional
Inverse of Sub Property must be Subproperty of Inverse of Super Property
Inverse of Symmetric Property must be Symmetric Property
Inverse of Top Level Property must be Top Level Property
Inverse of Transitive Property must be Transitive Property
Inverse Property must have matching Range and Domain
The results of Test 1 to Test 3 are presented in Additional file 1. As the results show, the domain and range of all the properties are assigned and contain no redundant classes.
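Coherence Tests 1 and 12 can be expressed as checks over declared domains and ranges. In this sketch the `ObjectProp` record is an assumption, and the domain and range given to isBelong and hasSpecies (both TaxonName) are inferred from the text rather than copied from Additional file 1.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ObjectProp:
    name: str
    domain: frozenset
    range_: frozenset
    inverse: Optional[str] = None


def coherence_violations(props):
    """Check Test 1 (non-empty domain) and Test 12 (inverse matches)."""
    by_name = {p.name: p for p in props}
    problems = []
    for p in props:
        if not p.domain:
            problems.append((p.name, "empty domain"))
        inv = by_name.get(p.inverse) if p.inverse else None
        # An inverse property swaps the domain and range of the original.
        if inv and (inv.domain != p.range_ or inv.range_ != p.domain):
            problems.append((p.name, "inverse domain/range mismatch"))
    return problems


taxon = frozenset({"TaxonName"})
props = [
    ObjectProp("isBelong", taxon, taxon, inverse="hasSpecies"),
    ObjectProp("hasSpecies", taxon, taxon, inverse="isBelong"),
]
print(coherence_violations(props))  # an empty list means both tests pass
```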
The results of Test 4, Test 5, Test 8 and Test 10 are illustrated in Figure 7. The isHaptorBar property is a sub property of the super property named part. Thus, the domain and range of the sub property are defined by the super property. In this ontology, the fullImage, isBar, isHaptor and isHaptorBar sub properties are classified under the part property, because each haptoral bar image specimen may be annotated with any of these properties.
Test 6 and Test 7 are applicable to the typeForName and part properties. If a property is inverse functional, then its inverse property is functional. For example, as illustrated in Figure 4, typeForName is a functional property while part is an inverse functional property in this ontology. Thus, if we state that BifBaungi typeForName bif-baungi-vb-i1, then because of the inverse property we can infer that bif-baungi-vb-i1 part BifBaungi.
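The inference over an inverse property can be sketched with a set of triples. The helper names below are hypothetical. The functional check also assumes that distinct names denote distinct individuals, which OWL itself does not assume (an OWL reasoner would instead conclude that the two objects are the same individual).

```python
def infer_inverse(triples, prop, inverse_prop):
    """For every (s, prop, o), add the inverse statement (o, inverse_prop, s)."""
    return triples | {(o, inverse_prop, s) for s, p, o in triples if p == prop}


def is_functional(triples, prop):
    """True if each subject has at most one object for `prop`."""
    seen = {}
    for s, p, o in triples:
        if p == prop and seen.setdefault(s, o) != o:
            return False
    return True


asserted = {("BifBaungi", "typeForName", "bif-baungi-vb-i1")}
closed = infer_inverse(asserted, "typeForName", "part")

print(("bif-baungi-vb-i1", "part", "BifBaungi") in closed)
print(is_functional(closed, "typeForName"))
```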
An example result for Test 11 is also illustrated in Figure 4, using the transitive property isBelong. Since BifBaungi isBelong Bifurcohaptor, and Bifurcohaptor isBelong Ancylodicoididae, we can infer that BifBaungi isBelong Ancylodicoididae. As for hasSpecies, the inverse of this transitive property, we can infer that Ancylodicoididae hasSpecies BifBaungi. Furthermore, as presented in Additional file 1, the inverse property in this example fulfils Test 12, since its range and domain match.
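The transitive inference in this example can be reproduced with a few lines of Python. This is a sketch of the general technique (a fixed-point computation of the transitive closure), not of the reasoner used by the authors. The function name is hypothetical, and the instance names are the ones given in the text.

```python
def transitive_closure(pairs):
    """Return the transitive closure of a set of (a, b) pairs."""
    closure = set(pairs)
    while True:
        new = {(a, d)
               for a, b in closure
               for c, d in closure
               if b == c} - closure
        if not new:
            return closure
        closure |= new


is_belong = {
    ("BifBaungi", "Bifurcohaptor"),
    ("Bifurcohaptor", "Ancylodicoididae"),
}
closed = transitive_closure(is_belong)

# The inverse property (hasSpecies in the text) reverses every pair.
has_species = {(b, a) for a, b in closed}

print(("BifBaungi", "Ancylodicoididae") in closed)
print(("Ancylodicoididae", "BifBaungi") in has_species)
```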
Figure 8 illustrates an example of a Test 9 result. It shows an example of the symmetric property hasSynonym. The instance BycGharui is related to the instance SiloGharui via the hasSynonym property. Then we can infer that SiloGharui must also be related to BycGharui via the hasSynonym property. Put another way, the property is its own inverse property.
It should be possible to extend the ontology without altering the existing definitions. Easy ontology extension is an important requirement, as new knowledge emerges each day and may need to be added to an existing ontology. To make the MHBI ontology extendable, the design consists of a hierarchical classification of concepts represented as classes, from general to specific. In the MHBI ontology, the notions of classification, reasoning and consistency are applied by defining new concepts from defined generic concepts. The concepts derived from generic concepts are placed precisely into the class hierarchy of the MHBI ontology to completely represent the information defining a specimen.
Figure 9 illustrates an example of this criterion. In the DiagnosticPartTerms concept of the MHBI ontology, we have considered HaptorSclerotisedpartBar, HaptorSclerotisedpartAnchor and FullImage. In the future we plan to include other diagnostic parts such as HaptorSclerotisedpartMarginalHook, HaptorSclerotisedpartPatch and HaptorSclerotisedpartOther. Thus this ontology does not sanction a preference for one diagnostic part only; it allows other diagnostic parts to be defined and related to the existing diagnostic parts.
An ontology should make as few claims as possible about the domain while still supporting the intended knowledge sharing. The MHBI ontology has a low ontology commitment for a domain ontology, because it reuses most of the concepts that have already been used to represent monogenean data and knowledge, and proposes few new concepts. The low ontology commitment of the MHBI ontology makes it more extendible and reusable, as shown in Figure 10. Also, the fewer new concepts the community needs to agree upon, the easier agreement becomes.
An ontology representation language should be as independent as possible from the use of the ontology. In developing the MHBI ontology, the choice of OWL as the representation language keeps the encoding bias to a minimum, as the MHBI ontology will be used by all stakeholders of the taxonomy domain, such as domain experts, pharmaceutical companies, researchers and students.
In this paper, we have used the TDWG LSID vocabulary to represent our data using semantics and we have also defined new vocabulary which is specific for annotating monogenean haptoral bar images (see Additional file 1 for the list and description).
MHBI and MHBI-fish ontologies
Discussion and conclusions
Semantic annotations of morphological descriptions that have been proposed to date have no information on the actual annotation of morphological descriptions or morphological images. In this paper, we have annotated the monogenean images semantically and have developed an MHBI ontology, which was eventually merged with a Fish ontology to form the MHBI-Fish ontologies. This will enable linking of information from the monogenean ontology to the host species found in the fish ontology without changing the underlying schema for either of the ontologies.
To semantically represent our data we have used the vocabularies in TDWG LSID, which is the standard semantic naming convention for biodiversity information. We have also defined a new vocabulary (Additional file 1) because this is the first time that images of the monogenean diagnostic hard parts are being annotated semantically. In this paper, we have identified 7 concepts and 27 properties (object and datatype properties in the ontology) to represent descriptions of 159 images (instances) (see Table 2).
In the future, we intend to develop a semantic query model through which a researcher can search using any word or phrase related to monogeneans and their hosts. We also intend to annotate images of other diagnostic hard parts to build a complete monogenean ontology, and to build specific ontologies for all the monogenean hosts such as fish, amphibians and reptiles. These ontologies will form the basis of a monogenean knowledge base to assist researchers in retrieving information for their analysis.
Furthermore, query results from the MHBI ontology presented in this paper are used as training set images for Content Based Image Retrieval (CBIR). We have used this ontology to improve the efficiency of CBIR for biodiversity [20, 21]. As a result, the relevancy rate of the results provided by CBIR increases, because the training set is smaller and most of its images are relevant to the query image. The retrieved images in the CBIR results are also annotated, providing more information to the researcher.
Biota: The Biodiversity Database Manager. http://viceroy.eeb.uconn.edu/Biota
InsideWood: Search the Inside Wood Database. http://insidewood.lib.ncsu.edu
MonoDb Homepage. http://www.monodb.org/index.php
Goldberg IG, Allan C, Burel JM, Creager D, Falconi A, Hochheiser H, Johnston J, Mellen J, Sorger PK, Swedlow JR: The Open Microscopy Environment (OME) Data Model and XML File: Open Tools for Informatics and Quantitative Analysis in Biological Imaging. Genome Biol 2005, 6: R47. 10.1186/gb-2005-6-5-r47
Ahmed WM, Lenz D, Jia L, Robinson JP, Ghafoor A: XML-Based Data Model and Architecture for a Knowledge-Based Grid-Enabled Problem-Solving Environment for High-Throughput Biological Imaging. Information Technology in Biomedicine, IEEE Transactions on 2008,12(2):226-240. 10.1109/TITB.2007.904153
Plazi: Access to Taxonomic Literature. http://plazi.org/
Arpah A: The use of information classification in face recognition and identification using eigenfaces. Master's thesis. Kuala Lumpur: University of Malaya; 2007.
Arpah A, Sarinder KKS, Lim LHS: A Database Management System (DBMS) for Monogenean Taxonomy. In Proceedings of 2010 International Conference on Environmental Science and Technology: 23-24 April 2010; Bangkok, Thailand. Edited by: Saji B, Parvinder Singh S. Research Publishing Services, Singapore; 2010.
Taniar D, Rusu LI: Strategic Advancements in Utilizing Data Mining and Warehousing Technologies: New Concepts and Developments. Hershey, Pennsylvania (USA): IGI Global; 2010.
Toby S, Colin E, Jamie T: Programming the Semantic Web. Sebastopol, CA: O’Reilly Media; 2009.
Taxonomic Data Working Group. http://tdwg.org
Lim LHS: Bravohollisia Bychowsky and Nagibina, 1970 and Caballeria Bychowsky and Nagibina, 1970 (Monogenea, Ancyrocephalidae) from Pomadasys hasta (Bloch) (Pomadasyidae), with the description of a new attachment mechanism. Syst Parasitol 1995,32(3):211-224. 10.1007/BF00008830
Gruber TR: Towards Principles for the Design of Ontologies Used for Knowledge Sharing. Int J Hum Comput Stud 1995, 43: 907-928. 10.1006/ijhc.1995.1081
Deborah LM, Frank van H: OWL Web Ontology Language Overview. http://www.w3.org/2004/OWL/
Sidhu AS, Dillon TS, Chang E, Sidhu BS: Protein ontology: vocabulary for protein data. In 3rd International IEEE Conference on Information Technology and Applications. Edited by: He X, Hintz T, Piccardi M, Wu Q, Huang M, Tien D. Sydney: IEEE CS Press; 2005:465-469.
Sidhu AS, Dillon TS, Chang E: Protein Ontology. In Biological Database Modeling. Edited by: Chen J, Sidhu AS. New York: Artech House; 2007:63-80.
Michael KS, Chris W, Deborah LM: OWL Web Ontology Language Guide. http://www.w3.org/TR/owl-guide/
Cui H: Semantic Annotation of Morphological Descriptions: An Overall Strategy. BMC Bioinforma 2010, 11: 278. 10.1186/1471-2105-11-278
Arpah A: Architecture for Biodiversity Image Retrieval Using Ontology and Content Based Image Retrieval (CBIR). PhD thesis (unpublished). Kuala Lumpur: University of Malaya; 2012.
Arpah A, Lim SLH, Amandeep SS, Sarinder KD: Biodiversity image retrieval framework for monogeneans. Systematics and Biodiversity 2013. 10.1080/14772000.2012.761655
This project was supported by the University of Malaya’s Postgraduate Research Fund (PS284/2009B) to the first author and the University of Malaya Research Grant (RG053/09SUS) to the second and fourth authors.
The authors declare that they have no competing interests.
SKD headed the study and structured the whole research. AA developed the system as part of her PhD. AS assisted in ontology evaluation. AA, LLHS and AS assisted in manuscript writing. All authors contributed to this study. All authors read and approved the final manuscript.
Arpah Abu contributed equally to this work.
Abu, A., Susan, L.L.H., Sidhu, A.S. et al. Semantic representation of monogenean haptoral Bar image annotation. BMC Bioinformatics 14, 48 (2013). https://doi.org/10.1186/1471-2105-14-48