  • Poster presentation
  • Open access

Semantic integration of isolation habitat and location in StrainInfo

StrainInfo (http://www.straininfo.net) is a global catalog of microbial material that builds upon the catalogs of Biological Resource Centers (BRCs) by integrating catalog entries of equivalent microbial material. Currently, the integration algorithm resolves the equivalent cultures and links all downstream information [1]. However, increasing the information content of StrainInfo requires adding fine-grained semantic information. This information enters StrainInfo at the culture level (through synchronization with BRC catalogs), but must be integrated at the strain level (i.e. the set of equivalent cultures) in order to be presented on so-called strain passports.

The adoption of Microbiological Common Language (MCL) XML synchronization quickly increased the volume of semantic data in StrainInfo [2]. However, the actual values of the different semantic fields are still raw textual entries: they vary in detail, come in different forms or languages, and sometimes contain inconsistencies or outright errors. Consequently, generating a strain-level consensus value for each field requires a specialized semantic integration of this data. As a case study for semantic integration in StrainInfo, the focus was put on the isolation habitat and location fields because of their importance from both a biological and a legal (IP rights) perspective. An example of such data can be found in Table 1.

Table 1 Example isolation habitat and location data of a Pichia guilliermondii strain, as listed by different BRCs. For each column, we want to calculate a consensus value for the complete strain.

To integrate geographical information, named entity recognition is performed by annotating all geographic names with features from the GeoNames ontology. This yields a multitude of annotations, each matching a name with one or more geographical features. Because many geographic names are not unique (e.g. Cambridge is annotated with both the USA and the UK instance), irrelevant annotations are removed using other, higher-order features such as countries or continents found in the strain's data. In addition, the most specific feature is selected by removing the higher-order features, as these are redundant and can be inferred from the ontology. The remaining annotation is the integration result; if multiple annotations remain, or if the remaining features are too distant from each other, the data are considered inconsistent.
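
The disambiguation and pruning steps described above can be illustrated with a minimal Python sketch. It is not the StrainInfo implementation: the tiny `FEATURES` table is a hypothetical stand-in for GeoNames, the function names are invented for illustration, and the distance check mentioned in the text is left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Feature:
    """A toy geographic feature; `parent` is the id of the enclosing feature."""
    id: str
    name: str
    parent: Optional[str] = None


# Hypothetical mini-gazetteer standing in for GeoNames (assumption).
FEATURES = {f.id: f for f in [
    Feature("EU", "Europe"),
    Feature("NA", "North America"),
    Feature("UK", "United Kingdom", "EU"),
    Feature("US", "United States", "NA"),
    Feature("CAM-UK", "Cambridge", "UK"),
    Feature("CAM-US", "Cambridge", "US"),
]}


def ancestors(fid):
    """Return the ids of all features that enclose `fid`."""
    result = set()
    parent = FEATURES[fid].parent
    while parent is not None:
        result.add(parent)
        parent = FEATURES[parent].parent
    return result


def annotate(names):
    """Match every name with all features that carry exactly that name."""
    return {n: {f.id for f in FEATURES.values() if f.name == n} for n in names}


def integrate(names):
    """Return the most specific features supported by all names of a strain."""
    annotations = annotate(names)
    # Features matched by unambiguous names act as disambiguation context.
    context = set()
    for ids in annotations.values():
        if len(ids) == 1:
            context |= ids
    candidates = set()
    for ids in annotations.values():
        if len(ids) > 1:
            supported = {i for i in ids if ancestors(i) & context}
            ids = supported or ids  # keep everything if the context cannot decide
        candidates |= ids
    # Higher-order features are redundant: they follow from the hierarchy.
    redundant = set()
    for fid in candidates:
        redundant |= ancestors(fid)
    return candidates - redundant
```

In this sketch, `integrate(["Cambridge", "United Kingdom", "Europe"])` leaves a single feature (the UK instance of Cambridge), whereas `integrate(["Cambridge"])` alone leaves two features, which under the rule above would be flagged as ambiguous or inconsistent data.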

The habitat fields can be integrated using a similar algorithm. However, to obtain sufficient ontological coverage, a combination of the Environmental Ontology (EnvO), the NCBI Taxonomy and the Foundational Model of Anatomy (FMA) ontology is used. This may yield multiple orthogonal annotations, but for this field multiple annotations increase the information content and therefore do not indicate inconsistencies.
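
The difference with the geographic case can be sketched in the same way. The example below is only an illustration under stated assumptions: the `TERMS` table is a hypothetical handful of labels, not real EnvO, NCBI Taxonomy or FMA content, the habitat strings are made up, and the naive substring matching merely stands in for proper named entity recognition.

```python
from collections import defaultdict

# Hypothetical term table: label -> (ontology, parent label or None).
TERMS = {
    "soil": ("EnvO", None),
    "forest soil": ("EnvO", "soil"),
    "homo sapiens": ("NCBI Taxonomy", None),
    "lung": ("FMA", None),
}


def ancestors(label):
    """Return all broader terms of `label` within its ontology."""
    result = set()
    parent = TERMS[label][1]
    while parent is not None:
        result.add(parent)
        parent = TERMS[parent][1]
    return result


def annotate(text):
    """Naive matching: every known label that occurs in the text."""
    lowered = text.lower()
    return {label for label in TERMS if label in lowered}


def integrate_habitat(entries):
    """Merge the annotations of all cultures; keep the most specific terms."""
    matched = set()
    for entry in entries:
        matched |= annotate(entry)
    # Broader terms are redundant, as in the geographic case.
    redundant = set()
    for label in matched:
        redundant |= ancestors(label)
    # Terms from different ontologies are orthogonal, so all of them are kept.
    consensus = defaultdict(set)
    for label in matched - redundant:
        consensus[TERMS[label][0]].add(label)
    return dict(consensus)
```

Here `integrate_habitat(["Soil", "forest soil", "Homo sapiens, lung"])` keeps one term per ontology, and the result is treated as richer information rather than as a conflict.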

References

  1. Dawyndt P, Vancanneyt M, De Meyer H, Swings J: Knowledge accumulation and resolution of data inconsistencies during the integration of microbial information sources. IEEE Trans. Knowl. Data Eng 2005, 17: 1111–1126. 10.1109/TKDE.2005.131

  2. Verslyppe B, Kottmann R, De Smet W, De Baets B, De Vos P, Dawyndt P: Microbiological Common Language (MCL): a standard for electronic information exchange in the Microbial Commons. Res. Microbiol 2010, 161(6): 439–445. 10.1016/j.resmic.2010.02.005

Corresponding author

Correspondence to Bert Verslyppe.

Rights and permissions

Open Access This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution 2.0 International License (https://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

About this article

Cite this article

Verslyppe, B., De Smet, W., De Vos, P. et al. Semantic integration of isolation habitat and location in StrainInfo. BMC Bioinformatics 11 (Suppl 5), P3 (2010). https://doi.org/10.1186/1471-2105-11-S5-P3

  • DOI: https://doi.org/10.1186/1471-2105-11-S5-P3