Ontology shift 1: Dealing with anomalies
One type of ontology shift occurs when curators become aware of a mismatch between the GO representation and reality, in which a term is incorrectly related to other terms in the ontology. The discovery of such anomalies leads to revisions of the ontology, which needs to be both internally consistent and faithful to reality.
As an example, consider the GO process term "serotonin secretion". Serotonin is a small molecule produced by a variety of cells, including neurons, enterochromaffin cells, basophils, and mast cells. It plays several roles in the body, most famously as a neurotransmitter, but also in contraction of the gut and mediation of allergic inflammation by mast cells. The term "serotonin secretion" was initially added to the ontology as an is_a descendant of both "hormone secretion" and "neurotransmitter secretion," since the neurotransmitter and hormone roles of this molecule were the only roles considered when the term was developed. When a new term, "serotonin secretion during acute inflammatory response", was subsequently created as an is_a descendant of "serotonin secretion" (Figure 1A), it was realised that it was not biologically accurate to state that "serotonin secretion during acute inflammatory response" was a subtype of "neurotransmitter secretion," because in inflammatory responses serotonin acts on target cells other than neurons. The erroneous placement of "serotonin secretion" was corrected, resulting in the graph shown in Figure 1B. The new term "serotonin secretion, neurotransmission" was added as a subclass of "serotonin secretion" to capture the process of serotonin secretion during neurotransmission.
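The effect of this correction on inferred ancestry can be illustrated with a minimal Python sketch. It is not GO tooling: the `ancestors` helper and the dictionary encoding are hypothetical, and because Figure 1B is not reproduced in the text, the "after" graph rests on two assumptions: that "serotonin secretion" keeps its "hormone secretion" parent, and that the new neurotransmission term is also placed under "neurotransmitter secretion".

```python
def ancestors(term, parents):
    """Return every term reachable from `term` by following is_a edges."""
    seen = set()
    stack = [term]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


INFLAMMATORY = "serotonin secretion during acute inflammatory response"
NEURO = "serotonin secretion, neurotransmission"

# Before the correction: is_a edges as described in the text (Figure 1A).
before = {
    "serotonin secretion": ["hormone secretion", "neurotransmitter secretion"],
    INFLAMMATORY: ["serotonin secretion"],
}

# After the correction. ASSUMPTION: the exact edges of Figure 1B are not
# given in the text; only the removal of the neurotransmitter parent and the
# new subclass of "serotonin secretion" are stated.
after = {
    "serotonin secretion": ["hormone secretion"],
    NEURO: ["serotonin secretion", "neurotransmitter secretion"],
    INFLAMMATORY: ["serotonin secretion"],
}

# True before the correction, False afterwards.
print("neurotransmitter secretion" in ancestors(INFLAMMATORY, before))
print("neurotransmitter secretion" in ancestors(INFLAMMATORY, after))
```

Because is_a is transitive, a single misplaced parent high in the graph silently affects every descendant added later, which is why the anomaly only surfaced once the inflammatory-response term was created.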
Another example is the disambiguation of immune responses from defense responses. Regulatory immune responses such as those involved in tolerance induction to non-self antigens, which prevent inappropriate responses to substances in food, for instance, challenge the idea that every "immune response" is also a "defense response". Yet, these two terms were initially represented as synonymous within GO. This situation was discussed during a GO meeting at The Institute for Genomic Research in November 2005. Several biological cases were presented as anomalous under the representation in use at the time, and it was determined that this warranted a broad shift in the ontology itself. "Immune response" and "defense response" thus became terms that share a common ancestor term but have no direct relationship between them. This enabled curators to distinguish between the two types of responses, while signalling their common origin as a reaction to a stimulus.
Ontology shift 2: Expanding scope
A second type of ontology shift occurs when GO needs to be extended to cover terminology and data coming from new research fields, biological issues or species.
The problems caused by the addition of immunological terms illustrate what happens when knowledge from a new field is incorporated into the ontology. The scope of what GO considered an "immune response" was expanded by a major revision discussed at a GO meeting in 2005 [9] and completed in September 2006 [10], as it became clear that the concept of having an "immune system" was not restricted to vertebrates. Biologists working in both invertebrate and plant systems used the term "immune system" to describe the cells and biological processes mediating innate immune responses in these organisms [11]. Until the revision, GO had considered immune responses in higher vertebrates only, and terms related to innate immune responses in other organisms, such as the "incompatible interaction" of plants and "melanization defense response" in insects, were found in other areas of GO. After the revision, all the various types of immune responses in different organisms were grouped together as types of "immune response".
What happens when GO is expanded to cover a new biological issue is illustrated by the 2004 development of ontology terms to describe host-parasite interactions. The early versions of GO contained few terms to describe the interactions that occurred between hosts and their symbionts. Those that did exist, such as "evasion of host defense response" and "cell invasion," shared no common ancestor and were often ill-defined with respect to which organism in a particular interaction the term referred to. For example, the process of cell lysis can be induced in a host organism by its parasite, or can be an endogenous process whereby the immune system destroys its own infected cell, but both were represented by a single term, "cytolysis". In 2004 the PAMGO (Plant-Associated Microbe Gene Ontology, http://pamgo.vbi.vt.edu) Consortium was formed and worked with GO to develop terms relating to host-parasite interactions, specifically for plant parasites. The initial set of around 450 terms added to biological process had a single ancestor term "interaction between organisms" (later renamed "multi-organism process"), which encompassed not only host-parasite interactions but also processes as diverse as "female pregnancy" and "biofilm formation". The multi-organism process sub-hierarchy now contains over 1300 terms, and in addition there are around 80 terms in cellular component to describe locations within other organisms, such as "host". The introduction of these terms also required a change to the annotation methodology so that information about the taxa of both species involved in an interaction could be captured (see http://www.geneontology.org/GO.annotation.conventions.shtml#interactions).
Finally, ontology shifts have frequently occurred when GO was expanded to include data from a new species. GO aims to support the comparative analysis of gene products across species, and its terms need to accommodate differences in the biology of organisms ranging from fruit flies to mice and plants [12]. Especially when grouping together species from different kingdoms, GO has been radically modified to avert the danger of biological inaccuracies. When GO was first applied to prokaryotic gene products, for instance, many relationships within the cellular component ontology, and some in the biological process ontology, had to be altered to allow for the fact that prokaryotes lack nuclei and several other membrane-bounded organelles. For example, the enzyme complexes that carry out the reactions of the TCA cycle (also known as the Krebs cycle or citric acid cycle) are located in the mitochondrial matrix in eukaryotes. Cellular component terms representing these complexes were originally grouped under "tricarboxylic acid cycle enzyme complex", which was in turn part_of "mitochondrial matrix". In bacteria, which do not have mitochondria, analogous complexes are located in the cytosol; the way in which GO related the TCA cycle complex terms to "mitochondrial matrix" was thus inaccurate (Figure 2a). To address this, the existing term was renamed to add "mitochondrial", thus making information that was implicit in the ontology structure explicit in term names. Two new terms were then added: the first uses the non-location-specific name and is a descendant of "cytoplasm"; the second has a name and path specifying that the complex is located in the cytosol (Figure 2b).
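The consequence for annotation can be sketched in a few lines of Python. This is an illustrative model only: the `implied_terms` helper is hypothetical, the exact wording of the renamed and cytosolic term names is assumed from the description above, the is_a links between the three complex terms are assumptions, and "descendant of cytoplasm" is simplified to a direct part_of edge.

```python
GENERIC = "tricarboxylic acid cycle enzyme complex"
MITO = "mitochondrial tricarboxylic acid cycle enzyme complex"  # assumed wording
CYTO = "cytosolic tricarboxylic acid cycle enzyme complex"      # assumed wording

# Edges are (child, relation, parent).
# Before the revision (Figure 2a), as described in the text.
before = [(GENERIC, "part_of", "mitochondrial matrix")]

# After the revision (Figure 2b). Edges marked "assumed" are not stated
# in the text and are included only to make the sketch complete.
after = [
    (MITO, "part_of", "mitochondrial matrix"),  # renamed original term
    (MITO, "is_a", GENERIC),                    # assumed
    (GENERIC, "part_of", "cytoplasm"),          # simplified to a direct edge
    (CYTO, "is_a", GENERIC),                    # assumed
    (CYTO, "part_of", "cytosol"),
]


def implied_terms(term, edges):
    """Collect every term implied by an annotation to `term`."""
    implied, frontier = set(), [term]
    while frontier:
        current = frontier.pop()
        for child, _relation, parent in edges:
            if child == current and parent not in implied:
                implied.add(parent)
                frontier.append(parent)
    return implied


# A bacterial gene product annotated before and after the revision.
print(implied_terms(GENERIC, before))  # includes "mitochondrial matrix"
print(implied_terms(CYTO, after))      # no mitochondrial location implied
```

Under the original structure, any annotation to the generic complex term placed the gene product in the mitochondrial matrix, whatever the organism; after the revision, a bacterial annotation to the cytosolic term implies only cytosolic and cytoplasmic locations.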
Ontology shift 3: Dealing with diverging definitions across communities
A third type of ontology shift results from the need to deal with diverging definitions across research communities. Often, given the diversity and fragmentation typical of biological and biomedical research, the same phrase is used in different ways depending on the research context [3]. Maintaining univocity (a word or phrase having a single meaning) is essential in developing unambiguous ontologies, so it is necessary to alter the structure of GO wherever a word or phrase turns out to have multiple meanings. Not surprisingly, this type of ontology shift often coincides with ontology shifts of type two detailed above: divergences in the use of terms across communities are commonly discovered when new terminologies, fields or species are added to GO.
The 2001 transition to include plants in the ontology is a case in point. Early on, GO had only one term for gamete formation: "gametogenesis", which was defined as the "generation, maintenance, and proliferation of gametes". The meaning of "gamete" was not specified, but the term and definition were generated with animal gamete formation in mind. Plant biologists, however, use "gametogenesis" to refer to the generation of a gametophyte, that is, a plant in the haploid phase that can produce gametes. An extensive set of changes was required to remove ambiguity in the usage of "gametogenesis", and add terms to represent plant biology. The definition of "gametogenesis" was altered to define gamete as "a haploid reproductive cell" and to remove the mention of proliferation. For plant processes, a new term "gametophyte development" was added, its name and definition clearly referring to the relevant phase of a plant life cycle.
Ontology shift 4: Mirroring scientific advance
A fourth type of ontology shift occurs in response to new evidence which changes the understanding of a given entity or process so that its definition and relations to other terms also need to change.
An example of this involves the term "cytoskeleton". For several decades, cytoskeletal structures such as microfilaments, microtubules, and intermediate filaments were observed only in eukaryotic cells, and were therefore thought to be absent from prokaryotic cells. Accordingly, the definition for the GO cellular component term "cytoskeleton" followed one of many dictionary and textbook definitions, beginning "Any of the various filamentous elements that form the internal framework of eukaryotic cells". In recent years, however, evidence has accumulated that bacterial cells do contain cytoskeletal structures [13, 14]. To accommodate these discoveries, and to facilitate annotation of bacterial cytoskeletal gene products, the GO definition was broadened by removing the word "eukaryotic", and with it the restriction on the species to which the term could be applied. Deletion of a single word from a term definition thus allowed GO to capture a revolutionary advance in the research community's understanding of both prokaryotic cell organization and the taxonomic distribution of cytoskeletal structures.
The definition of the term "conoid" is another example of how ontology can shift in response to scientific developments. The conoid is a cytoskeletal element that forms part of the apical complex, a distinctive and elaborate structure found in apicomplexan parasites [15]. Based on electron microscopic analysis, the conoid was known to consist of fibers; based on the prevailing hypothesis that the fibers were microtubules, GO included a cellular component term, "conoid", which was an is_a descendant of "microtubule". The definition was: "Coiled microtubules within both the polar and basal rings of the apical complex of an apicomplexan parasite." More recently, Hu et al. [16] showed that the conoid is indeed composed primarily of tubulin, but the tubulin structure differs markedly from that of typical microtubules. This improved understanding of conoid structure had two consequences for "conoid". First, the is_a relationship to "microtubule" was removed, because the conoid could no longer be considered a type of microtubule. Second, the text definition was changed to remove information now known to be false and to more accurately describe the structure of a conoid.
Ontology shift 5: Adding relations
One last type of ontology shift concerns changes to the type of relations deemed to hold between ontology terms. This shift is more complex than the previous four, since it potentially affects the whole ontology and requires a revision of the whole system in order to be implemented.
As originally used in GO, the part_of relationship was not rigorously defined. This led to a number of problems, the most serious being that part_of was used with different levels of stringency: in some cases all of the subclass is part of some of the superclass (all-some), while in others only some of the subclass is part of some of the superclass [17]. For example, the TRAMP complex is found only in the nucleus, while the exosome complex is found in both the cytoplasm and the nucleus; yet in GO both complexes had the same part_of relation to nucleus, with the exosome complex having a further part_of relation to cytoplasm. To remedy this situation, the scope of the part_of relationship was limited to the all-some relationship, and the graph was altered accordingly. Further, new relationship types have been introduced [18] to address other consistency issues with the use of part_of in GO:
• Regulation relationships
Before the introduction of the regulation relationships, all regulatory processes in GO were made part_of the processes they regulated. This was insufficient to capture the biology, because not all regulatory processes are integral to the processes they regulate. For example, a kinase that phosphorylates a transcription factor, and thereby controls its translocation from the cytoplasm to the nucleus, regulates transcription. However, the kinase is not part of the transcription machinery and thus does not itself play a direct role in the process of transcription.
• Has_part relationship
The use of the part_of relationship in the spliceosomal component terms led to a logical flaw in the ontology, sometimes referred to as a true path violation. The term "U5 snRNP" was a part_of descendant of both the term "major (U2-dependent) spliceosome" and the term "minor (U12-dependent) spliceosome". Most eukaryotic organisms have two forms of spliceosome, each of which contains five snRNP complexes: four that are unique to that type, and one that is found in both types. While it is true biologically that both the major (U2) and the minor (U12) forms of the spliceosome contain the U5 snRNP, no specific U5 snRNP complex is present in both forms of the spliceosome at the same time. Further, some organisms, e.g. S. cerevisiae, do not have the minor spliceosome. The "U5 snRNP" part_of "minor (U12-dependent) spliceosome" relationship leads to the conclusion that the products of S. cerevisiae genes annotated to the term "U5 snRNP" are present in the minor spliceosome, which is not true. During a major revision of the spliceosomal complex terms, new terms were added to represent the different spliceosomal complexes that are recognized during various stages of the spliceosomal assembly/disassembly cycle. The has_part relationship was then used to capture the biological relationships between some of the large spliceosomal complexes and the smaller snRNP complexes, thus more accurately describing relationships between complexes and subcomplexes in the cellular component ontology (and similarly in the biological process ontology) that were either misrepresented or not represented previously.
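The difference between the two relationship directions can be made concrete with a small Python sketch. It assumes a simplified propagation rule in which annotations travel along is_a and part_of edges but not along has_part edges; the `propagated` helper is hypothetical, and the "after" graph is a simplification, since the actual revision attached has_part edges to newly added assembly-stage complex terms rather than directly to the two spliceosome terms.

```python
# Relations along which an annotation is carried to related terms.
# has_part is excluded on purpose: "whole has_part part" states that every
# whole contains such a part, not that every part sits inside such a whole.
PROPAGATING = {"is_a", "part_of"}

U5 = "U5 snRNP"
MAJOR = "major (U2-dependent) spliceosome"
MINOR = "minor (U12-dependent) spliceosome"

# Edges are (subject, relation, object).
# Before the revision: part_of edges as described in the text.
before = [(U5, "part_of", MAJOR), (U5, "part_of", MINOR)]

# After the revision (simplified ASSUMPTION): has_part points from the
# whole to the part, so nothing is asserted about every U5 snRNP.
after = [(MAJOR, "has_part", U5), (MINOR, "has_part", U5)]


def propagated(term, edges):
    """Terms to which an annotation to `term` is propagated."""
    found, frontier = set(), [term]
    while frontier:
        current = frontier.pop()
        for subject, relation, obj in edges:
            if subject == current and relation in PROPAGATING and obj not in found:
                found.add(obj)
                frontier.append(obj)
    return found


# An S. cerevisiae gene product annotated to "U5 snRNP".
yeast_before = propagated(U5, before)  # wrongly includes the minor spliceosome
yeast_after = propagated(U5, after)    # no spliceosome membership is inferred
print(MINOR in yeast_before, MINOR in yeast_after)
```

Under the all-some reading, part_of makes a claim about every instance of the part, which is what produced the false inference for yeast; has_part makes its claim about every instance of the whole instead, so the same biological fact can be recorded without implying that every U5 snRNP resides in a minor spliceosome.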