We present here an approach for generating domain-specific subsets of GO, via an editing method we call 'clipping'. We show that the use of a clipped subset of GO can improve functional analysis of microarray data relevant only to the domain of the clipped ontology. We present NIGO, a clipped subset of GO directed at nervous and immune systems. Evidence that the immune system affects neuronal processes, such as neuronal maintenance and repair, is accumulating. It was recently shown, for example, that mice lacking both T- and B-cell populations (Severe Combined Immune Deficiency, SCID) show impairment in neural precursor cell proliferation and differentiation into mature neurons [17]. We thus chose to create a neural and immune subset of the Gene Ontology since we were specifically interested in annotating microarray experiments which link the two systems. Although the subset was designed for the annotation of such experiments, NIGO is also useful for the annotation of expression studies which investigate only one of these two biological systems. We show that NIGO outperforms the full GO or a generic GO-slim in finding relevant terms that are enriched in genes with varied expression in these systems. NIGO revealed GO terms, from all hierarchical levels of the ontology, that were relevant to the experimental system used to create the microarray data and that did not pass the statistical cutoff when conducting the analysis using the full GO. In addition, NIGO improved the statistical scores assigned to many neural/immune-relevant GO terms that passed the statistical cutoff for the full GO.
It is important to stress the fundamental difference between a clipped subset of GO and a slimmed subset. While a slimmed subset will also achieve improvement of statistical scores assigned to the enriched terms, these terms will be general top-level terms that will reveal a lot about the nature of the biological differences between two sets of samples, but little regarding the specific responses. The power that lies in a clipped subset of GO is not only in improvement of statistical scores but also in enrichment of terms from all hierarchical levels, which reveals more about the biology underlying the study. In this work, we used a generic slim subset which contains 152 top-level GO terms. While a neural/immune-specific GO slim could have performed better than a generic slim, it would most likely not include specific terms, such as 'Antigen processing and presentation of exogenous peptide antigen via MHC class I' and 'Positive regulation of T cell mediated cytotoxicity', which are located 6 and 7 steps away from the root, respectively. Such terms proved to be important for the interpretation of a relevant microarray dataset (see Table 1) and thus demonstrate how a clipped subset of GO could outperform even a specific slimmed subset.
In one study, the use of NIGO actually allows the formulation of hypotheses that are otherwise missed when using the full GO. In studying the effect of chronic Fluoxetine treatment on hippocampal gene expression, Miller et al. compared expression patterns in the hippocampi of mice with or without treatment with the antidepressant Fluoxetine for 21 days [10]. NIGO revealed several immune-related terms that were not identified with the full GO. These include GO terms such as 'antigen processing and presentation of peptide antigen via MHC class I', 'MHC class I protein complex', 'positive regulation of T cell mediated cytotoxicity', and 'defense response'. Furthermore, this group of terms was not found to be significantly enriched in the original analysis of this dataset.
The hypothesis that can be derived from this finding, namely that treatment with Fluoxetine alters immune-related processes in the brain via the MHC class I pathway, is in agreement with previous knowledge. Chronic stress and depression are widely known to down-regulate the immune system and several lines of evidence indicate that some antidepressants can reverse this impairment by producing various immunomodulatory effects [18–20]. Interestingly, it was shown that uptake of serotonin (5-hydroxytryptamine, 5-HT) is impaired by Fluoxetine, a process which may interfere with mechanisms of immune regulation [21]. Moreover, rats treated with Fluoxetine demonstrated reduced CD4+ cell number, increased number of CD8+ cells and elevated levels of cytokines such as IL4 and IL2 in vitro [22]. The NIGO-based finding thus coincides with the observed increase of CD8+ cells since the term 'positive regulation of T cell mediated cytotoxicity' was found by NIGO to be enriched in the dataset. Furthermore, most cytotoxic T cells express T-cell receptors (TCRs) that recognize a specific antigenic peptide bound to class I MHC molecules. Accordingly, GO terms related to MHC class I antigen presentation were also found by NIGO to be enriched in this dataset. This line of evidence suggests that Fluoxetine, and possibly other antidepressants, exert their effects, at least partially, via modulation of CD8+ T cell activity. These results further demonstrate that analysis with NIGO can enhance interpretation of functional analysis results produced for relevant microarray datasets.
In the analysis of the GSE6509 expression dataset, three relevant terms passed the statistical cutoff with NIGO but not with the full GO. These terms, 'viral envelope', 'viral infectious cycle' and 'viral capsid', are all related to mouse genes involved in viral infection. It was previously shown that for this dataset, clustering of the GO nodes revealed that the GO term 'response to virus' was enriched in genes down-regulated by LPS treatment [11]. Several lines of evidence show a connection between Mifepristone and viral infection. For example, it was shown that Mifepristone can increase target cell sensitivity to retroviral infection [23]. Though the analysis of this dataset with NIGO did not reveal any new biological knowledge, it made detecting and interpreting the terms related to 'response to virus' much easier.
In several cases, terms were detected by the full GO but did not pass the same cutoff when conducting functional analysis with NIGO, even though these terms were included in the NIGO subset. Such terms received FDR or p-values that were very close to (but larger than) the cutoff values used. This is partially explained by the stochastic nature of the GSEA algorithm. Indeed, for one set (GSE8788), we compared the raw results with averaged FDR values. Averaging dramatically decreased the number of such terms. Furthermore, we conducted a similar functional analysis using the Fisher Exact Test and found that although this method includes no stochastic element, several of the terms that were found by both the full GO and NIGO (in the analysis of two non-neural/immune related datasets) had lower adjusted p-values when conducting the analysis with the full GO (see the Results section).
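The interplay between a term's raw enrichment p-value and the number of terms tested can be illustrated with a minimal sketch. It assumes a one-sided Fisher exact test (the upper tail of the hypergeometric distribution) and the Benjamini-Hochberg adjustment; the correction method, the function names and all numbers are assumptions made for illustration, not the settings used in this study.

```python
from math import comb


def enrichment_p_value(k, n, K, N):
    """One-sided Fisher exact (hypergeometric) p-value, P(X >= k).

    N: genes in the background, K: background genes annotated to the term,
    n: genes in the study list, k: study genes annotated to the term.
    """
    total = comb(N, n)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(n, K) + 1))
    return tail / total


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Toy numbers (hypothetical): one term of interest plus unrelated terms.
p_term = enrichment_p_value(k=8, n=100, K=40, N=5000)
unrelated = [0.2, 0.4, 0.6, 0.8]
adj_small = benjamini_hochberg([p_term] + unrelated)[0]       # 5 terms tested
adj_large = benjamini_hochberg([p_term] + unrelated * 50)[0]  # 201 terms tested
print(p_term, adj_small, adj_large)
```

The sketch holds the raw p-value fixed and varies only the number of tested terms. In a real analysis the annotated gene background differs between the full and the clipped ontology as well, so adjusted values can move in either direction, as observed above for some terms.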
Out of nine neural/immune-related expression datasets used to evaluate the performance of NIGO in comparison to the full GO, NIGO revealed terms that did not pass statistical cutoff with the full GO for five. For two of the four datasets in which NIGO did not reveal new terms, this ontology improved the FDR values for over half of the terms that passed the statistical cutoff with the full GO. Hence, for seven of the nine datasets used (approximately 78%), NIGO outperformed the full GO either by revealing new terms or by improving FDR values for more than 50% of the terms that passed the statistical cutoff with both ontologies. For microarray studies unrelated to the neural or immune systems, on the other hand, NIGO did not outperform the full GO or the generic GO slim. These results are in good agreement with the design and purpose of NIGO.
Alternative approaches may lead to the creation of domain-specific subsets from the GO. One approach involves the selection of terms that are descendants of the GO term or terms that are most pertinent to that domain. In the case of NIGO, this would mean choosing all those terms that are descendants of the terms 'immune system process' (GO:0002376) or 'neurological system process' (GO:0050877). We believe that this approach would lead to many relevant terms being omitted, since not all pertinent terms are necessarily defined in GO as a neural or immune system process/function. For example, the term 'muscle hypertrophy' (GO:0014896) is not defined in GO as a neurological system process but was found to be linked to the neurological system by the UMLStermFinder (Figure 4B). It is highly likely that not all of the information linking biological processes to these systems has been incorporated into the Gene Ontology. It is our opinion that by adding knowledge from the literature and from other biomedical ontologies (included in UMLS), terms that are not directly associated with these biological systems can still be included in NIGO, or any other domain-specific subset. Another possible approach is to create a domain-specific slim. While this approach will probably create a highly compact ontology, setting some threshold of abstraction, which is the declared purpose of GO slimming, would necessarily lead to loss of information. In this study, for example, it is hard to imagine a GO slim that would go so far down the GO graph as to include the terms 'Antigen processing and presentation of exogenous peptide antigen via MHC class I' and 'Positive regulation of T cell mediated cytotoxicity' without defeating the purpose of the slimming process. Yet these two terms were found to be enriched in GSE6476, and are crucial for generating a hypothesis based on the expression profile.
This shows that GO slims may be complemented by small, yet fully detailed domain-specific subsets of GO.
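The descendant-based alternative discussed above amounts to a graph traversal from the chosen root terms. The following minimal sketch runs on a toy graph: the two root identifiers and GO:0014896 are taken from the text, while every `TOY:` identifier and the placement of GO:0014896 under `TOY:other_process` are hypothetical stand-ins, reflecting only the statement that this term is not defined as a neurological system process.

```python
from collections import deque


def descendants(children, roots):
    """Collect every term reachable from `roots` in a parent -> children map."""
    seen = set()
    queue = deque(roots)
    while queue:
        term = queue.popleft()
        for child in children.get(term, ()):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


# Toy graph (hypothetical structure, not the real GO hierarchy).
toy_children = {
    "GO:0002376": ["TOY:immune_child"],
    "GO:0050877": ["TOY:neuro_child"],
    "TOY:neuro_child": ["TOY:neuro_grandchild"],
    "TOY:other_process": ["GO:0014896"],
}

selected = descendants(toy_children, ["GO:0002376", "GO:0050877"])
print(sorted(selected))
print("GO:0014896" in selected)  # the term outside both branches is not selected
```

A term that sits outside both branches is never reached, however relevant it may be to the domain, which is the limitation that the literature- and UMLS-based filters are meant to address.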
NIGO is obviously not a perfect representation of all knowledge related to neural/immune-related gene function. Due to the massive use of human-based curation of GO terms used for the production of NIGO, wrong judgment calls have most likely led to some erroneous inclusion or exclusion of terms. The use of the core gene set (filter 2) may lead to errors of commission due to the multifunctional nature of genes. Thus, the use of the subset in analyzing high-throughput experiments may lead to some loss of information and may include irrelevant terms. Errors of commission lead only to a slight degradation of the statistical power of the analysis as long as the fraction of falsely included terms is small. Since the results must be interpreted by someone versed in the domain to be useful, the inclusion of a small fraction of non-specific terms would do little to degrade the usefulness of the results. Errors of omission, on the other hand, may more significantly degrade the results. In theory, it is sufficient that one important term be missing to blind the interpreter from seeing the true biological meaning of the results. Nevertheless, our analysis of actual microarray data demonstrates that even in its current form NIGO allowed interesting enrichments to be identified that were otherwise missed. However, to overcome the impact of errors of omission, at least in part, we recommend that in addition to using NIGO (or other subsets generated in a similar fashion), one should conduct a parallel analysis using the full GO. This ensures that the domain-specific GO subsets can do no harm: interpreting the results using both the full and the clipped GO cannot be less informative than interpreting the full GO alone.
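The recommended parallel analysis can be summarized by merging the two result lists and recording which ontology supported each term. The sketch below is illustrative only: the function name, the 0.05 cutoff and the FDR values are hypothetical, and the term names are reused from the text purely as labels.

```python
def merge_enrichment(full_go, clipped, cutoff=0.05):
    """Map each significant term to the ontologies in which it passed the cutoff.

    `full_go` and `clipped` map term names to FDR values; `cutoff` is an
    assumed threshold, not the one used in this study.
    """
    merged = {}
    for label, results in (("full", full_go), ("clipped", clipped)):
        for term, fdr in results.items():
            if fdr <= cutoff:
                merged.setdefault(term, []).append(label)
    return merged


# Hypothetical FDR values for illustration only.
full_results = {"defense response": 0.20, "response to virus": 0.01}
clipped_results = {"defense response": 0.03, "response to virus": 0.02}

combined = merge_enrichment(full_results, clipped_results)
print(combined)
```

Because the merged report is a superset of the terms found with the full GO alone, a term omitted from the clipped subset is still reported whenever it passes the cutoff in the full analysis.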
The approach we present here for constructing domain-specific clipped subsets of GO can be applied to other fields. It is possible, for example, to divide NIGO into immune-specific and neural-specific sets. It should also be possible to generate specific subsets that are relevant to other domains, although significant work is involved in the process, especially in the step involving a domain expert's review of a significant fraction of all GO terms. It is possible that this step can be largely replaced by automatic procedures, or that the pipeline described here can be improved to reduce the number of decisions that require an expert's opinion. Furthermore, the five-step filter used to decide upon NIGO inclusion/exclusion could be altered to include more steps, such as searches of other knowledge sources. Improved automation, or alternatively a major community effort, could lead to the creation of a library of domain-specific clipped GO subsets, which could, in turn, enhance the interpretability of many microarray experiments. Improved automation could, in principle, be obtained by reversing the logic of our NIGO evaluation: domain-related GO terms could be picked by examining the GO terms that are associated with expression profiles deemed by domain experts to be related to that domain.
Another challenge in developing high quality GO subsets is their maintenance. The Gene Ontology is constantly growing, with new terms and annotations to genes being regularly added. It is thus important to continually update GO subsets, such as NIGO. NIGO could be updated by periodically reviewing new terms and annotations that have been added to the GO, subjecting them to the filter system developed to find relevance of terms to the neural/immune systems and adding the relevant terms, along with their parental terms, to the subset. Assuming no dramatic increase in the rate of growth of the GO, this can be achieved with modest effort as the number of new terms is much smaller than the number of existing terms.
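The update procedure described above can be sketched as follows, assuming the ontology is available as a mapping from each term to its parent terms. The `is_relevant` callback is a hypothetical stand-in for the filter system and expert review, and all `TOY:` identifiers are invented for illustration.

```python
def ancestors(parents, term):
    """Return every ancestor of `term` in a child -> parents mapping."""
    found = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in found:
                found.add(parent)
                stack.append(parent)
    return found


def update_subset(subset, new_terms, parents, is_relevant):
    """Add each relevant new term, together with its parental terms."""
    updated = set(subset)
    for term in new_terms:
        if is_relevant(term):
            updated.add(term)
            updated |= ancestors(parents, term)
    return updated


# Toy data (hypothetical identifiers).
toy_parents = {
    "TOY:new_leaf": ["TOY:mid"],
    "TOY:mid": ["TOY:root"],
    "TOY:new_unrelated": ["TOY:root"],
}
current_subset = {"TOY:root"}
new_subset = update_subset(
    current_subset,
    ["TOY:new_leaf", "TOY:new_unrelated"],
    toy_parents,
    is_relevant=lambda term: term == "TOY:new_leaf",
)
print(sorted(new_subset))
```

Adding the parental terms along with each new term keeps every included term connected to the root, so the updated subset remains a valid sub-graph of the ontology.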
Further research into the automation of domain-specific ontology clipping and/or community efforts may lead to the emergence of multiple domain-specific derivatives of GO that will improve the interpretability of high-throughput gene-related analyses.