High-throughput experimentation, such as gene expression microarrays, next-generation sequencing or proteomics, enables the interrogation of many thousands, or even millions, of data points simultaneously. Comparison between these experiments (such as a phenotype and a control) enables identification of gene or protein sets of interest in a hypothesis-free manner. To stimulate the generation of testable, explanatory hypotheses for experimental validation from these sets of genes, researchers often apply Gene Set Enrichment Analysis (GSEA) or concept enrichment analysis using controlled vocabulary terms. Term enrichment analysis, which refers to the search for ontology terms that occur more frequently in a given gene list than in a background gene set, can be used to generate new scientific hypotheses. The Gene Ontology (GO) [2, 3], arguably the most commonly used ontology in basic research, consists of three non-overlapping controlled vocabularies that describe molecular functions, biological processes and cellular components. There are now more than 50 GO-based enrichment analysis tools available. Examples of such functional analysis tools are BiNGO and GOEAST, which rely solely on GO for their analyses. Other approaches, such as ClueGO, DAVID and GeneWeaver, incorporate a larger range of sources, such as disease ontologies, phenotype ontologies or common pathways. However, all of them rely on predefined gene annotations and are thus limited to biomedical domains that have curated annotations. Baumgartner et al. presented an analysis demonstrating that manually curated annotations can never keep pace with novel scientific discoveries, and argued that text-mining-based methods need to be adopted to keep pace with the rising volume of literature.
For example, a vast amount of established knowledge about genomes and proteomes is available through NCBI Entrez Gene and UniProt, but the concepts mentioned in the textual descriptions of genes and proteins in these resources are not used in any statistical enrichment analysis. We believe that a hybrid approach, which tests manually curated terms alongside concepts automatically recognized in curated text, will yield more hypotheses and therefore be more useful to the researcher.
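The term enrichment test described above is commonly computed as a one-sided hypergeometric test. The following is a minimal sketch under that assumption; it is not necessarily the statistic used by any of the tools named here, and the function names `hypergeom_enrichment_p` and `enrich` are hypothetical, introduced only for illustration.

```python
from math import comb


def hypergeom_enrichment_p(k, n, K, N):
    """Return P(X >= k) for X ~ Hypergeometric(N, K, n).

    N: number of genes in the background set
    K: number of background genes annotated with the term
    n: number of genes in the study list
    k: number of study-list genes annotated with the term
    """
    if not (0 <= K <= N and 0 <= n <= N and 0 <= k <= min(n, K)):
        raise ValueError("inconsistent counts for a hypergeometric test")
    total = comb(N, n)
    # Sum the probability mass of observing k or more annotated genes.
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(n, K) + 1))
    return tail / total


def enrich(study_genes, background_genes, term_to_genes):
    """Return {term: p-value} for every term seen in the study list.

    term_to_genes maps a term identifier to the genes annotated with it.
    Genes outside the background set are ignored.
    """
    background = set(background_genes)
    study = set(study_genes) & background
    p_values = {}
    for term, genes in term_to_genes.items():
        annotated = set(genes) & background
        overlap = len(annotated & study)
        if overlap == 0:
            continue  # a term absent from the study list cannot be enriched
        p_values[term] = hypergeom_enrichment_p(
            overlap, len(study), len(annotated), len(background)
        )
    return p_values
```

In practice the resulting p-values would also need a multiple-testing correction, since many terms are tested at once; that step is omitted from this sketch.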
Large-scale annotation of all the known genes and expressed proteins in an organism's genome is a complex and arduous task. To this end, the biological and medical communities have created and maintain discipline-specific structured ontologies that are suitable for gene or protein annotation. These ontologies are publicly available, for instance via the National Center for Biomedical Ontology [12, 13] or the EBI Ontology Lookup Service [14, 15], and provide valuable information about connections between different biological concepts. However, only a small fraction of them are used for gene and protein annotation, and therefore a relatively small number of annotations are actually available for use in enrichment analysis methods.
The quality of results from term enrichment analysis naturally depends on the quality of the annotations underlying the analysis. Ideally, therefore, term enrichment analysis should use only high-quality annotations, such as the human-curated annotations from Ingenuity Pathway Analysis (IPA) (http://www.ingenuity.com/) or a highly restricted subset of GO consisting of experimentally validated and published annotations. However, many genes do not have annotations of this quality, and the results of enrichment analysis can therefore be highly incomplete. At the other end of the spectrum, including automated annotations based on criteria such as computational prediction from sequence similarity would result in a richer but less accurate set of annotations, and hence in less reliable results from term enrichment analysis. In this paper, we propose a middle ground that combines high-quality human-curated gene descriptions with automated assignment of annotation terms based on those descriptions. We use the Stanford National Center for Biomedical Ontology (NCBO) Annotator, which provides annotations with terms from over 200 publicly available biomedical ontologies, to automatically annotate a gene or protein based on the corresponding Entrez Gene or UniProt textual description. The Annotator uses this text description as the basis for proposing ontology terms that could annotate the gene or protein.
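To make the idea of deriving annotations from a curated description concrete, the following toy sketch matches ontology term labels against a free-text description. It is only an illustration of the principle and is not the NCBO Annotator or its API; the function name `annotate_description`, the lexicon and the `EX:` term identifiers are all hypothetical.

```python
import re


def annotate_description(description, label_to_term_id):
    """Return the set of term identifiers whose label occurs in the text.

    Matching is case-insensitive and restricted to whole words or phrases.
    A real annotator would also handle synonyms, word variants and the
    ontology hierarchy, none of which is attempted here.
    """
    text = description.lower()
    found = set()
    for label, term_id in label_to_term_id.items():
        pattern = r"\b" + re.escape(label.lower()) + r"\b"
        if re.search(pattern, text):
            found.add(term_id)
    return found


# Hypothetical lexicon; the identifiers are placeholders, not real ontology IDs.
lexicon = {
    "apoptosis": "EX:0001",
    "vesicle transport": "EX:0002",
    "kinase": "EX:0003",
}
description = "Plays a role in apoptosis and vesicle transport."
terms = annotate_description(description, lexicon)
```

Terms obtained this way could then be merged with a gene's manually curated annotations before testing for enrichment, which is the hybrid setting proposed here.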
We find that automated annotations generated in this manner reliably recover the known annotations already present in the text record (such as GO terms or OMIM terms), and that we are able to annotate with a wide spectrum of concepts not available in any currently used ontology enrichment tool. Additionally, we are able to identify GO terms that are present in curated text but are not currently formally annotated to these genes or proteins, and many of these examples are bona fide annotations. Overall, our approach is able to annotate proteins with 524,304 terms from across 291 ontologies, and the vast majority of these terms are not part of the GO.
In the following, we demonstrate the advantages of using automatic annotations that are based on manually curated textual descriptions by extending our previous RANSUM approach to enable analysis of gene and protein concepts. We first describe the STOP workflow, which allows a researcher to perform fast and easy statistical analyses of gene sets using up-to-date information on genes and proteins from human and the most widely used model organisms. We then demonstrate that automatically derived annotations contain valuable information that is not currently present in the GO, without diminishing the value of manually curated GO enrichment analysis. To this end, we compare our annotations against GO and highlight examples of gene-to-term annotations that are likely to be correct but are not present in official GO annotations. Finally, we describe two use cases: (1) proteins that are direct interaction partners of the huntingtin protein and (2) known Parkinson's disease genes. We use these sets of proteins to demonstrate how STOP can reveal interesting enriched concepts that improve the understanding of functional traits implied by gene sets.