A GO project ontology is represented by an directed acyclic graph (DAG), each term being a node in the graph. Each parent term can have multiple children, and each child term can have multiple parents. There is a single root term for each of the three ontologies, ‘molecular function’, ‘biological process’ and ‘cellular component’. Terms become more specific as one moves away from the root.
Often only the more specific GO terms are annotated in the association databases. For instance in the September 2013 release of the Entrez Gene database gene2go table [11] only 16 mouse genes are directly annotated with the cellular component term ‘sarcomere’. However a total of 81 additional genes are annotated with child terms of ‘sarcomere’ such as ‘A band’, ‘I band’ or ‘Z disc’. Absence of the higher level annotations means that in this case the ‘sarcomere’ gene set will be missing the vast majority of sarcomere related genes and the sensitivity of GSEA towards perturbations in sarcomere biology will be significantly reduced. Making a fully featured gene set collection in which these implicit but highly meaningful associations are captured requires that the GO DAG is used to annotate genes with these inferred higher level term associations. Thus both the association data and the rich structure of the ontology contribute to the extraction of meaning from the data. Gene annotation with these inferred associations is equivalent to propagation of the explicit term associations up the DAG towards the root. The original Broad Institute ‘c5’ collection was constructed in this way (A. Liberzon, personal communication). Other GO based gene set tools adopting a similar strategy include FatiGO [1], ErmineJ [3], GoParGenPy [12], GoStat [13] and DAVID [14]. Likewise GO2MSIG includes a propagation phase during set construction.
Figure 1 shows in detail the procedure adopted by GO2MSIG. The program flow can be broken down into 4 main phases. The first is obtaining the GO term hierarchy, the second is obtaining the gene associations, the third is the propagation of the associations towards the root term and the fourth is post-processing the resultant gene sets and formatting the output. Each phase consists of a number of discrete steps, shown in Figure 1 and described below in order of execution.
Obtaining and parsing the GO term hierarchy
GO2MSIG can obtain term information (the IDs, names and hierarchical relationships of the GO terms) from a standard GO term database (installed locally or remotely) or an OBO flat file. If used with a slow implementation of the GO term database GO2MSIG can cache the GO hierarchy in order to speed up future calls. The GO ontology is constantly evolving, hence the GO database contains information about GO terms now obsolete, and about terms which are synonymous with other terms that are now preferred. GO2MSIG parses the database to determine the list of current terms and their parent–child relationships, the list of obsolete terms and the list of terms with synonymous preferred terms.
Obtaining, parsing, filtering and mapping the GO term gene associations
GO2MSIG can obtain GO term gene associations from any of three sources: Affymetrix or Agilent array annotation files, the GO association database, or the NCBI curated Entrez Gene database gene2go table [11]. The Entrez Gene database has the advantage over the GO association database that it uses a consistent gene identifier (the Entrez Gene ID), but the disadvantage that it contains fewer gene associations. The Entrez Gene database is the source of the pre-built MSigDB GO sets for human.
The next step allows the user to filter the associations by evidence code. The association between GO terms and genes is accompanied by an evidence code which describes how strong the evidence is for that association. IEA (inferred from electronic annotation) for example means that the association has been inferred automatically, whereas TAS (traceable author statement) means that the association has been proven experimentally and that the assertion can be located in a paper. The original MSigDB human GO sets used a subset of evidence codes ensuring only well supported associations were used. The same defaults are used by GO2MSIG, but any combination can be specified by the user. For less well characterised species it can be that only IEA supported associations exist.
Following evidence filtering, GO2MSIG selects either the gene symbol or the Entrez Gene ID as specified by the user (for those association data sources where this is an option).
The next step allows the user to supply an optional file remapping the gene identifiers obtained from the association data source to user supplied identifiers. This may be necessary if the association data uses a different type of gene identifier to those present in the experimental results. One common use would be to map from probe IDs supplied in an array annotation file to gene IDs. The program can be set to either leave identifiers missing from the translation file unmapped, or to repress them. The former is useful if the user wishes to remap only a subset of gene identifiers, useful in those cases where the annotation data source is inconsistent in its identifier use. The latter can be useful when trying to extract gene sets for a single species from an Affymetrix annotation spreadsheet that contains more than one species.
Finally GO2MSIG remaps synonymous GO terms to the term preferred by the version of the GO hierarchy being used. Obsolete terms in the associations are ignored. Informational warnings are issued when synonymous or obsolete terms are encountered, and also when GO terms in the gene association data source are not represented in the GO term data source at all (possible if the term data source is older than the association data source).
Annotation of genes with higher level term associations
The annotation of genes with higher level term associations is performed by recursively propagating gene associations explicitly defined in the association database up the DAG towards the root. During this process the shortest path connecting each term to the root is recorded for later output.
Size filtering, duplicate removal and output
The first step after propagation is filtering the collection of sets by user selectable maximum and minimum size cut-offs. The minimum cut-off prevents the output of many very small gene sets that would never achieve statistical significance. The maximum cut-off removes GO terms that are so general that the results would be meaningless.
Gene set collections built from GO will likely contain multiple GO terms with identical gene associations. Since such duplicate gene sets can affect the accuracy of the key GSEA false discovery rate statistic, the next step is to collapse these down to a single set. The original MSigDB gene set collections eliminate identical terms leaving one representative. Elimination of identical terms without record can make inference harder. If, for example, a parent term ‘muscle system process’ has the same gene associations as its eliminated child term ‘muscle contraction’, then during analysis the user looking at the GO hierarchy will likely want to know whether the significant change identified for ‘muscle system process’ is a result of ‘muscle contraction’, or other child terms such as ‘muscle hypertrophy’. This is hard to do if the fact that the child term existed and is identical to the parent term is not recorded. GO2MSIG also records each unique gene set (by gene content) only once. However the description field in the output file will contain a list of all GO terms with that identical set of gene associations. The URL link field (which can only reference one term) contains a link to whichever of the GO terms has the shortest distance between it and the root term - in other words the most general of the terms associated with that gene set.
During analysis of the results from a GO based gene set analysis the experimenter is likely to want to home in on the more specific terms that show statistically significant changes. In this implementation a rough guide to specificity is provided by appending the distance of each term from the root (calculated during the propagation stage) to the end of the term description during the final output. This is analogous to the concept of ‘levels’ used by the functional enrichment analysis application FatiGO [1].
Finally the set is output in either .gmt or .gmx format, one being essentially the transpose of the other.