GOexpress: an R/Bioconductor package for the identification and visualisation of robust gene ontology signatures through supervised learning of gene expression data

Background Identification of gene expression profiles that differentiate experimental groups is critical for discovery and analysis of key molecular pathways and also for selection of robust diagnostic or prognostic biomarkers. While integration of differential expression statistics has been used to refine gene set enrichment analyses, such approaches are typically limited to single gene lists resulting from simple two-group comparisons or time-series analyses. In contrast, functional class scoring and machine learning approaches provide powerful alternative methods to leverage molecular measurements for pathway analyses, and to compare continuous and multi-level categorical factors. Results We introduce GOexpress, a software package for scoring and summarising the capacity of gene ontology features to simultaneously classify samples from multiple experimental groups. GOexpress integrates normalised gene expression data (e.g., from microarray and RNA-seq experiments) and phenotypic information of individual samples with gene ontology annotations to derive a ranking of genes and gene ontology terms using a supervised learning approach. The default random forest algorithm allows interactions between all experimental factors, and competitive scoring of expressed genes to evaluate their relative importance in classifying predefined groups of samples. Conclusions GOexpress enables rapid identification and visualisation of ontology-related gene panels that robustly classify groups of samples and supports both categorical (e.g., infection status, treatment) and continuous (e.g., time-series, drug concentrations) experimental factors. The use of standard Bioconductor extension packages and publicly available gene ontology annotations facilitates straightforward integration of GOexpress within existing computational biology pipelines. Electronic supplementary material The online version of this article (doi:10.1186/s12859-016-0971-3) contains supplementary material, which is available to authorized users.


Background
Following the rapid decrease in the cost of highthroughput sequencing and the standardisation of analytical pipelines for microarray data, complex multifactorial experimental designs have become commonplace in many research fields. Many different methodologies have been proposed to address data summarisation and visualisation at the pathway level [1][2][3]. Gene ontology (GO) is one of the most robust and widely used resources to categorise biological entities into functionally related groups [4][5][6]. The most common system biology techniques currently used to extract biological knowledge from transcriptomics data sets often apply gene-set enrichment analysis (GSEA) on gene lists resulting from differential expression outputs. However, this approach has two limitations: (1) GSEA approaches are typically limited to the analysis of a single gene list resulting from a two-group comparison, and (2) gene expression estimates in the individual replicates are lost in differential expression statistics, potentially obscuring outliers. The first limitation can be circumvented by combining multiple comparisons into a summary statistic assigned to each gene feature (e.g., time-series). To the best of our knowledge no publicly available tool addresses the second limitation and provides a simple interface to access and visualise individual gene expression profiles following the identification of relevant genes and molecular pathways. The GOexpress software package described here provides a number of functions for visualisation of gene expression data from multifactorial experimental designs, both as individual gene profiles or summarised as functionally-related gene sets. Additionally, the package facilitates the use of supervised classification or parametric analysis of variance, which complement existing approaches to identify gene features that best discriminate multiple groups of samples. Indeed, while parametric differential expression approaches are widely used to identify significant changes in expression levels, non-parametric supervised learning and classification methods represent a valuable alternative strategy to identify modest yet consistent differences, even between limited numbers of replicates [7,8].
Although the underlying technologies are very different, RNA-sequencing (RNA-seq) and microarray transcriptomic approaches both yield quantitative expression levels for each gene in each sample. Typically, this expression matrix is filtered to retain only 'informative' genes (e.g., > 1 read count per million [CPM] in at least n biological replicates for RNA-seq). In addition, the removal of genes expressed at very low levels is normally performed to minimise stochastic background expression at the lower ends of the dynamic ranges for both RNA-seq and microarray technologies [9][10][11]. Traditionally, differential expression analysis of transcriptomics data sets has been performed using parametric statistical methods such as edgeR [12] or limma [13]. However, non-parametric methods using bootstrapping (e.g., IsoDE) or non-parametric empirical Bayesian-based approaches (e.g., GFOLD) have been shown to perform equally well or better using transcriptomics data sets with few or no replicates to produce biologically meaningful rankings of differentially expressed genes [14,15].
The non-parametric random forest (RF) algorithm has been shown to perform comparably or better than other methods for both microarray and RNA-seq data sets [16,17]. It shows excellent performance even with high levels of noise; it is a powerful method for feature selection (e.g., identification of biomarkers); it can be used when the number of variables is significantly larger than the number of observations; and for data sets involving two or more experimental groups [18][19][20]. Moreover, in contrast to parametric approaches intended to optimally detect differences in mean expression among groups, the non-parametric RF algorithm is assumption-free regarding parameters of the distributions underlying gene expression patterns (i.e., mean, variance), facilitating detection of more subtle associations between gene expression levels and phenotypes [21].
In this paper we extend the RF approach for gene feature scoring by integrating a supervised RF analysis with a subsequent summarisation at the GO level, to identify robust panels of functionally related genes that best classify multiple sample groups. This multi-classifier fusion approach differs from other functional class scoring (FCS) methods: it combines classification of multiple groups with ranking information to identify GO terms that best classify samples, and estimates probability of GO ranking through permutation-based P-values. It is important to emphasise that GOexpress is not a conventional GSEA tool; its purpose is to use supervised learning to prioritize gene sets within GO functional classes that best classify samples according to their known experimental labels. In addition, the non-parametric RF algorithm has many desirable features, including considerable robustness to outliers, and absence of overfitting [21,22]. The software implementation provides a range of visualisation functions, and seamless integration with the R/Bioconductor framework. It also reduces the burden of programming for non-expert users, while providing a route to more advanced applications (e.g., R/Shiny).
To demonstrate the capabilities of the GOexpress package we use a previously analysed and published multifactorial microarray gene expression data set [23]. A separate demonstration RNA-seq data set is also included with the package and the package vignette guides users through the implementation of the corresponding analysis workflow, which is essentially identical to that of microarray data provided in Additional file 1.

General aspects and design
GOexpress is written entirely in the R programming language [24] and relies on several other widely used R packages available from Bioconductor [25,26] (biomaRt [27,28]) and CRAN packages (ggplot2 [29], randomForest [30], RColorBrewer, stringr, VennDiagram). The entry point for the package is the function GO_analyse, which processes the input gene expression data and returns the resulting scoring tables and annotations required for all downstream filtering and visualisation functions (Fig. 1). Critically, GOexpress does not transform the input expression data in any way. For the input ExpressionSet, GOexpress is designed to use normalised expression data pre-processed using widely available dedicated software packages; for example, log 2 -transformed microarray intensities normalised using the farms package or log 2 -transformed counts per million obtained using the edgeR package for RNA-seq data. Conveniently, the function allows analyses based on a subset of samples defined on-the-fly, without the need to manually create multiple input ExpressionSet objects. The procedure performed by the GO_analyse function comprises several steps: (1) collection of gene and gene ontology annotations, (2) evaluation and ranking of genes by capacity to classify groups of samples, (3) summarisation of classification power at the GO level, and (4) formatting of resulting statistics into a structured list returned to the user (Fig. 1). These steps are described in more detail in the following sections.

Semi-automated annotation of input gene expression data
The GO_analyse function requires a minimum of two mandatory user inputs. The package source archive includes an example of each input and output (Additional file 2). Firstly, the function expects pre-processed expression data (i.e., filtered and normalised), and associated sample phenotypic information loaded in the assayData and phenoData containers of an ExpressionSet object [Bioconductor Biobase package] (Fig. 1). This standardised format ensures interoperability with other Bioconductor packages, and simplifies data handling. The second mandatory user input is the name of an experimental factor-with two or more levels-present in phenoData. The function will consequently estimate the capacity of each gene and GO term to classify groups of samples associated with different levels of that experimental factor based on the provided expression data.
In many cases these two arguments are sufficient due to retrieval of gene and GO annotations from the current Ensembl release using the Bioconductor bio-maRt package. However, it is strongly recommended to generate a local copy of all annotations for two key reasons: (1) to ensure traceability and reproducibility of results, even when new Ensembl annotations are released; and (2) to avoid multiple calls to the web Ensembl BioMart API, saving significant runtime during the execution of the function. Additionally, custom annotations may also be provided to analyse datasets using gene feature identifiers not currently supported (Fig. 1).
The term "gene feature" will henceforth refer to either microarray probeset identifiers or Ensembl gene identifiers, the two types of feature identifiers supported by the automated annotation procedure. Custom annotations are imported in three independent data frames: (1) a two-column table that maps gene features to GO terms, (2) individual gene annotations that include the gene ID and associated gene name with an optional short description, and (3) individual GO annotations that include the corresponding name (e.g., "catalytic activity") and namespace (e.g., "molecular function"). The mapping table must also include genes absent from the expression data set such that all known annotated genes are used as a background set for the scoring of individual genes and GO terms. Genes present in the annotations, but absent from the expression data set, will be assigned a score of 0 and a corresponding rank equal to the number of genes in the expression data set plus one, Fig. 1 Overview of the GOexpress workflow. A typical GOexpress analysis takes as input: an ExpressionSet of the Biobase Bioconductor package containing either microarray or RNA-seq normalised expression data; the name of an experimental factor present in the phenoData slot of the ExpressionSet; and annotations for the features and GO terms (or other functional classes) considered. The GO_analyse function calculates scores and ranks for the individual genes and GO terms. Optionally, the pValue_GO function randomly permutes the gene features to estimate the probability of each GO term to rank (or score) higher by chance. Finally, various functions allow visualisation of gene expression profiles by gene and gene ontology, and export of the calculated statistics in text files impacting the subsequent scoring and ranking of their associated GO terms, if any. This choice is further discussed below for the scoring of GO terms.

Scoring of genes using expression data
Currently, the RF algorithm is used as the default method to answer the question: "How well does each gene feature in the expression data set discriminate between multiple groups of samples?" The RF algorithm consists of multiple decision trees; each internal node in each classification tree is built based on a different bootstrap sample (with replacement) of observations (i.e., biological samples) and a random sample of variables (i.e., gene features). The randomForest package first calculates the Gini index [31] for each node in the tree, where the Gini index is a measure of homogeneity from 0 (homogeneous) to 1 (heterogeneous). The decrease in Gini index resulting from a split on a variable is then calculated for each node, and averaged for each variable over all the trees in the model. The variable with the largest mean decrease in the Gini index is then considered the most important. Technically, GOexpress uses the mean decrease in Gini index as the score for each gene feature in the expression data set.
A key feature of the RF approach is the implicit handling of interactions between genes. In order to provide a robust solution, each tree in the random forest is built on a bootstrap sample of observations. As the trees are grown, a random sample of genes is selected for each internal node and these genes are tested for their individual capacity to improve the partitioning reached in the previous node. The larger the number of trees built, the more complete the coverage of interactions between gene features.

Summarisation and ranking of GO terms
The GO_analyse function scores each GO term by aggregating the rank (or alternatively, the score) for all annotated genes associated with the term obtained in the previous step (Fig. 2, Additional file 3).
All genes annotated to GO terms but absent from the expression data set are considered to have no classification power, and are assigned a score of 0 and the worst rank preserving the continuity of ranking. This procedure is particularly suited to transcriptomics data sets where uninformative gene features (e.g., below a detection threshold) are filtered prior to the analysis. Indeed, during the summarisation step, these uninformative genes negatively impact the ranking of corresponding GO terms. Consequently, this summarisation approach implicitly favours GO terms over-represented in the data set, in addition to integrating the ranking information for genes present in the ExpressionSet.

Formatting of results for downstream analyses
The final step of the GO_analyse function is to assemble the resulting tables of ranked genes and GO terms with important analysis parameters for traceability and reproducibility. These include the subset of samples used for the analysis (if applicable), the number of decision trees and the number of gene features sampled (if the default RF algorithm was used), and the summarisation function used to score each GO term based on the scores of all their associated gene features. Alternatively, one-way analysis of variance (ANOVA) is available as a parametric statistical scoring approach.

Sample data
We demonstrate the use of GOexpress using microarray data from a previously published and well-characterised multifactorial microarray gene expression data set shown on the right. The ranks of all genes associated with the 1stand 55th-ranked GO terms are shown, following filtering for only molecular function GO terms associated with at least 15 genes in the annotations. Notably, eight and 13 genes associated with the GO terms chemokine activity and kinase binding are absent from the sample ExpressionSet and ranked last [23,[32][33][34][35][36]. In this example, monocyte-derived macrophage (MDM) transcriptomes from five different cattle were examined over three time points spanning a 24-hour time course, following one of each of the following four treatments: (1) infection with Mycobacterium bovis, (2) infection with M. bovis BCG, (3) infection with M. avium subsp. paratuberculosis, and (4) culture media for noninfected controls [33]. The ExpressionSet used here was obtained following normalisation and filtering of informative probesets using the Bioconductor farms package [37], leaving 11,842 probesets measured in 65 samples (Additional file 4).

Probeset-level classification and visualisation
In the example below, we use GOexpress to identify and visualise the genes and GO terms that best classify samples subjected to the four different treatments across all time points post-infection. Given that the ExpressionSet also includes control samples prior to infection, we use the subset argument of the GO_analyse function to consider only the samples post-infection (i.e., 2, 6, and 24 h post-infection). We also use local versions of genomewide annotations for the Affymetrix® GeneChip® Bovine Genome Array downloaded from the Ensembl release 77 (Additional file 1).
In the resulting object, the table of ranked genes (Additional file 5) demonstrates that the probeset Bt.552.1.S1_at (gene name CCL5), best classifies the samples according to treatment group. It is also possible to use the expression_plot and expression_profiles family of functions to visualise group trends and individual gene expression profiles for the four treatment groups (Fig. 3).
To assess the performance of the RF algorithm we compared the importance score-i.e., the decrease in the Gini index-to the F ratio obtained using the one-way ANOVA method also implemented in GOexpress, considering the same four treatment groups (Additional file 6). The two methods show good agreement, with a positive Pearson correlation coefficient of 0.69 (P-value < 2.2 × 10 −16 ). In particular, the best (i.e., top ranked) classifiers identified by the RF tend to display a high variance among groups relative to the variance within groups. Conversely, poor classifiers identified by the RF generally display a low F ratio. Notably, the RF produces generally more conservative results; indeed, a number of features identified as significant using the one-way ANOVA (FDR < 0.05) show little or no classification power (Additional file 6).

Ontology-level summarisation Permutation-based P-value
From an ontology perspective, the basic GO results present two limitations. Firstly, all three types of ontology-biological process, molecular function, and cellular component-are all merged in a single table. Secondly, and most importantly, GO terms associated with fewer genes are favoured at the top of the ranking. This happens because the highest possible average rank for a group of five genes is 3, as opposed to 3,000 for a group of 6,000 genes. Additionally, GO terms associated with small numbers of genes are more susceptible to outliers and single gene effects in the expression data; this problem is not normally observed for GO terms associated with larger numbers of genes. Moreover, when small GO terms are filtered out, this scoring method emphasises specific and well-defined GO terms (e.g., GO:0070233: negative regulation of T cell apoptotic process), as opposed to the broad higherlevel and generally less informative GO terms (e.g., GO:0005515: protein binding). Importantly, the effect of gene ontology size is an acknowledged issue of pathway analysis, most methods focusing the analysis on pathways that pass specific size thresholds [38].
To assess the probability of GO term ranking, GOexpress includes the function pValue_GO that randomises the gene feature ranking table, and produces a permutation-derived P-value that indicates the proportion of permutations where each GO term is ranked equal or higher relative to the original result (Additional file 7). In other words, these P-values directly translate into the estimated probability of each set of functionally-related genes reaching their calculated average rank by chance. In addition, it is important to note that P-values obtained for each GO term in this manner are clearly non-independent, due to the competitive ranking of genes and subsequent GO term summarisation. Therefore, P-value adjustment for multiple testing is not recommended in this case.
To assess the performance of the GO summarisation step, we compared our results to those obtained using the widely used GSA package [39] (Additional file 8). In a similar fashion to GOexpress, GSA determines the significance of pre-defined sets of genes with respect to an outcome variable, such as a multiclass group indicator. The "maxmean" test implemented in GSA was designed to detect unusually large and coordinated gene scores indicative of differentially expressed gene sets; significant gene sets are identified as those where most of the genes show either higher or lower expression correlating with the group indicator. In contrast, the RF approach implemented in GOexpress identifies gene sets containing both upand down-regulated genes that, together, contribute to improve the classification of samples into their respective phenotypic groups. Notably, the supervised classification approach implemented in the RF algorithm was previously shown to have superior power to detect association of gene expression level with phenotype relative to a traditional Significance Analysis of Microarray extended to Gene-Set analyses (SAM-GS) in the presence of correlations between gene expression profiles, with similar performance in the absence of correlated gene expression [21].
Comparison of the average-rank and permutationbased P-value implemented in GOexpress to the GSA approach, revealed that four of the seven GO terms identified by GSA (FDR < 0.05) were also found to be significant (P-value < 0.05) using GOexpress. Interestingly, all seven GO terms are biological processes. In contrast, GOexpress returned an additional 18  Notably, the largest average-score metric available in GOexpress also emphasises functional categories that are biologically highly relevant, including chemoattractant activity such as cellular response to interleukin-1 (BP) and chemoattractant activity (MF) [Additional file 8]. Taken together, those results indicate that GOexpress detects additional functional categories capable of improving the classification of samples, while GSA may be restricted yet more sensitive in detecting coordinated expression changes within gene sets [39].

Filtering and visualisation
Using the subset_scores filtering function, users may filter GO terms according to domain (i.e., biological process, molecular function or cellular component), and/or minimal count of annotated gene features. In the example shown below, only GO terms associated with at least 15 genes, and an empirical P-value ≤ 0.05 are retained.
The resulting object is a list with the same structure as the input object and an additional element stating the filtering criteria applied. In this example, the molecular function chemokine activity (GO:0008009), which is associated with 35 genes in the annotations-27 of which are present in the ExpressionSet-is ranked as the top GO term that best classifies the four treatment groups across all time points (Additional file 7). Those results are consistent with our previously published finding highlighting the role of chemokine signalling and communication between innate and adaptive immune cells in the differential response to virulent and attenuated mycobacterial infections [23]. Following this, using the heatmap_GO function, it is possible to visualise the expression level of all genes associated with the GO term in each sample, as well as the hierarchical clustering of samples and probesets resulting from the corresponding expression profiles (Fig. 4). Additionally, the table_genes function allows users to export individual scores, ranks and annotations for genes associated with a particular GO term (Table 1).
Finally, the objects and functions available in GOexpress may be readily integrated within R/Shiny applications (http://shiny.rstudio.com), offering a dynamic interface to rapidly access the various plots and tables with minimal additional programming (Fig. 5). Shiny applications may be distributed as ZIP archives or hosted on web servers, providing a flexible interface for collaboration GO:0008009 chemokine activity

Discussion
The widespread adoption of microarray and more recently RNA-seq for gene expression analysis has witnessed parallel development of a large ecosystem of methodologies and software packages, all designed to extract biological knowledge from increasingly complex experimental data.
Traditional GSEA methods simply use two gene lists: a target set and a background set, such that the background set is used to detect over-representation of molecular pathways in the target set (e.g., GOrilla [40], GOstats [41], GOseq [42]). More recently, integration of information from differential expression and differential splicing has been proposed to weight genes (e.g., seqGSEA [43]). However, these tools are generally limited to the analysis of a single list of target genes identified from differential expression analysis in a two-group comparison, or alternatively require summarisation to merge multiple gene lists. Time-series analyses can address this limitation through characterisation of correlated differential gene expression profiles, which can be used for GO term enrichment (e.g., STEM [44]). Although the STEM program represents a powerful approach for investigating functional enrichment in groups of co-regulated genes across continuous experimental factors, it uses GO annotations largely as a descriptive summarisation of gene groups. In contrast to the GSEA tools described above, GOexpress does not estimate or evaluate enrichment in gene lists; instead, it uses GO annotations in a functional class scoring (FCS) approach to identify terms containing genes that best classify multiple groups of samples according to any type of experimental factor. A comparison of GOexpress features with existing GO analysis software is provided in Table 2.
It is well established that methodologies based on supervised learning of expression data are useful techniques for identification of biologically-relevant markers to differentiate and predict class membership in multi-level classification [45][46][47]. Furthermore, the RF algorithm has been shown to be one of the most robust multi-classifier algorithms for the identification of class predictors [48]. The use of supervised learning approaches such as RF for feature selection is particularly relevant for biological studies where group membership is defined by the experimental design and the number of observation is much smaller than the number of candidate predictors [49]. Notably, GOexpress is demonstrated here using an experimental design that consists of five biological replicates per experimental group, with gene expression measurements from 11,842 microarray probesets.
Although there is debate concerning the optimal sample size for RF [49], we would recommend sample sizes of at least five biological replicates to accurately estimate out-of-bag (OOB) classification error rates.
In GOexpress, each GO term is individually scored by the average capacity of genes associated with the term to classify the predefined groups of samples. Consequently, if a particular GO term is associated with a number of genes that emerge among the best ranked predictors of class membership, this GO term will also be present among the top-ranking GO categories, indicative of robust differences in the corresponding cellular functions or molecular pathways (Fig. 2). Unfortunately, FCS approaches are dependent on the underlying properties of the annotations (e.g., pathway size); therefore, requiring users to choose from a complex range of univariate and multivariate pathway-level statistics [1]. GOexpress, therefore, also allows users to provide their own scoring function, as an alternative to the default averaging of feature scores. It is also important to note that the GO initiative is a rapidly developing resource, which still contains many entries that are only inferred from electronic annotation (IEA) for many species, as opposed to experimentallyvalidated annotations (inferred from direct assay, IDA). Therefore, careful use of appropriate gene annotations is critical for reliable results [5]. An additional feature of GOexpress is the probability of GO term scoring and ranking that may be assessed by estimation of permutationbased P-values. Although more computationally intensive than the use of pre-computed statistical distributions such as the minimum hypergeometric (mHG) statistical framework used by GOrilla, this assumption-free approach enables support of any set of annotations for which the underlying statistical distribution is unknown.
To the best of our knowledge, no currently available software package provides similar integration of multilevel sample classification directly based on gene expression data from both microarray and RNA-seq experiments (with support for new platforms through user-provided custom annotations). Importantly, the data-driven visualisation functions provided in the GOexpress package do not transform the input expression data, assuming this task was performed using dedicated tools such as edgeR [12], limma [13], Cufflinks [50], or DEseq2 [51]. GOexpress, therefore, can be seamlessly integrated within existing computational biology pipelines, and can be used for development of dynamic Shiny web-applications that may be distributed online and offline, promoting collaboration and accessibility of high-throughput biological data and results within and between research groups.

Conclusion
We have introduced GOexpress, an R/Bioconductor package for identification and visualisation of gene expression profiles that best classify sample groups according to any known experimental factor. In contrast to most GO term summarisation approaches, GOexpress integrates prior biological knowledge and gene expression data from individual sample replicates to rank molecular pathways based on the capacity of functionallyrelated groups of genes to classify multiple sample groups. Notably, the use of multiple genes for GO-based classification improves the robustness and biological relevance of the resulting interpretations and predictions.

Availability and requirements
Project name: GOexpress Project home page: http://bioconductor.org/ packages/release/bioc/html/GOexpress.html Operating system(s): Platform independent Programming language: R Other requirements: R 3.1 or higher, Bioconductor 3.0 or higher License: GPL (> = 3) Any restrictions to use by non-academics: None Ethical approval for animal work All animal procedures were carried out according to the provisions of the Irish Cruelty to Animals Act (Department of Health and Children licence number B100/ 3939) and ethical approval for the study was obtained from the UCD Animal Ethics Committee (protocol number AREC-P-07-25-MacHugh).