Markov Chain Ontology Analysis (MCOA)
© Frost and McCray; licensee BioMed Central Ltd. 2012
Received: 10 August 2011
Accepted: 3 February 2012
Published: 3 February 2012
Skip to main content
© Frost and McCray; licensee BioMed Central Ltd. 2012
Received: 10 August 2011
Accepted: 3 February 2012
Published: 3 February 2012
Biomedical ontologies have become an increasingly critical lens through which researchers analyze the genomic, clinical and bibliographic data that fuels scientific research. Of particular relevance are methods, such as enrichment analysis, that quantify the importance of ontology classes relative to a collection of domain data. Current analytical techniques, however, remain limited in their ability to handle many important types of structural complexity encountered in real biological systems including class overlaps, continuously valued data, inter-instance relationships, non-hierarchical relationships between classes, semantic distance and sparse data.
In this paper, we describe a methodology called Markov Chain Ontology Analysis (MCOA) and illustrate its use through a MCOA-based enrichment analysis application based on a generative model of gene activation. MCOA models the classes in an ontology, the instances from an associated dataset and all directional inter-class, class-to-instance and inter-instance relationships as a single finite ergodic Markov chain. The adjusted transition probability matrix for this Markov chain enables the calculation of eigenvector values that quantify the importance of each ontology class relative to other classes and the associated data set members. On both controlled Gene Ontology (GO) data sets created with Escherichia coli, Drosophila melanogaster and Homo sapiens annotations and real gene expression data extracted from the Gene Expression Omnibus (GEO), the MCOA enrichment analysis approach provides the best performance of comparable state-of-the-art methods.
A methodology based on Markov chain models and network analytic metrics can help detect the relevant signal within large, highly interdependent and noisy data sets and, for applications such as enrichment analysis, has been shown to generate superior performance on both real and simulated data relative to existing state-of-the-art approaches.
Ontologies have become a crucial component for the analysis, retrieval and integration of the data underpinning modern biomedical science . Whether structured as controlled vocabularies or expressive description logic-based models, biomedical ontologies have been used to manually and semi-automatically annotate enormous volumes of genomic, clinical and bibliographic information. These annotated datasets support a range of ontology-driven applications such as semantic search, enrichment analysis, data integration and clinical decision support.
Of particular importance in the biomedical space are the family of applications, including enrichment analysis , semantic similarity clustering  and data-based ontology evaluation , that quantify the importance of classes in an ontology relative to a collection of domain data. These applications, especially enrichment analysis based on the Gene Ontology (GO) , have been widely adopted by the scientific community and have proven effective in distilling large datasets that would otherwise be extremely difficult for researchers to interpret. Yet, despite the extensive use and high utility of these applications, the underlying analytical methods remain limited in their ability to successfully detect and synthesize several important types of ontological and dataset complexity, including class overlaps, continuously valued data, inter-instance relationships, non-hierarchical class relationships, semantic distance and sparse data.
To help address these limitations, we have developed a new methodology, Markov Chain Ontology Analysis (MCOA), for analyzing hierarchical models relative to a collection of domain data. Our approach represents the combination of an ontology and the instances in an associated dataset as a single finite ergodic Markov chain whose adjusted transition probability matrix is used to compute modified eigenvector centralities, or steady-state probabilities, for each class and instance. The negative log of these modified eigenvector centralities, a quantity we call the information rank of the class, represents the importance of each class relative to both the data set and other classes in the ontology.
In the remainder of this paper, we outline the analytical challenges that motivated the development of our methodology, detail the mathematical model of our technique and demonstrate its utility in the context of GO enrichment analysis. Following a standard benchmarking process, we demonstrate the ability of a MCOA-based enrichment analysis method to outperform existing state-of-the-art enrichment methods on simulated gene enrichment datasets. To evaluate the performance of MCOA on real experimental data, we compare the enrichment results generated by MCOA with other comparable methods using gene expression data from a study of Parkinson's disease. Finally, we discuss other applications that could benefit from the MCOA approach and our plans for future investigations.
Although the analysis approach we propose is relevant to any application that quantifies the importance of ontology classes relative to a dataset, we frame the discussion in this paper in the context of enrichment analysis. Our focus on enrichment analysis is motivated both because of the widespread use of enrichment analysis in the biomedical field as well as the fact that the technical challenges faced by enrichment analysis methods are directly relevant to many other ontology-based data analysis activities.
Enrichment analysis assesses whether classes in an ontology are statistically over or under-represented in a specific dataset based on the semantic annotations of dataset members relative to some baseline distribution. In the biomedical field, enrichment analysis methods are commonly employed to determine the statistical enrichment of GO categories for gene expression data by comparing the annotation frequency in a target gene list with the annotation frequency in a background collection of genes. The widespread use of the method in this context has motivated the extensive manual annotation of genomic and proteomic data with GO categories and the development of a wide range of enrichment analysis techniques and tools . Although analysis relative to GO is the most common use case, the underlying enrichment analysis techniques are relevant to any biomedical ontology (or even flat categorization) and correspondingly annotated dataset (e.g., enrichment of pathways defined in KEGG ). Recent successes applying enrichment analysis outside the genomic domain include efforts by Tirrel et al  who performed enrichment analysis of the ontologies contained in BioPortal  relative to both MEDLINE and the collection of biomedical repositories aggregated by the NCBO Resource Index .
Whether analyzing genomic data for enrichment of GO categories or bibliographic data for enrichment of classes in a clinical ontology, the same set of enrichment methods can be employed. Huang et al  decomposed the existing diversity of 68 different enrichment methods into three broad classes: singular enrichment analysis (SEA, Class 1), gene set enrichment analysis (GSEA, Class 2) and modular enrichment analysis (MEA, Class 3). SEA represents the traditional linear enrichment analysis strategy and approaches in this category evaluate ontology classes one-at-a-time for enrichment against a fixed list of interesting dataset members using a statistical test like Fisher's exact test following the hypergeometric distribution. Methods in this class vary according to the statistical test employed, the criteria by which the dataset is selected and any special heuristics or weightings applied during analysis. The GSEA class of methods, which includes the original Gene Set Enrichment Analysis (GSEA) technique ) as well as the more recent Random sets  and LR Path  methods, take advantage of experimentally derived weights to evaluate the entire dataset. Methods in the MEA category evaluate the enrichment of multiple ontology classes simultaneously by taking into account the full network of ontology and dataset relationships. Similar to the methods in the SEA category, most MEA methods do not consider continuous instance weights and must therefore be run against a fixed list of interesting data set members.
The MEA category includes the MCOA-based enrichment analysis approach described in this paper as well as a number of state-of-the-art techniques developed since the publication of the Huang et al  survey such as NOA by Wang et al , TopoGSA by Glaab et al , GenGO by Lu et al  and MGSA by Bauer et al [16, 17]. NOA attempts to capture the functional enrichment of inter-instance relationships through the calculation of link annotations and subsequent application of standard statistical enrichment methods to these annotations (e.g., hypergeometric distribution). TopoGSA supports the visualization and analysis of network analytic properties for gene and protein sets mapped to interaction networks. GenGO and MGSA, which both adopt a generative probabilistic model of gene activation, are particularly well suited to the challenge of class overlaps in the presence of noise and are among the best methods in terms of benchmarked enrichment performance. GenGO uses the generative model and a maximum likelihood approach to identify a small set of GO categories that best explains an observed experimental gene list. P-values for this optimal set of GO categories are then computed using the standard Fisher's exact test, or any other desired test statistic, including optional multi-hypothesis correction. Motivated by the GenGO approach, Bauer et al  also adopted a generative model of gene activation for their MGSA method. MGSA uses a Bayesian network to model gene activation and represents GO enrichment using the marginal posterior probabilities of each GO category computed using a Markov Chain Monte Carlo algorithm. Rather than identifying a fixed set of classes that maximize the objective function based on the generative model, MGSA provides a posterior probability enrichment score for all classes. Although not directly comparable against GSEA methods, when evaluated against existing SEA and MEA methods, the GenGO and MGSA methods are significantly better, as measured on synthetic data, at correctly identifying enriched GO categories while minimizing reported false positives and false negatives [15, 16].
Despite the extensive use and high utility of enrichment analysis applications and the important recent advances made in the GSEA and MEA categories, existing analytical methods remain limited in their ability to successfully analyze the full spectrum of ontological and dataset complexity. Challenging structural features include overlaps between ontology classes, continuous instance and annotation weights, relationships between instances, non-hierarchical relationships between classes, semantic distance and sparse data. These analytical challenges, and how current enrichment methods attempt to address them, are discussed in further detail below.
Methods in the SEA and GSEA categories commonly generate enrichment results comprising long lists of highly correlated classes, leaving users to determine which of multiple, largely redundant, classes are actually relevant. This problem is due to both the overlaps between class members and the fact that SEA and GSEA methods evaluate each class independently for enrichment and thus fail to take class interdependencies into account. Overlaps between the member sets of different classes can result from several structural features:
Inheritance: one class is an ancestor of the other class and therefore all dataset members annotated to the descendant are implicitly annotated to the ancestor.
Multiple parents: both classes share a common descendant and therefore are implicitly annotated with the same dataset members.
Multiple annotations: a dataset member is annotated to both classes (or descendants of both classes).
Overlaps between classes are very common in practice with each GO term overlapping with an overage of 1078 other terms based on common human gene annotations (see Additional File 1 for details). When overlaps between enriched classes exist because of multiple annotations, the results are also skewed in favour of instances associated with a large number of classes. This distortion can be particularly problematic for cases where annotation bias exists (e.g., protein annotation bias ) or cases where the total amount of enrichment evidence should be based on the number of instances and their weights rather than on the number of annotations (e.g., web page ranking using the PageRank algorithm ).
The class overlap problem has been explored by several existing enrichment analysis approaches including MGSA, GenGO, parent-child union by Grossmann et al  and elim and weight by Alexa et al . The parent-child union, elim and weight methods all address overlaps by computing statistical enrichment using the hypergeometric distribution with counts weighted according to the hierarchical structure of the ontology. Parent-child union computes enrichment for a specific class in the context of dataset members annotated to the parents of the class. Elim removes genes annotated against enriched subclasses when computing enrichment for parent classes and weight generalizes the elim approach by adjusting gene weight to a value between 0 and 1. Because the weighting heuristics used by parent-child union, elim and weight utilize just the structure of the ontology, these methods only address overlaps due to inheritance or multiple parents. Although GenGO and MGSA are able to detect all cases of overlaps, the fact that these methods collapse the ontology hierarchy means that they are unable to distinguish between the different cases of overlap, which impacts support for semantic distance and annotation bias.
A key drawback of methods in the SEA category and most methods in the MEA category is their inability to model continuously valued data. For most biological data of interest in an enrichment analysis scenario, dataset members have varying levels of experimental significance and continuous weights can be associated directly with each instance (e.g., differential gene expression, test statistic associated with SNP-to-gene analysis, etc.) or with each instance-to-class annotation (e.g., probabilistic confidence score generated via statistical classification, GO annotations weighted according to source of evidence, etc.). Continuous weights can also be associated directly with classes or with inter-instance and inter-class relationships (e.g., protein-protein interaction scores, gene co-expression scores, etc.). Analyzing continuously valued datasets using SEA or MEA methods requires the use of an arbitrary cut-off with all dataset members or annotations above the cut-off given equal weighting in the analysis, potentially leading to significantly skewed enrichment results. Addressing this shortcoming is the primary objective of methods in the GSEA category including Gene Set Enrichment Analysis (GSEA) , which computes statistical significance for all genes in all differentially expressed arrays using a weighted Kolmogorov-Smirov test; LRPath , which uses a logistic regression likelihood ratio test compute significance of enrichment for all genes taking expression level into account; Random-sets , which incorporates quantitative instance scores to compute class enrichment values using an analytical approximation of the statistical distribution and is asymptotically equivalent to the LRPath technique; and ProbCD , which supports probabilistic instance and annotation weights and computes statistical significance using Goodman-Kruskal gamma and comparison against a null distribution estimated via random permutations.
Although the GSEA methods avoid a potentially arbitrary dataset "cut-off" through the use of continuous dataset weights, this requirement can be problematic in cases where a single biologically meaningful value for each gene does not exist. GSEA methods are further limited by their one-at-a-time analysis of ontology classes and, in practice, have been found to generate enrichment results very similar to those output by SEA methods on actual experimental data .
Meaningful relationships often exist between the members of the datasets targeted for enrichment analysis (e.g., citation links between publications, protein-protein interaction links, gene-gene links in gene regulatory networks, etc.). Network models are particularly well suited for representing the interconnections in real biological systems [23–25]. Similar to the links in a social network or hyperlinks between web pages, such instance-level relationships provide evidence of a relative ranking between instances that can be quantified using network analysis metrics such as eigenvector centrality. The use of such network analysis techniques is commonly performed on biomedical networks comprising data instances. Although the output from this type of analysis can be used to adjust the weight of genes for subsequent enrichment analysis using GSEA category methods capable of handling continuous values, current state-of-the-art methods do not compute or use such metrics for enrichment analysis. While the NOA method of Wang et al  does directly focus on the relationships between dataset members, the goal of this approach is a functional analysis of the gene-to-gene links themselves rather than the use of gene-to-gene links to adjust the functional enrichment of specific genes. Analysis of datasets lacking links between dataset members is not possible with NOA.
Standard enrichment analysis only considers hierarchical relations between classes (is-a, part-of), however, many relevant biomedical ontologies, including GO, include non-hierarchical class relationships (e.g., regulates). Accounting for such inter-class relationships may be even more relevant in scenarios where multiple inter-related ontologies are jointly analyzed and inter-class relationships are used to capture mappings between classes in different ontologies (e.g., relationship between GO categories and KEGG pathways). Although the same network analytical methods used to analyze instance-level links can be applied on the ontology graph, the current set of state-of-the-art enrichment methods do not do so, and, for most enrichment approaches, their incorporation is not feasible due to the nature of the underlying statistical tests.
When analyzing data against hierarchical ontologies, it is generally desirable to bias more specific classes over more general classes when both classes are associated with the same number of dataset members. Standard SEA category methods like Fisher's exact test measure significance based solely on annotation frequency and ignore semantic distance. Although semantic distance is incorporated into methods such as parent-child union, elim and weight, the state-of-the-art MEA methods GenGO and MGSA use flattened representations of the ontology and therefore fail to explicitly incorporate semantic distance.
Real datasets frequently suffer from sparsity due to a variety of data collection and experimental design issues . Bayesian approaches, which incorporate prior probabilities based on knowledge about the likely statistical distribution of the data, are better able to handle sparse data then frequentist approaches like those based on Fisher's exact test, which need to employ some type of smoothing (e.g., Laplace or add one smoothing). Bayesian methods that perform enrichment analysis using a prior probability distribution include MGSA and the BayGO framework . Although these Bayesian methods enable the enrichment analysis of sparse data, their lack of support for inter-instance relationships, non-hierarchical class relationships and semantic distance means that only a limited range of sparse datasets can currently be analyzed.
Our approach represents the combination of the classes in an ontology and the instances in an associated dataset as a single finite ergodic Markov chain whose adjusted transition probability matrix is used to compute modified eigenvector centralities, or steady-state probabilities, for each class. These modified eigenvector centralities, a quantity we term the information rank, provide a measure of the importance of each class relative to both a dataset and the other classes in the ontology. Similar to annotation frequency, the information rank of a class can be used to support applications that compare the importance of a class in a target dataset with a baseline dataset (e.g., enrichment analysis).
For defining our approach and discussing other related methods, we follow Bade et al  and Cimiano et al  and adopt a simplified formal model of an ontology and its extension as a rooted hierarchy with instance assignments. Although both our analysis approach and many related techniques can be generalized to more complex structures, as formalized by the description logic-based models  used for popular ontology modelling languages such as OWL, this minimal structure contains the essential modelling primitives for evaluating GO enrichment analysis and allows the methodology to be developed with minimal descriptive complexity. Definitions 1 and 2 below formally define the ontology model. Potential extensions to this model include class weights, weights for inter-class relationships, weights for instance-to-class relationships and weights for inter-instance relationships.
Definition 1 (Ontology): An ontology is a directed acyclic graph of classes structured in a hierarchy and represented by the tuple O = 〈C, parent(c)〉
C is a non-empty set class identifiers
A strict partial ordering relation parent(c) that maps each class c in C to the set of direct parents of c in the class hierarchy. ∀c ∈ C: parent(c) ⊆ C
Definition 2 (Ontology Extension): The extension of an ontology is represented by the tuple E = 〈I, type(i), rel(i), weight(i)〉
A potentially empty set I of instance identifiers
An instance type relation type(i) that maps each instance in I to a set of one or more classes in C. ∀i ∈ I: type(i) ⊆ C
An inter-instance relation rel(i) that maps each instance in I to a set of zero or more other related instances in I. ∀i ∈ I: rel(i) ⊆ I
An instance weight relation weight(i) that maps each instance in I to a normalized weight between 0 and 1. ∀i ∈ I: weight(i) ∈ 0
Our proposed methodology for analyzing an ontology relative to a collection of domain data represents the combination of an ontology and its extension as a finite ergodic Markov chain. A finite Markov chain is a finite stochastic process in which the probability of transitioning from a state i to a state j is only dependent on the state i and not on the path taken through the chain to arrive at state i . This property of a Markov chain is called the Markov property and, for an ergodic Markov chain, it enables the state transitions to be represented as a stochastic matrix with the special property of possessing a principal left eigenvector for the maximum eigenvalue of 1. The components of this principal left eigenvector represent the steady-state probabilities for each state in the chain. Definition 3 below provides the formal specification of a Markov chain.
Definition 3 (Finite Ergodic Markov Chain): A finite ergodic Markov chain is a finite stochastic process characterized by:
A non-empty set of states S of size N
An N × N transition probability matrix P where each entry p ij represents the probability that the state will be j if the current state is i.
The transition probability matrix for a Markov chain is a stochastic matrix with a principal left eigenvector, , of length N for its largest eigenvalue of 1.
For a finite ergodic Markov chain, the components of this principal left eigenvector are the steady-state probabilities, or eigenvector centralities, of the states of the Markov chain.
At the core of our methodology is a process for computing an eigenvector-based score for each class in an ontology relative to an extension of that ontology (i.e., a collection of data annotated using the ontology classes). We call this the information rank based on its similarity to the well-known PageRank algorithm for computing the ranks of web pages using a Markov model of a random walk with jumps through web page links . The MCOA process involves three key steps:
Step 1: Model the ontology and extension as a single finite ergodic Markov chain.
Step 2: Create an adjusted transition probability matrix for the Markov chain.
Step 3: Use the transition probability matrix to compute the eigenvector-based steady-state probability and information rank for each ontology class.
Our approach builds a Markov chain model of an ontology and its extension by mapping classes in the ontology and the instances of those classes to states in the Markov chain and by mapping all instance-to-class relations and hierarchical relations between classes to state transitions. Given the simplified model of an ontology and its extension specified in Definitions 1 and 2 and the model of a finite ergodic Markov chain specified in Definition 3, the process for building a Markov chain from an ontology and its extension is formalized in Definition 4 below. Figure 1B shows an example Markov chain for the ontology in Figure 1A generated according to this mapping.
Definition 4 (Ontology-to-Markov Chain Mapping): The mapping between an ontology O and its extension E (as defined in Definitions 1 & 2) and a finite ergodic Markov chain is characterized by:
Equivalence mapping class(s) (and inverse mapping state(c)) between the states in subset S C of the Markov chain and the classes in set C (i.e., there is a one-to-one mapping between each class and each Markov chain state in S C ).
Equivalence mapping inst(s) between the states in subset S I of the Markov chain and the instances in set I (i.e., there is a one-to-one mapping between each instance and each Markov chain state in S I ).
Calculating the transition probability matrix for the Markov chain defined above involves three key adjustments:
A random jump probability α. This is equivalent to the damping factor, d, used in the PageRank algorithm, specifically α = 1-d.
A parameter, ω, that controls how much of the random jump probability is distributed among class states, SC, vs. instance states, SI
The weights of each individual instance, as specified by the function weight(i)
Using these parameters, the creation of the adjusted transition probability matrix can be formalized according to Definition 5 below. Figure 1C contains the adjusted transition probability matrix created for the ontology in Figure 1A according to this process.
Definition 5 (Adjusted Transition Probability Matrix): The adjusted transition probability matrix P for the finite ergodic Markov chain that represents an ontology and its extension, as specified in Definition 4 above, is defined by:
A random jump parameter, α, which determines the probability that the Markov chain makes a random jump to one of the other states rather than following the defined transitions from that state.
A probability distribution weight, ω, that determines how probabilities are distributed between states representing classes, S C , and states representing instances, S I , following each random jump. If ω = 0, random jump probability is distributed only among instance states, likewise, if ω = 1, random jump probability is distributed only among class states.
The instance weight function weight(i), which is used to compute a potentially non-uniform distribution of random jump probabilities among the instances.
The use of the random jump and non-uniform distribution parameters defined above has several benefits in the context of our method:
It ensures that the Markov chain is ergodic (it would otherwise be absorbing given the 0 out-degree for any root node).
It allows for prior probability smoothing. Classes without instances can be assigned a configurable portion of the random jump probability as a form of prior probability smoothing. By varying the ω parameter between 0 and 1, the relative weight of a uniform prior probability distribution can be adjusted relative to the analyzed dataset distribution.
It enables the use of class and instance weighting. Similar to the topic-sensitive PageRank approach , a non-uniform distribution of random jump probabilities can be used to mirror differential class and instance weights.
It allows semantic distance to be quantified. The amount of transferred rank naturally decays as one moves up the hierarchy.
Given an adjusted transition probability matrix as specified in Definition 5 above, the importance of each class relative to the dataset can be quantified using the components of the principal left eigenvector that correspond to classes in the ontology. These eigenvector components represent the steady-state probabilities of the class states in the associated Markov chain. Normalizing these steady-state probabilities relative to the probabilities for all class states and then taking the negative log of the normalized probabilities generates the information rank. The definitions of steady-state class probability and information rank are formalized in Definitions 6 and 7 below. Figure 1D shows the information rank values for the example ontology in Figure 1A.
Our initial application of the MCOA method to enrichment analysis adopts the probabilistic generative model of gene activation used by both GenGO and MGSA. It specifically extends the GenGO maximum likelihood approach by adding MCOA-based terms to the objective function used in the original GenGO algorithm. Although our initial enrichment analysis method extends GenGO, MCOA can be integrated with other enrichment methods or used directly to determine enrichment significance by employing permutation tests to compute a distribution of possible information rank values. Our choice of GenGO as a base approach was motivated by several factors:
GenGO is one of the best state-of-the-art methods. GenGO and MGSA are two state-of-the-art MEA approaches shown to provide overwhelmingly superior enrichment performance on simulated data.
GenGO is feasible to extend. Integration of MCOA through modification of the objective function was both feasible and straightforward.
GenGO returns intuitive results with flexible statistics. The GenGO process outputs p-values, using the statistical test of choice, for the set of categories that maximize the log likelihood objective function. Use of p-values, as opposed to the marginal posterior probabilities used by MGSA, make the results of this method more intuitive to researchers and more easily comparable to the results from other enrichment methods. Use of multiple hypothesis correction is also optional.
Execution of the MCOA enrichment analysis algorithm involves three steps:
Step 1: Compute steady state probability scores for the ontology relative to both the reference and target datasets.
Step 2: Find the set of ontology classes that maximizes the likelihood of the observed dataset given a probabilistic generative model.
Step 3: Compute p-values and apply multi-hypothesis correction.
Algorithmic details for each of these steps are outlined below.
This step follows the core MCOA process outlined above.
The MCOA approach modifies the GenGO objective function by replacing the α|C| term that penalizes the sizes of active GO categories by a term computed from the MCOA-based steady state probability scores for each active category. This modification of the GenGO objective function to incorporate MCOA steady state probability scores as a regularization parameter is similar to approaches taken for SNP selection during GWAS analysis in which the objective function for a stochastic wrapper algorithm is modified to include preprocessed attribute quality estimates . This replacement term, which is equivalent to a weighted log-odds value, still penalizes large sets of active GO categories while also giving a preference to those categories whose steady state probability is larger in the target dataset than in the reference dataset. Where the steady-state probability ratios are equal for two categories, the weighting acts to prefer the category with a greater steady state probability in the target dataset. Similar to the original GenGO method, MCOA optimizes the objective function via a greedy search algorithm. Optimization of the p and q values also follows the GenGO approach. Although originally specified in terms of GO categories and genes, this approach can be easily generalized to the generic ontology model outlined earlier in the paper and this generalized description is used in the formal definition of the modified objective function in Definition 8 below.
C is the set of active ontology classes
G is the set of active instances
q is the false positive rate or the percentage of instances not associated with an active ontology class that are activated
(1-p) is the false negative rate or the percentage of instances associated with an active classes that are deactivated
A g is the set of active instances annotated with at least one active class
A n is the set of active instances not annotated with any active classes
S g is the set of annotations (materialized according to the ontology hierarchy) between inactive instances and active classes
S n is the set of annotations (materialized according to the ontology hierarchy) between inactive instances and inactive classes
ssp(c) ref is the steady state probability for ontology class c computed using the reference dataset
ssp(c) tar is the steady state probability for ontology class c computed using the target dataset.
β is a parameter that weights the steady state probability regularization term.
For the set of ontology classes that maximizes the objective function, p-values can be computed using any desired statistical test. Similar to the original GenGO method, the current implementation of MCOA computes p-values using the hypergeometric distribution. If desired, multiple hypothesis correction methods can also be applied to the generated p-values. An important benefit of this approach is that multiple hypothesis correction only needs to consider the subset of classes that maximize the objective function rather than all classes in the ontology.
To demonstrate the utility of the MCOA methodology for enrichment analysis of biomedical data, we compared the performance of the MCOA method against GenGO (the Ontologizer implementation), MGSA, Alexa et al's weight method , Grossmann et al's parent-child union and the standard hypergeometric test for Gene Ontology enrichment of simulated Drosophila melanogaster, Homo sapiens and Escherichia coli data sets. For the GenGO, MGSA, weight, parent-child union and hypergeometric methods, we used the implementations and configurations from the Ontologizer framework that were employed to generate the benchmarking results in Bauer et al .
To enable comparison with prior work, our benchmarking process follows the general approach adopted by Bauer et al , Lu et al , Grossmann et al  and Alexa et al . This process builds a test gene list using a pre-selected set of active GO categories, with specific false negative and false positive rates, and then evaluates each enrichment analysis method, using precision/recall metrics, based on its ability to identify the originally selected categories within the noisy dataset. The following parameters control the creation and analysis of the simulated datasets following this approach:
Source of GO annotations: Creation and analysis of the simulated datasets was performed using the following ontology and species annotation files downloaded from the source control repository links on the Gene Ontology website : Gene Ontology (revision 1.2078, 34,171 total GO categories), Drosophila melanogaster annotations from FlyBase  (file revision 1.209; 12,966 gene products with 78,094 annotations to GO categories), the Homo sapiens annotations from Go Annotations @ EBI  (file revision 1.197; 18,307 gene products with 237,437 annotations to GO categories) and the Escherichia coli annotations from EcoCyc  & EcoliHub  (file revision 1.57; 3884 gene products with 39,129 annotations to GO categories).
Selection of active GO categories: Following prior work [15, 16, 20, 21] we varied the number of active GO categories between 1 and 5 and avoided selecting hierarchically related categories. Also following prior work , we filtered the set of potential active categories to remove categories with fewer than 5 annotations. Such a minimum annotation threshold helps ensure that the selected categories are more likely to be biologically meaningful in the context of experimental data analysis (similar filters are supported on enrichment analysis tools for this same purpose). Whereas Lu et al used categories with 5 or more direct or indirect annotations, we have chosen to filter based on just direct annotations. Our motivation for using direct as opposed to total annotations is several-fold:
2. Create simulated datasets that are more consistent with a generative model of gene activation. Categories with very few or no direct annotations are more likely to be high-level grouping constructs with low analytical value than categories with at least a few direct annotations. A direct annotation for a high-level category provides evidence that the category, rather than one of its subcategories, has been found by curators to provide the best explanation for a specific piece of experimental data. We believe that requiring such evidence for active categories results in datasets that better reflect a generative model of gene activation and represent more biologically meaningful categories.
3. Create simulated datasets that highlight key analytical challenges. Filtering based on either direct or total annotations creates a dataset with a high mean annotation level and increased level of class overlaps. Filtering by direct annotations has the added benefit of generating datasets with a larger ratio of direct-to-indirect annotations, highlighting the challenge of differentiating between these types of annotations during enrichment analysis, a distinction ignored by most enrichment methods. With no filtering, each GO category with Drosophila annotations has an average of 7 direct and 61 total annotations. Requiring a minimum of 5 direct annotations results in a set of potentially active categories with an average of 29 direct and 115 total annotations. If a minimum of 5 total annotations is required, the set of active categories has an average of 14 direct annotations and 120 total annotations.
False positive rate (q): Probability that a gene not associated with an active category is activated. GenGO tested with fairly low false positive rates of 0.01 and 0.15. MGSA reported results for false positive rates of 0.1 and 0.4. The results shown below use a value of 0.1, which corresponds to one of the MGSA values and is between the two GenGO values. Simulations were also performed for false positive rates of .01 and .4 and results can be found in Additional Files 2, 3, 4, 5, 6 and 7.
False negative rate (1-p): Probability that a gene associated with an active category is deactivated. GenGO reported primary results for false negative rates of 0.1 and 0.5. MGSA reported results for false negative rates of 0.25 and 0.4. The results shown below use a value of 0.25, which matches one of the GenGO settings and is in the between the two MGSA values. Simulations were also performed for false negative rates of .1 and .4 and results can be found in Additional Files 8, 9, 10, 11, 12 and 13.
Enrichment threshold for precision/recall calculations (σ): The prior benchmarking work by Bauer et al, Lu et al and others computed precision/recall statistics on the rank ordering of analyzed categories irrespective of the actual enrichment significance assigned by the analysis method. Although this is a straightforward evaluation approach that makes comparative evaluation easier, it fails to accurately reflect the performance or actual usage patterns of the underlying enrichment analysis methods. Even though a given method may return all active categories (i.e., 100% recall) with only a few false positives (i.e., high precision), if few of the active categories had enrichment p-values that were significant, a user would have ignored most of these valid results, making the reported precision/recall values misleading. Similar issues also occur when generation of significant enrichment values for the top set of valid categories also results in significant enrichment values for a much larger set of invalid categories. Users analyzing such a result set would need to consider a much larger set of significantly enriched categories despite the high reported precision/recall. Given these factors, we also compared enrichment methods using precision/recall numbers generated using only categories with significant enrichment scores after multiple hypothesis correction. We used a threshold of 0.5 for the MGSA marginal posterior probability, which is the level at which categories are more likely than unlikely according to MGSA (this is the default threshold used for this method in the Ontologizer tool and was the threshold used for MGSA by Bauer et al for their analysis of experimental data). For all other methods, we used a p-value threshold of 0.01 after multiple-hypothesis correction using the Bonferroni method.
To demonstrate the utility of the MCOA method on real experimental data, we compared the enrichment results generated by MCOA, GenGO, MGSA and the standard hypergeometric test on differentially expressed genes from a study of Parkinson's post-mortem brain samples available in the Gene Expression Omnibus (GEO)  as dataset GDS3129 .
The R GEOquery package  was used to retrieve both the raw microarray data and the genes associated with the array platform, which were used as the reference gene list for subsequent enrichment analysis. The set of genes significantly differentially expressed between cases and controls was computed using the R limma  package by fitting a linear model, applying empirical bayes shrinkage to compute moderated t-statistics and finally using Benjimani-Hockberg multiple hypothesis correction. Those genes with a false discovery rate below .05 were kept for further analysis and, following the recommendation of Falcon and Gentleman , this set was divided based on t-statistic sign into a group whose expression was positively correlated with Parkinson's cases and a group whose expression is negatively correlated with Parkinson's cases. Only the positively correlated group was considered for further analysis with the modified t-statistic values used as a gene weight for the MCOA method. Using the modified t-statistic as a weight enabled us to leverage MCOA's ability to support continuously valued data. Although the MCOA, GenGO and MGSA methods are all able to estimate the false positive rate (q) and false negative rate (1-p) from the data, for this comparison, we ran all methods with fixed false positive and false negative rates of 0.05. For MCOA, the regularization constant β was set to 0.6.
To validate our approach, generate experimental results for this paper and analyze real biomedical data, we have created a prototype implementation of the MCOA core methodology and MCOA enrichment analysis method described above. The core MCOA method was implemented in Java™(version 1.6) using JUNG  for the creation of the graphical model and calculation of eigenvector components, Apache Commons Math  for basic statistical computations and Jena  for processing and reasoning over OWL ontologies .
The MCOA-based enrichment analysis method was implemented in Java™ as an extension to the Ontologizer 2 framework  and the Ontologizer implementation of the GenGO algorithm. We used the Ontologizer GenGO implementation both to enable comparison with the MGSA benchmarking results and because the original GenGO implementation is not accessible for extension. The benchmarking results reported below were computed using a modification of the Ontologizer benchmarking framework used by Bauer et al  for evaluating MGSA with additional data processing and statistical computation performed via R.
The MCOA enrichment analysis application can be accessed at the project homepage .
Class overlaps. Figures 3A and 3B illustrate the two key types of overlaps between non-hierarchically related classes. In Figure 3A, the overlap is due to a single instance being associated with both C2 and C3 (i.e., a multiple annotation overlap). As illustrated by the information content values, analysis based on annotation frequency ignores this overlap and assigns equal weight to C2, C3 and C4. The MCOA method, on the other hand, detects the overlap and divides the impact of the shared instance between C2 and C3 giving these two classes a higher information rank than C4. In Figure 3B, the overlap is due to classes C6 and C7 being associated with multiple parent classes. Because classes C2, C3 and C4 still have equal numbers of instances, they look identical from an information content perspective. The MCOA method also detects this type of overlap and correctly assigns C2 and C4 lower information rank values than C3.
Continuously valued data. Figure 3C contains a variation of the simple ontology from Figure 1A in which some of the instances have been assigned continuous weights. As shown in the figure, a binary assessment of annotation frequency results in uniform information content values for classes C2, C3 and C4. The MCOA approach, because it generates a score that is sensitive to continuous weights, produces the correct differential ranking of C3, C2 and, lastly, C4.
Inter-instance relationships. In Figure 3D, the members of the dataset are connected via inter-instance links with the C2 instances having a balance of in and out links, the C3 instances having net out-links and the C4 instances having net in-links. The MCOA methodology is able to directly integrate the impact of these links and, as shown by the information rank scores, correctly identifies a differential ranking of C4, C2 followed by C3. From the perspective of information content, all three classes appear identical.
Semantic distance. Figure 3E provides a trivial example of semantic distance. Because class C2 is the only child of class C1, it is indistinguishable from an information content perspective. The information rank measure, through the random jump parameter α, reflects the relative semantic distance between the classes, with more specific classes given a higher weight. In this case, the MCOA method correctly assigned C2 a lower information rank than its parent C1.
Sparse data. Figure 3F shows a simple example of a sparse dataset in which one of the classes, C4, lacks associated instances. The MCOA approach, when used with a non-zero α and non-zero ω, supports smoothing of sparse datasets through a form of prior probability weighting resulting from the uniform distribution of random jump probability. As shown in the example, this form of smoothing gives C4 a low, but non-zero, steady state probability and correspondingly high relative information rank.
As the precision/recall curves in Figures 4, 5 and 6 show, the performance of the MEA methods MCOA, GenGO and MGSA dominate the comparable results of the weight, parent-child union and hypergeometric methods for all species and all parameter configurations.
When precision/recall metrics are calculated irrespective of enrichment values, as show in Figures 4A, 5A and 6A, the MCOA method performs measurably better than GenGO for all species, slightly better than MGSA on E. coli and Homo sapiens and on par with MGSA for Drosophila (average precision values for MCOA of 0.781, 0.834 and 0.859 on E. coli, Drosophila and Homo sapiens compared to 0.694, 0.751 and 0.804 for GenGO and 0.763, 0.838 and 0.846 for MGSA). Figures 4B, 5B and 6B show these same results with only statistically significantly enriched GO categories counted as positives for precision/recall statistics. When enrichment significance is considered during precision/recall calculations, the performance edge of the MGSA method disappears and MCOA becomes the clearly superior approach (average precision values for MCOA of 0.753, 0.821 and 0.851 on E. coli, Drosophila and Homo sapiens compared to 0.66, 0.729 and 0.778 for GenGO and 0.62, 0.706 and 0.738 for MGSA). Although p-value and marginal posterior probability thresholds are not directly comparable and a lower threshold for MGSA could plausibly be selected, which would narrow the average performance delta, any reasonable marginal probability threshold would still give MCOA a measurable performance delta over MGSA.
Overall, the MCOA method provides superior enrichment performance across a range of species and experimental parameters. It is important to note that these benchmarking tests, in order to support comparison against other state-of-the-art methods, only reflect performance on data sets that exercise the class overlap and semantic distance challenges. On datasets that incorporate continuous data values, inter-instance relationships, non-hierarchical class relationships or sparse data, the relative advantage of the MCOA method should be even more significant.
The top GO terms returned by the standard hypergeometric method are all at a very high level in the GO tree (the forth ranked result is the root biological process) and a number of terms are redundant due to hierarchical overlap. Although both GenGO and MGSA generate results that are generally similar in content and specificity to those returned by MCOA, a close inspection reveals important differences impacting result quality and utility to experimental scientists. The second term in the GenGO results, carbohydrate homeostasis, receives all relevant experimental annotations from the single child glucose homeostasis. Glucose homeostasis should therefore be flagged for enrichment instead of carbohydrate homeostasis. Because the MCOA regularization term penalizes semantic distance, it correctly ranks glucose homeostasis above carbohydrate homeostasis. GenGO also fails to return GO term lymphocyte mediated immunity in the top ten results and instead identifies the nearby, but less significantly enriched and more general, term leukocyte mediated cytotoxicity (leukocyte mediated cytotoxicity is a sibling of leukocyte mediated immunity which is parent of lymphocyte mediated immunity). In this case, all nine differentially expressed genes annotated to leukocyte medidated cytotoxicity are also annotated to lymphocyte mediated immunity. Because MCOA divides the contribution of a gene between all annotated terms, the more granular lymphocyte mediated immunity with some direct gene annotations is preferred over leukocyte mediated cytotoxicity. GenGO also includes the overly specific positive regulation of angiogensis rather than parent regulation of angiogensis. The parent is more appropriate since the other two children (negative regulation of angiogenesis and regulation of cell migration involved in sprouting angiogenesis are also enriched leading to a much more significant enrichment p-value for regulation of angiogenesis (.000199) vs. positive regulation of angiogenesis (.0042)). MCOA correctly identifies regulation of angiogenesis in the top ten results.
The results returned by the MGSA method have similar issues, when compared to MCOA, as the GenGO results (e.g., MGSA also fails to identify lymphocyte mediated immunity and ranks positive regulation of angiogenesis in the top ten rather than regulation of angiogenesis). In terms of utility for users, however, a more significant difference between MCOA and MGSA relates to MGSA's use of marginal posterior probabilities and the impact these probabilities have on ranking and interpretation of enrichment results. Although both MCOA and MGSA identify many similar GO terms in the top rankings, the marginal posterior probability rankings of MGSA can differ substantially from what is achieved when hypergeometric p-values are used on the terms that optimize the objective function. We believe that the use of hypergeometric p-values by MCOA and GenGO leads to a top set of rankings whose relative order and statistical significance is more easily interpretable by scientists.
Ontology-based data analysis methods such as enrichment analysis and semantic similarity clustering have become critical tools for processing the experimental results of modern biomedical science. Without the abstract lens of classifications such as GO and KEGG, the large gene and protein lists generated by molecular biological research would be difficult to analyze manually and almost impossible to compare meaningfully across experimental populations or species. Despite the important role that these methods play in interpreting and guiding biomedical research, their utility has been hampered by the limitations of traditional analytical methods to handle the complex interdependencies present in real biomedical data and associated data models. The members of real biological datasets do not cleanly sort into independent classes but instead group into complex collections of nested and overlapping categories, with direct relationships between dataset members and a mixture of continuous and categorical data values.
Tackling this complexity requires methods that perform a global, rather than local, analysis of the ontology and dataset to capture the full range of structural interdependencies and data values. Although recent methods in the GSEA and MEA categories have made notable advances in this area, specifically in addressing class overlaps and continuously valued data, the interesting features of many biological datasets remain inaccessible to analytical tools. To help address the challenge of biological complexity, we developed the MCOA method as a network analytic framework capable of addressing the class overlap and continuously valued data challenges targeted by MEA and GSEA methods as well as supporting continuous relationship values, inter-instance relations, non-hierarchical class relations, semantic distance and sparse data.
Underlying the MCOA method's analytical behaviour and its ability to successfully detect structural complexity is the method employed for building a Markov chain model and computing steady state probabilities. Several features of the MCOA Markov chain model are critical to its functionality:
Assignment of probabilistic weight per instance rather than per annotation. Under the MCOA Markov chain model, the weight for each dataset instance is divided among all of the classes to which the instance is annotated. This weight is initially divided among all direct annotations of the instance and, as it propagates through the Markov chain, consolidates in an increasingly smaller number of classes until the entire instance weight is concentrated at the root. The MCOA approach contrasts with the annotation frequency approach in which the full instance weight is assigned to each annotated class with the effect that instances shared by many classes contribute the same weight as instances annotated to only a single class. MCOA uses the differential contribution of instances with a large number of class annotations and those with small number of annotations to help detect class overlaps resulting from multiple annotations and multiple parents.
Flexible relationships. Traditional analysis methods only model hierarchical class relationships and class-to-instance annotations. Some methods, such as GenGO and MGSA, ignore most hierarchical information by analyzing a collapsed representation of the ontology graph. The MCOA method, in contrast, analyzes the full ontology and dataset network and can additionally handle relationships, such as inter-instance relationships and non-hierarchical relationships between classes, that are important for modelling real biomedical data but are not directly supported by existing MEA approaches.
Semantic distance computation. The use of a random jump parameter allows semantic distance to be quantified and hierarchical overlaps to be detected, since the amount of transferred rank naturally decays with each transition up the ontology hierarchy. Although semantic distance is captured at some level by enrichment methods such as elim and weight, it is ignored by the more recent MEA approaches GenGO and MGSA as well as by techniques in the GSEA category.
Continuous values for instances, classes and relationships. A non-uniform distribution of random jump probabilities can be used in the MCOA method to mirror differential class and instance weights. The Markov chain model also enables continuous values to be applied to inter-class, class-to-instance or inter-instance relationships. With existing state-of-the-art analysis methods, support for continuous data values is usually limited to dataset instances.
Prior weighting. The non-uniform distribution of random jump probability also allows the MCOA method to apply any desired prior probability distribution to achieve smoothing of sparse data or to align with a Bayesian analysis approach.
We chose enrichment analysis as the context in which to explore and validate the functionality of the MCOA method. In developing and benchmarking a MCOA-based enrichment analysis approach, we aimed to create an enrichment tool with the best performance among existing state-of-the-art methods on simulated datasets created to highlight the complexities encountered in real biomedical data. We also aimed to create a practical methodology capable of generating enrichment results on real data sets that are specific, non-overlapping and of high utility to experimental biologists. The superior performance achieved by the MCOA enrichment analysis approach can be understood in terms of the kinds of type I and type II errors encountered by the other generative methods (GenGO and MGSA) but avoided by MCOA.
In this context, type I, or false positive, errors represent cases where an enrichment method incorrectly identifies a non-active category as enriched. There were two varieties of type I errors commonly made by the other generative methods that were avoided by MCOA:
Incorrectly flagging non-active categories that are more general than an active category. In these cases, the more general category appears enriched because it is inheriting all of the annotations from the active category along with a significant number of additional annotations enabled due to noise. MCOA is able to correctly ignore these categories because the contributions from the active category are discounted due to both semantic distance and overlaps with other classes. GenGO and MGSA, because they collapse the ontology graph and give each annotation equal weight regardless of the number of annotations, do not discount the contributions from the active category and incorrectly flag the more general category as enriched.
Incorrectly flagging non-active categories that are not hierarchically related to an active category, have a small number of associated genes and few or no direct annotations. In these cases, the non-active category appears enriched due to noise. Because these categories have few annotated genes and almost no directly annotated genes, MCOA assigns the category a low steady-state probability and does not include it in the set of significantly enriched categories. Because the other generative methods assign weight per annotation and ignore semantic distance, they give the category an incorrectly high weight and mark it as enriched.
Type II, or false negative, errors represent cases where an enrichment method fails to identify an active category as enriched. In our experiments, the other generative methods commonly failed to identify as enriched active categories that had a small number of directly annotated genes. When analyzed by MCOA, these categories have a higher relative steady-state probability due to both the lack of a semantic distance discount for the direct annotations and the fact that direct annotations will not have overlaps due to multiple parents. Because of this higher relative steady-state probability, MCOA is able to successfully mark these categories as enriched. GenGO and MGSA, on the other hand, do not give any special weight to the direct annotations and therefore fail to detect the relatively higher enrichment of these categories.
Limitations of the MCOA method and MCOA-based enrichment analysis include a comparatively high computational complexity relative to other methods (see Additional File 14 for execution time statistics), reliance on the GenGO approach for objective function optimization through greedy search and sensitivity to the specified values of the false positive and false negative rates (variation in the p and q values can dramatically impact the number of GO terms that optimize the objective function for a given data set).
Although the discussion and examples in this paper have primarily focused on the use of the MCOA method for enrichment analysis, the same general approach can be used to support other ontology-based analysis applications, such as:
Semantic similarity clustering: Semantic similarity algorithms that use the information content of classes (e.g., Resnik ) can be modified to use information rank instead.
Ontology evaluation: Similar to the modification of semantic similarity algorithms, existing statistical ontology evaluation approaches that leverage information content (e.g., Alterovitz et al) can be modified to use the MCOA-based information rank. The underlying steady state probabilities can also be employed to weight class-specific evaluation metrics when computing overall ontology evaluation scores.
Ontology-driven information retrieval. If the Markov chain is created such that state transitions flow from classes in the ontology to instances, instance-level steady-state probability values can be computed that quantify the importance of each instance relative to the classes in the ontology.
Ontology comparative analysis. If state transitions flow from the classes, through a set of associated instances and into the classes in another ontology, it becomes possible to use the MCOA method to quantify the importance of one set of classes relative to another set of classes based on the annotations of a common dataset. Comparative analysis of multiple ontologies can also be enabled through non-hierarchical relationships between the classes in one ontology and the classes in another ontology.
Biomedical ontologies have become increasingly critical for the analysis, retrieval and integration of large and complex datasets. Of particular importance are applications, such as enrichment analysis, that measure the importance of ontology classes relative to a collection of domain data. Current analysis methods, however, remain limited in their ability to detect and accurately quantify a range of complex structural features at the ontological and dataset levels. To help address these challenges, we developed the Markov Chain Ontology Analysis (MCOA) methodology and used this method to create the MCOA extension of the GenGO enrichment analysis approach.
The core MCOA method can detect structural features including class overlaps, continuous data values, relationships between data instances, semantic distance and sparse data that are difficult to detect using standard annotation frequency analysis. In benchmarking studies on simulated Escherichia coli, Drosophila melanogaster and Homo sapiens datasets highlighting the complexities of biomedical data, the MCOA enrichment analysis method provides the best performance of comparable state-of the-art Gene Ontology enrichment methods. On real experimental data, MCOA has been shown to provide specific, non-redundant and scientifically valid results.
As next steps, we plan to conduct benchmarking on datasets that capture a wider range of analytical challenges (e.g., continuous weights and inter-instance relationships), use the MCOA enrichment analysis method to analyze and interpret additional experimental data sets, and perform enrichment against ontologies other than the Gene Ontology. We also plan to explore the use of the MCOA information rank value for applications that have traditionally employed information content, such as ontology evaluation and semantic similarity clustering.
An implementation of the MCOA-based enrichment analysis tool can be accessed at the project homepage .
This work was supported by an anonymous foundation and the Harvard Catalyst | The Harvard Clinical and Translational Science Center (NIH Grant #1 UL1 RR 025758-01 and financial contributions from Harvard University and participating academic health care centers). We thank the anonymous reviewers for their insightful comments and suggestions.