How to decide which are the most pertinent overly-represented features during gene set enrichment analysis
© Barriot et al; licensee BioMed Central Ltd. 2007
Received: 05 December 2006
Accepted: 11 September 2007
Published: 11 September 2007
The search for enriched features has become widely used to characterize a set of genes or proteins. A key aspect of this technique is its ability to identify correlations amongst heterogeneous data such as Gene Ontology annotations, gene expression data and genome location of genes. Despite the rapid growth of available data, very little has been proposed in terms of formalization and optimization. Additionally, current methods mainly ignore the structure of the data, which causes redundancy in the results. For example, when searching for enrichment in GO terms, genes can be annotated with multiple GO terms and should be propagated to the more general terms in the Gene Ontology. Consequently, the gene sets often overlap partially or totally, and this causes the reported enriched GO terms to be both numerous and redundant, overwhelming the researcher with non-pertinent information. This situation is not unique: it arises whenever some hierarchical clustering is performed (e.g. based on the gene expression profiles), the extreme case being when genes that are neighbors on the chromosomes are considered.
We present a generic framework to efficiently identify the most pertinent over-represented features in a set of genes. We propose a formal representation of gene sets based on the theory of partially ordered sets (posets), and give a formal definition of target set pertinence. Algorithms and compact representations of target sets are provided for the generation and the evaluation of the pertinent target sets. The relevance of our method is illustrated through the search for enriched GO annotations in the proteins involved in a multiprotein complex. The results obtained demonstrate the gain in terms of pertinence (up to 64% redundancy removed), space requirements (up to 73% less storage) and efficiency (up to 98% fewer comparisons).
The generic framework presented in this article provides a formal approach to adequately represent available data and efficiently search for pertinent over-represented features in a set of genes or proteins. The formalism and the pertinence definition can be directly used by most of the methods and tools currently available for feature enrichment analysis.
The combination of sequencing and post-sequencing approaches, together with annotation efforts and in silico analysis, has produced a tremendous amount of available biological data and knowledge. As technologies evolve, the production of raw data is now becoming daily routine. While transcriptomics produces lists of differentially expressed or co-regulated genes, proteomics produces lists of proteins that are differentially expressed, that carry unusual post-translational modifications or that interact to form a complex. The characterization of those sets of genes or proteins in the light of all available knowledge is therefore a crucial task for biological researchers and computational biologists.
To characterize sets of genes or proteins, many tools and methods have been developed (see  for a review of most of them) and their main principle is to look for over-represented or enriched features.
Undoubtedly, the key to the success of this technique is its ability to confront heterogeneous data: the set of genes of interest can be compared to the sets of genes i) having the same annotation (e.g. Gene Ontology  or keywords from UniProt ), ii) involved in the same pathway (e.g. KEGG Pathways ), iii) co-cited in the literature, iv) co-localized on the chromosome, and so on.
Despite the variety of methods proposed to perform such an analysis, very little has been done in terms of formalization, and this has unfortunate consequences. First, computationally, the lack of formalism offers very few possibilities for reusable optimizations, causing a waste of resources (tool developers' time, computation costs and storage space). Considering the growth rate of data, this might soon become an issue. Second, and more importantly for the users, current methods generally ignore the structure of the confronted data, which leaves the user with numerous enriched features of varying relevance to manually filter and synthesize.
Based on this formalism, we are able to face the problem of the relevance of the enriched features and the structure of the data, for which we introduce the concept of the pertinence of target sets in the context of a given query set of genes. In our synthetic example of figure 1, we observe a redundancy in the hits composing the results: RNA splicing, RNA processing, mRNA metabolic process, etc., are all reported. RNA splicing matches exactly the content of our query, and therefore only this particular hit should be presented to the user. The other hits are due to the hierarchical structure of the annotations and should be omitted: RNA processing (directly above RNA splicing in the ontology) appears in the results because it includes RNA splicing and another gene (the gene a) not present in the query, which should make this target set not pertinent. Similarly, regulation of RNA splicing (directly below RNA splicing) appears because it is included in RNA splicing but contains fewer genes of our query, and this, again, should make this target set not pertinent. These observations allow us to formally define the pertinence of a target set in the context of partially ordered sets. For simplicity here, the target set matches exactly the query, but this is rarely the case, and more than one target set can be pertinent, as we explain later. Interestingly, this definition of pertinence holds for the various dissimilarity indices used by current methods (hypergeometric distribution or Fisher's exact test, binomial distribution, χ², and percentage). Having a formal pertinence definition, we solve a classical query optimization problem involving a time-space trade-off and an early pattern evaluation: instead of storing a very large number of target sets (possibly infeasible), we need to generate only the interesting ones on the fly from a less explicit representation.
In this paper, we present algorithms working on compact representations of the data for the generation and evaluation of pertinent target sets. Compact representations exploit the structure of the data (i.e. the set inclusions) and algorithms efficiently use rules derived from the pertinence definition.
Results and discussion
In this section, we first provide formal definitions of neighborhoods (i.e. feature data) and target sets pertinence. Then, we introduce an algorithm for the identification and the comparison of pertinent target sets in a DAG when all the set compositions are directly available (like in figure 1b where sets corresponding to DAG nodes are explicitly stored). Next, we propose a generic compact representation of neighborhoods and detail the adaptation of the previous algorithm in this context. The rest of this section focuses on specific representations and algorithms relying on the DAG properties that lead to further time and space optimizations.
Uniform representation of data: DAGs defining sets partially ordered by the inclusion relation
We will denote S the set of objects considered in the remainder of this paper. For example, S can be the set of proteins of an organism. We consider that a neighborhood is a set N (of sets of elements of S) partially ordered by the inclusion relation ⊆. Partially ordered sets (posets) are generally represented by Hasse diagrams in which there is an edge from y to x if and only if y covers x (denoted x ≺ y). This means that x ⊂ y and there is no other element z such that x ⊂ z ⊂ y (see  for more details).
In the following, we consider that a neighborhood is a Hasse diagram (the DAG of figure 1b) that defines a poset (N, ⊆).
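To make the covering relation concrete, the following minimal Python sketch derives the Hasse diagram edges from a family of explicitly stored sets. The function name `hasse_edges` and the toy neighborhood (loosely modeled on the synthetic example of figure 1, with made-up set compositions) are assumptions introduced for illustration only.

```python
def hasse_edges(sets):
    """Return the cover pairs (x, y): x is strictly included in y
    and no other set z of the family lies strictly between them."""
    names = list(sets)
    edges = []
    for x in names:
        for y in names:
            if sets[x] < sets[y]:  # strict inclusion between frozensets
                between = any(sets[x] < sets[z] < sets[y] for z in names)
                if not between:
                    edges.append((x, y))
    return edges


# Hypothetical toy neighborhood (set compositions are illustrative).
toy = {
    "regulation of RNA splicing": frozenset("bc"),
    "RNA splicing": frozenset("bcde"),
    "RNA processing": frozenset("abcde"),
}

for child, parent in hasse_edges(toy):
    print(f"{parent} covers {child}")
```

This quadratic construction is only meant to illustrate the definition; it keeps the edge between 'regulation of RNA splicing' and 'RNA splicing' but not the transitive one to 'RNA processing'.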
Target sets pertinence
Several methods and tools use a similarity or dissimilarity index to compare sets; we can cite, amongst others, FunSpec , BlastSets , GOStat , EASE , PANDORA , aBandApart , goCluster (see  for a review). Even though these methods use various similarity indices to compare the query and target sets (hypergeometric, binomial, χ², Fisher's exact test, or percentages), they all have in common that they consider only counts of elements such as the sizes of the query and target sets, or the number of common and differing elements. Thus, when comparing two sets, the bigger the number of common elements and the smaller the number of differing elements, the more similar they are considered.
Formally, this corresponds to any similarity index F between a query set Q and a target set T, such that F(Q, T) increases with |T ∩ Q| and decreases with |T - Q|. Then, given such a similarity index for the comparison of a given query set to a neighborhood, it is not necessary to perform the comparisons with all the target sets in the neighborhood. We introduce the notion of pertinence of a target set for its comparison to a given query set, which allows us to consider target sets that are likely to have good similarity values (elements in common with the query) and to ignore target sets that will give redundant results (not different enough from other target sets because of the set composition dependencies in the Hasse diagram representing the neighborhood).
Mathematically, the observation above is written simply as follows:
Definition. A target set T in a neighborhood N is pertinent for its comparison to a given query set Q if and only if:
T ∩ Q ≠ ∅ (1)
∄T' ∈ N such that T' ⊂ T and T' ∩ Q = T ∩ Q (2)
∄T' ∈ N such that T ⊂ T' and T' - Q = T - Q (3)
This mathematical definition suggests that one must test all possible T' to decide if a target set T is pertinent. However, it is easy to deduce that only the parent and child nodes of T in the Hasse diagram representing the neighborhood N must be checked. Let us suppose that such a T' exists for (2) (resp. (3)). Then, due to the inclusion relation between T and T', all the sets in N on the path linking T and T' also satisfy (2) (resp. (3)), and then especially a child node (resp. a parent node) of T.
As only the parent and child nodes of T need to be considered, the test of pertinence can be performed on the number of common and differing elements. This is because, given the inclusion relation, equal numbers imply that the elements themselves are the same.
As a result, the mathematical definition can be simplified into the following 3 rules (illustrated in figure 3) that are more suitable for the design of an efficient algorithm:
Rule 1: |T ∩ Q| ≠ 0
Rule 2: ∄T' such that T' ≺ T and |T ∩ Q| = |T' ∩ Q|
Rule 3: ∄T' such that T ≺ T' and |T - Q| = |T' - Q|
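The three rules can be sketched in a few lines of Python when the set compositions are explicitly available. This is a minimal illustration, not the article's Algorithm 1: for brevity it scans all strict subsets and supersets instead of only the child and parent nodes (equivalent for the decision, by the argument given above), and the names `pertinent_sets`, `neighborhood` and `query` as well as the toy set compositions are assumptions.

```python
def pertinent_sets(sets, query):
    """Names of the target sets that satisfy Rules 1 to 3 for the query."""
    query = frozenset(query)
    result = []
    for name, target in sets.items():
        common = len(target & query)
        differing = len(target - query)
        # Rule 1: the target shares at least one element with the query.
        if common == 0:
            continue
        # Rule 2: no strictly smaller set has as many common elements.
        if any(o < target and len(o & query) == common for o in sets.values()):
            continue
        # Rule 3: no strictly larger set has as few differing elements.
        if any(target < o and len(o - query) == differing for o in sets.values()):
            continue
        result.append(name)
    return result


# Hypothetical toy neighborhood and query (illustrative compositions).
neighborhood = {
    "regulation of RNA splicing": frozenset("bc"),
    "RNA splicing": frozenset("bcde"),
    "RNA processing": frozenset("abcde"),
}
query = {"b", "c", "d", "e"}
print(pertinent_sets(neighborhood, query))
```

On this toy input, 'RNA processing' is discarded by Rule 2 (its subset 'RNA splicing' has the same four common elements) and 'regulation of RNA splicing' by Rule 3 (its superset 'RNA splicing' has the same number of differing elements, none), leaving only 'RNA splicing'.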
Structures and algorithms
Algorithm for the identification and the comparison of pertinent target sets in the explicit representation
Algorithm 1 assumes that the set compositions are available. In the following, we introduce compact representations of neighborhoods and algorithms efficiently working on such representations.
Generic compact representation of neighborhoods
Algorithm for the identification of pertinent target sets in the generic compact representation
Like in Algorithm 1, the principle is to start from the leaves representing query elements and traverse the DAG to search for pertinent target sets among the sets sharing elements with the query. The difficulty we have to solve is that the set compositions are not available. We thus (i) store the size of the set corresponding to a node, as illustrated in figure 6, (ii) order the nodes in the queue by their size and (iii) propagate the common elements during the search.
In order to test Rule 2, we need the number of common elements of the node processed and its child nodes. With the breadth-first order, the node 'mRNA metabolic process' of size 6 of figure 6 would have been processed before the node of size 4 'RNA splicing' (shorter path from the leaves), and thus the common element b would not have been propagated yet. The solution is to maintain the queue ordered by the set sizes to ensure that sets smaller than the node processed (this includes all its descendants) have already been processed. This way, all the common elements have been propagated and Rule 2 can be tested at the level of the node processed.
In order to test Rule 3, we need the number of differing elements of the node processed and its parent nodes. Unfortunately, this number is not available at this time for the parent nodes because all the common elements may not have been propagated to the parent nodes yet. For example, the node 'RNA processing' (size 5) in figure 6 is processed before the element g has been propagated to the node 'RNA metabolic process' of size 7. The solution is to consider the node processed as the parent node and test whether Rule 3 is violated for its child nodes; that is, we test the pertinence of its child nodes.
As a result, the pertinence decision is divided in two steps:
Step 1: Rule 2 is tested at the level of the node processed.
Step 2: Rule 3 is tested at the level of the child nodes of the node processed.
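The two-step decision can be sketched as follows in Python. This is a simplified illustration under stated assumptions, not the article's Algorithm 2: the DAG is given as hypothetical `parents` and `children` dictionaries, `size` stores the set size of each internal node, leaves are the elements themselves and are not treated as target sets, and distinct nodes are assumed to represent distinct sets so that every descendant is strictly smaller than its ancestors.

```python
import heapq


def pertinent_compact(parents, children, size, query):
    """Pertinent internal nodes, found without explicit set compositions."""
    common = {}                 # node -> query elements propagated so far
    heap, queued = [], set()

    def push_up(node, elements):
        # Propagate common elements to the parents and enqueue them by size.
        for p in parents.get(node, []):
            common.setdefault(p, set()).update(elements)
            if p not in queued:
                queued.add(p)
                heapq.heappush(heap, (size[p], p))

    for q in query:             # start from the leaves of the query elements
        push_up(q, {q})

    passes_rule2 = {}
    violates_rule3 = set()
    while heap:
        _, node = heapq.heappop(heap)   # smallest set first
        n_common = len(common[node])
        n_differing = size[node] - n_common
        kids = [c for c in children.get(node, []) if c in common]
        # Step 1: Rule 2 at the level of the node processed.
        passes_rule2[node] = all(len(common[c]) < n_common for c in kids)
        # Step 2: Rule 3 at the level of its child nodes.
        for c in kids:
            if size[c] - len(common[c]) == n_differing:
                violates_rule3.add(c)
        push_up(node, common[node])
    return sorted(n for n, ok in passes_rule2.items()
                  if ok and n not in violates_rule3)


# Hypothetical toy DAG (a chain of four terms with elements a..g as leaves).
parents = {"b": ["reg. of RNA splicing"], "c": ["reg. of RNA splicing"],
           "d": ["RNA splicing"], "e": ["RNA splicing"],
           "a": ["RNA processing"], "f": ["RNA metabolic process"],
           "g": ["RNA metabolic process"],
           "reg. of RNA splicing": ["RNA splicing"],
           "RNA splicing": ["RNA processing"],
           "RNA processing": ["RNA metabolic process"]}
children = {"reg. of RNA splicing": ["b", "c"],
            "RNA splicing": ["reg. of RNA splicing", "d", "e"],
            "RNA processing": ["RNA splicing", "a"],
            "RNA metabolic process": ["RNA processing", "f", "g"]}
size = {"reg. of RNA splicing": 2, "RNA splicing": 4,
        "RNA processing": 5, "RNA metabolic process": 7}
print(pertinent_compact(parents, children, size, {"b", "c", "d", "e"}))
```

Because a node is only enqueued once it has received common elements, every parent of a node sharing elements with the query is eventually processed, so the Rule 3 verdict on a child is final when the queue is empty.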
Specific representations of neighborhoods
The previous compact representation is general and can be used for any neighborhood. Nonetheless, we identified cases where further time and space optimizations can be envisaged. It arises when:
the DAG is actually a tree (each node has only one parent). In this case, the tree can be stored as a parenthesized expression without the need to store the size of the sets for each node. Typical examples of this situation correspond to the gene expression profiles hierarchically clustered, or the IUBMB Enzyme Nomenclature , that is, sets of genes/proteins annotated with the same EC number.
the DAG obtained after building the neighborhood is implicit and thus, does not need to be stored. It corresponds for example to a correspondence analysis or a principal component analysis of the codon usage (see ) or the sets of genes that are adjacent on the chromosome. In the latter case, we only need to know the order of the genes: any pair of genes defines an interval which defines a set of adjacent genes.
In the following, we present efficient algorithms for the identification of pertinent target sets in these specific representations.
Algorithm for the identification and the comparison of pertinent target sets in the tree compact representation
The main advantage to searching pertinent target sets in a tree is that for a given node, the child nodes define a non overlapping partition of the set their parent node represents, and thus, only the number of common and differing elements need to be propagated.
Compute the number of common and differing elements corresponding to this node by summing up the values of the triplets contained in the top stack.
Test Rule 2 for current node: if the number of common elements is bigger than that of each of the triplets contained in the top stack, then the tag is set to potentially pertinent (not pertinent otherwise).
Test Rule 3 for child nodes: if child nodes tagged potentially pertinent have fewer differing elements than the current node, then they are pertinent and the comparison is performed.
Replace the top stack by the triplet of computed values.
Stop if all the query elements are included or if the target set size exceeds max_target_size.
Compared to the previous algorithm, this one avoids (i) the merging of the common elements for each node and (ii) the extraction of the next element of the queue. The tree is composed of |S| leaves and at most |S| - 1 internal nodes; thus, the worst-case time complexity of this algorithm is O(|S|).
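The steps above can be sketched in Python as a post-order traversal. Several choices here are assumptions made for illustration: nested tuples stand in for the parenthesized expression, recursion replaces the explicit stack of triplets, leaves are treated as singleton target sets, the early stopping criteria are omitted, and the member sets are carried along only so that the result can be printed (the decision itself uses nothing but the counts).

```python
def pertinent_in_tree(tree, query):
    """tree: nested tuples with string leaves. Returns pertinent nodes as frozensets."""
    query = frozenset(query)
    pertinent = []

    def visit(node):
        # Returns (members, common, differing, potentially_pertinent).
        if isinstance(node, str):
            c = 1 if node in query else 0
            return frozenset([node]), c, 1 - c, c > 0
        kids = [visit(child) for child in node]
        members = frozenset().union(*(k[0] for k in kids))
        # Child nodes partition their parent, so the counts simply add up.
        common = sum(k[1] for k in kids)
        differing = sum(k[2] for k in kids)
        # Rule 2 for the current node: strictly more common elements
        # than each of its child nodes.
        tag = common > 0 and all(k[1] < common for k in kids)
        # Rule 3 for the child nodes: a potentially pertinent child must
        # have fewer differing elements than the current node.
        for child_members, _, child_differing, child_tag in kids:
            if child_tag and child_differing < differing:
                pertinent.append(child_members)
        return members, common, differing, tag

    members, _, _, tag = visit(tree)
    if tag:  # the root has no parent, so Rule 3 cannot be violated
        pertinent.append(members)
    return pertinent


# Hypothetical dendrogram over elements a..g.
tree = ((("b", "c"), ("d", "e")), ("a", ("f", "g")))
print(pertinent_in_tree(tree, {"b", "c", "d", "e"}))
```

On this toy dendrogram the only pertinent node is the cluster {b, c, d, e}: its sub-clusters are discarded by Rule 3 and the root by Rule 2.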
Algorithm for the identification and the comparison of pertinent target sets in implicit compact representations
An implicit representation requires us to provide a specific algorithm for each different implied DAG. However, this loss in genericity allows considerable time saving in the search for pertinent target sets and considerable space savings for storing the neighborhoods. We have chosen to present the sets of adjacent genes on the chromosome because it can be described very simply and briefly, leads to a straightforward algorithm, and was often encountered in our experiments.
For our illustration, we only need to store the genes in the order they appear on the chromosome. Thus, the space requirement is Θ(|S|), instead of the Θ(|S|²) needed for the DAG representation.
To identify pertinent target sets, we only need to know the position on the chromosome of each of the query elements. Then, each pair of positions defines a lower and an upper bound of an interval that, in turn, defines a set. For such a set to be pertinent, the bounds of the interval must be such that the position just before (resp. after) the lower (resp. upper) bound is not an element of the query, since this would violate Rule 3. Rule 2 holds because the bounds correspond to query elements. The worst-case time complexity of the resulting algorithm is O(|Q|²), Q being the query set. Compared to Algorithm 2 working on the generic compact representation, this algorithm spares (i) the merging of common elements and (ii) the extraction of the next element of the queue.
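A minimal Python sketch of this interval enumeration is given below. The function name `pertinent_intervals`, the returned tuple layout and the toy gene order are assumptions for illustration; only intervals bounded by two distinct query genes are enumerated, following the "pair of positions" wording above.

```python
def pertinent_intervals(gene_order, query):
    """Return (first gene, last gene, common, differing) for each pertinent interval."""
    position = {gene: i for i, gene in enumerate(gene_order)}
    qpos = sorted(position[g] for g in query if g in position)
    in_query = set(qpos)
    hits = []
    for i in range(len(qpos)):
        for j in range(i + 1, len(qpos)):   # O(|Q|^2) pairs of bounds
            lo, hi = qpos[i], qpos[j]
            # Rule 3: if the gene just outside a bound belongs to the query,
            # the wider interval has the same number of differing elements.
            if (lo - 1) in in_query or (hi + 1) in in_query:
                continue
            common = j - i + 1              # query genes inside [lo, hi]
            differing = (hi - lo + 1) - common
            hits.append((gene_order[lo], gene_order[hi], common, differing))
    return hits


# Hypothetical chromosome with eight genes and a three-gene query.
chromosome = list("abcdefgh")
print(pertinent_intervals(chromosome, {"b", "c", "e"}))
```

Here the interval from c to e is skipped because b, just before c, is a query gene; the intervals b..c and b..e remain and can then be scored with the chosen similarity index.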
Testing and validation
In this section, we illustrate the gain in storage space, number of comparisons performed and quality of the results (redundancy reduction) through a typical search of Gene Ontology  annotations enrichment in sets of proteins corresponding to multi-protein complexes, and compare the results obtained with and without considering the pertinence of target sets. For convenience, we used the BlastSets system  to obtain these results because it allows all the sets of a neighborhood (protein complexes) to be used as query sets to be searched for feature enrichment (Gene Ontology annotations), but any of the previously cited methods and tools may be used instead.
The query sets of proteins correspond to protein complexes of the yeast Saccharomyces cerevisiae referenced at the MIPS  (version 14112005 filtered against a list of validated open reading frames from the GDR Genolevures ). The motivation for this choice is that once the proteins involved in a complex are identified, the next step is often to search for a molecular function or a biological process with which to annotate the newly grouped set of proteins. Moreover, the yeast proteome is well annotated with 4 211, 4 936 and 5 451 annotated gene products (among about 6 000) respectively for the molecular function, biological process and cellular component branch of the Gene Ontology (source: ). We extracted 1062 query sets of proteins, one for each protein complex.
the DAG of the Gene Ontology is the generic compact representation of the neighborhood,
to the previous DAG, we add (leaf) nodes corresponding to proteins, and connect them as child nodes for each GO term they are annotated with,
we recursively traverse the DAG in a bottom-up fashion to compute the size of the set corresponding to each node.
This construction implies that when a protein is annotated with a GO term, all GO terms on the paths from this term to the root (more general terms) are also annotating this protein.
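The size computation can be sketched in Python as follows. Note that in a DAG the sizes of child nodes cannot simply be summed, since a protein reachable through several paths would be counted more than once; this sketch therefore propagates each protein to all the ancestors of its annotated terms and counts distinct members, which is one possible way (an assumption, not necessarily the authors' implementation) to obtain the sizes. The names `term_sizes`, `term_parents` and `annotations` and the toy data are hypothetical.

```python
def term_sizes(term_parents, annotations):
    """term_parents: term -> parent terms; annotations: protein -> annotated terms."""
    members = {}
    for protein, terms in annotations.items():
        stack, seen = list(terms), set()
        while stack:                      # walk up to the root(s)
            term = stack.pop()
            if term in seen:
                continue
            seen.add(term)
            members.setdefault(term, set()).add(protein)
            stack.extend(term_parents.get(term, []))
    return {term: len(proteins) for term, proteins in members.items()}


# Hypothetical fragment: two paths lead from 'RNA splicing' to the root.
term_parents = {
    "RNA splicing": ["RNA processing", "mRNA metabolic process"],
    "RNA processing": ["RNA metabolic process"],
    "mRNA metabolic process": ["RNA metabolic process"],
}
annotations = {"b": ["RNA splicing"], "c": ["RNA splicing"],
               "a": ["RNA processing"], "f": ["mRNA metabolic process"]}
print(term_sizes(term_parents, annotations))
```

In this toy fragment the root has size 4, whereas summing the sizes of its two child terms (3 + 3) would count the proteins b and c twice.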
Overall validation performances
Each of the 1 062 protein complexes served as a query set of proteins, and each was searched for similar sets in the neighborhoods constructed from the three Gene Ontology branches. The threshold for set similarity significance was set to 0.05. This corresponds to the probability of obtaining a similarity value (here F is the hypergeometric distribution) at least as good by submitting a random set of the same size, see  for more details.
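As an illustration of a hypergeometric similarity index, the following Python sketch computes the upper-tail probability of observing at least a given number of common elements between a query and a target set. It only shows the raw tail probability; the calibration of the 0.05 threshold described in the cited reference is not reproduced, and the function and parameter names are assumptions.

```python
from math import comb


def hypergeom_pvalue(n_population, n_target, n_query, n_common):
    """P(X >= n_common) when n_query elements are drawn without replacement
    from n_population elements, n_target of which belong to the target set."""
    upper = min(n_target, n_query)
    total = comb(n_population, n_query)
    tail = sum(
        comb(n_target, k) * comb(n_population - n_target, n_query - k)
        for k in range(n_common, upper + 1)
    )
    return tail / total


# Hypothetical numbers: 6000 proteins, a target set of 150, a query of 36
# proteins of which 10 fall in the target set.
print(hypergeom_pvalue(6000, 150, 36, 10))
```

Such an index depends only on counts, and for fixed population and query sizes it improves when the number of common elements grows or the number of differing elements shrinks, which is the property the pertinence definition relies on.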
Storage space requirements results
explicit size in MU (figure 1b)
generic size in MU (figure 6)
1,007,838
2,137,806
common elements comp.
pertinent target sets comp.
Redundancy reduction results
ratio pertinent hits/hits
Typical outcome for a protein complex
Sets found similar to the MIPS '440.30.10 mRNA splicing' protein complex
nuclear mRNA splicing, via spliceosome
RNA splicing, via transesterification reactions with bulged adenosine as nucleophile
RNA splicing, via transesterification reactions
protein complex assembly
nuclear mRNA splicing via U2-type spliceosome
U2-type spliceosome dissembly
U2-type nuclear mRNA branch site recognition
nuclear mRNA branch site recognition
spliceosomal conformational changes to generate catalytic conformation
Among the 21 hits, only 5 are actually pertinent, 14 hits violate at least Rule 2, the test of Rule 3 not being performed, and 2 hits violate Rule 3. The target sets are sorted in order to better understand the violations of pertinence: pertinent target sets are listed first in bold, followed by target sets that are not pertinent because of the previous pertinent set. The first hit, GO:0000398, is pertinent and the following 5 hits are not as they have the same number of common elements but correspond to less specific GO terms, which violates Rule 2. The same scenario occurs for the next 2 pertinent target sets, GO:0006396 and GO:0000245, with respectively 6 and 1 following hits that violate Rule 2. Then, GO:0006374 is pertinent, and the following, GO:0000391, is not because it has the same differing elements (none) and has fewer common elements, i.e. it is less general, which violates Rule 3.
Only 34 of the 36 query elements are found together in a hit. In the general case, this may (i) highlight bad annotations or (ii) provide a hint or indications on the role of the missing elements. Here, the missing elements are the products of the genes YGL128c and YKL078w. YGL128c is annotated as 'Component of a complex containing Cef1p, putatively involved in pre-mRNA splicing'. It is currently annotated with 'biological process unknown', which explains why it is not found in the results. Interestingly, it is associated with 'spliceosome complex' in the cellular components branch, which complies with its supposed involvement in pre-mRNA splicing. Moreover, its association with a complex containing Cef1p strongly suggests that YGL128c should be annotated with GO:0000398. YKL078w is annotated as 'Predominantly nucleolar DEAH-box RNA helicase, required for 18S rRNA synthesis'. It is annotated as GO:0007046 ribosome biogenesis of the biological process branch of the GO. This term corresponds to a set of size 150 that is not part of the results, that is, it is not significantly similar to 440.30.10. In , the authors state that YKL078w is not required in pre-mRNA splicing, but it is required for pre-rRNA cleavage (18S rRNA synthesis), and thus its GO annotation is consistent.
Comparison to other methods
Since 2001, several methods (more than 15) have been successfully applied to search for over-represented features based on a dissimilarity index to compare a query set to target sets (most of those are reviewed in ). More recently, alternative or complementary approaches have been developed, mainly to find more relevant features or to combine multiple features. Hereafter, we discuss and compare our methods to the major trends in the field.
Frequent itemset mining
The problem of identifying pertinent target sets resembles the frequent or closed itemset mining problem (see  for details) in many aspects. Unfortunately, the methods developed for frequent itemset mining cannot be applied to our context. Indeed, these methods rely on the anti-monotonic property of the support function (minimum frequency of the itemsets in the database). In our situation, the pertinence test is not anti-monotonic: a non pertinent target set can have ancestors that are pertinent. As a result, the pertinence definition cannot be used in the same way to prune the search. Moreover, closed itemsets permit the generation of all frequent itemsets contrary to pertinent target sets (all their subsets are not necessarily pertinent and may also not be present in the neighborhood).
Dissimilarity index based over-representation methods and tools
The methods we describe in this article improve the global quality of the results found by using a statistical test to decide the over-representation of a given feature in a given set (reviewed in ). Depending on the test performed (hypergeometric, binomial, χ², ...) and the correction for multiple testing (Bonferroni, false discovery rate, ...), the set of over-represented features will vary in size but the top features (very significant p-values) will remain essentially the same. Among those features, we showed both mathematically and with biological results that many are actually redundant and non-informative. As it is based on the same principles (dissimilarity index and adjustment for multiple testing), our method can only perform at least as well as those others. For example, we submitted the query set of proteins of the complex 440.30.10 to GOStats  and we obtained 24 significantly enriched GO terms in the biological processes branch among which 7 are not in table 4. The observed differences (additional hits) are due to different versions of the data and to the multiple testing adjustment method (more low similarity hits). The additional hits are related to the pertinent hits found in table 4 and do not bring much additional insight into the biological function of our query set.
An original approach exploiting the GO structure was proposed in . Like us, they consider the GO as a partially ordered set and work on the DAG. Their method and ours diverge due to the dissimilarity index. To score target sets, they define a pseudo distance which can be stated roughly as the average distance between the genes of the query and the target GO term. While this approach is formal and applicable to ontologies in general, it suffers some significant limitations. First, computationally, for each query set they need to score all the GO terms. Second, statistically, because of the use of a distance, they are only able to rank the GO terms and cannot assess the significance of the results. And finally, they also encounter the problem of redundancy and pertinence of the results. They partially address it by finding, among the top ranked GO terms, the ones that are not comparable (i.e. the ones that should bring more non-redundant information).
Information content based methods
An interesting approach has been proposed by  to take into account the GO hierarchy. The difficulty to address when dealing with the GO hierarchy is that the level of a GO term in the hierarchy does not reflect the degree of specificity of this term. As a result, the degree of specificity (GO level) at which to look for enrichment should not be specified in the query because it can lead to misleading results or missed discoveries. In , the authors propose an information theoretic approach that allows one to specify the degree of specificity desired for the enriched features. This is done by splitting the graph of the Gene Ontology into subgraphs. The split is such that the resulting subgraphs (partition of the GO terms) contain comparable information content, i.e. they concern the same number of genes. To illustrate their approach, they analyze the 'MAP00190 oxidative phosphorylation' set of proteins corresponding to GenMAPP proteins involved in oxidative phosphorylation. For this analysis, the Gene Ontology was split into 6 partitions among which a clear enrichment in 'transport' is visualized. In contrast, a corresponding GO biological process levelwise analysis performed at depth 2 exhibits visual enrichments in 'cellular process' and 'physiological process', which is misleading.
Pertinent sets found similar to the 'MAP00190 oxidative phosphorylation' consisting of GenMAPP proteins involved in oxidative phosphorylation
generation of precursor metabolites and energy
copper ion transport
copper ion homeostasis
regulation of pH
mitochondrial electron transport, NADH to ubiquinone
With our method, it is not possible to specify a given degree of specificity as with their tool GOPaD . However, similar results can be obtained by constructing other neighborhoods for the Gene Ontology that would correspond to different levels of specificity or information content. An alternative solution can also be to use GO Slim (reduced version of the GO aimed at giving an overview of its content) instead of the whole Gene Ontology. More generally, by looking at all the levels of the GO hierarchy, our method successfully identifies pertinent target sets, which automatically selects the most relevant levels to look at. Besides, our method is more generic in the sense that it can be applied to any hierarchically defined sets. For example, it can be applied to the hierarchical clustering of gene expression profiles which results in a dendrogram (figure 2a) or the gene localization on chromosomes (figure 2b). In those cases, an information content method is of no help because the degree of specificity needs to be specified a priori and this is typically not known.
Integration of multiple data sources
Another trend in the field is the search for enrichment in combinations of features by the use of multiple data sources. The direct approach consists in intersecting target sets as proposed in  where target sets of genes with composite GO annotations are obtained. This makes it possible to find enrichments that are significant for the composite annotation (e.g. 'cation transport' and 'ATPase activity') while not being enriched in the original annotations (i.e. 'cation transport' alone or 'ATPase activity' alone). A similar approach has been proposed in  where frequent co-annotations (keywords, GO terms, and KEGG pathways) are mined. The principle is to search for frequent itemsets in the features of the query genes (features co-occurring frequently), and then to look at the significance of the enrichment in the combined features. Alternatively, the converse approach consists in the addition of GO term relationships such as is-involved-in as proposed in . The principle is to augment the Gene Ontology to connect terms from different branches (sub-ontologies) to reflect the fact that a molecular function is involved in a biological process which takes place in a cellular component.
Although the enrichment of combinations of features is not addressed in this paper, similar results can be obtained by manipulating the neighborhoods. For example, it is possible to combine the GO biological process and the GO molecular function neighborhoods by adding nodes corresponding to the set intersections (composite annotations) to the Hasse diagram representing the neighborhood. Similarly, the augmentation of neighborhoods such as the additional Gene Ontology layer proposed in  can be achieved by adding the corresponding edges between GO terms and by propagating the gene products through the newly created paths. This could prove useful as we have seen in the results obtained for complex 440.30.10 with the gene YGL128c that the GO annotations are sometimes missing in a particular branch whereas present in another one.
Numerical features can be very interesting to consider for feature enrichment. For example, they can be used to discover that some genes of a query set are surprisingly close to each other on a chromosome, or that all the molecular weights of the query proteins fall within a surprisingly small range. To our knowledge, our approach is the only one capable of searching for enrichments in numerical features such as the gene localization on chromosomes (see figure 2b and the results section on implicit compact representations). This might be because (i) it is inefficient and sometimes unfeasible to store and compare all the sets corresponding to adjacent genes and (ii) the redundancy in the results (if not filtered for pertinence) makes them unexploitable.
In this article, we addressed the problem of the characterization of a set of genes or proteins by finding pertinent over-represented features. The key advances presented here are a formalism for representing and manipulating the data to be searched, and the introduction of the concept of target set pertinence and its formal definition. The choice of partially ordered sets as a formal representation was naturally driven by the generalization of the concept of neighborhood between genes or proteins: biological relationships (e.g. similar expression profile, similar function, similar annotation) group genes or proteins into sets of neighbors, which can be nested. These foundations exhibit their strength in many aspects. First, they make it possible to take into account the structure of the data and get rid of the non-informative results. Second, their generic and universal aspect makes them directly usable by most of the current methods and tools (the pertinence definition holds for most of the dissimilarity indices in use). Third, they provide a solid basis on which to develop optimized structures and algorithms such as those presented in this article: a generic compact representation applicable to any neighborhood, a specific compact representation for trees (e.g. hierarchical clustering of gene expression profiles), and an example of an implicit compact representation for gene location on chromosomes. The validation was performed by searching enriched GO annotations in 1062 protein complexes. The performances observed clearly show the usefulness of our approach: in terms of resources, we were able to save up to 73% storage for the data and to avoid up to 98% of the comparisons performed between sets during the search. More importantly, we observed up to 64% of statistically significant enriched features that were actually not pertinent and that should be discarded.
This means that biological researchers and computational biologists will be presented with far fewer results to interpret, making the characterization of gene sets faster, safer and easier.
In this article, we illustrated our methods with examples of sets of genes and proteins, but they can be applied to other data as well: formally, our approach is already general because it only considers elements of a finite set. Good candidate data sets should exhibit a finite set S of elements and various neighborhood relationships. These relationships can be inferred, for example, from a many-to-many relation between elements of S and elements of another set, or from a hierarchical structure together with a relation associating elements of S with nodes of the hierarchy. For example, with the growing number of complete genomes available, it would be interesting to build sets of genomes based on various neighborhood relationships and to test which features result in similar groupings.
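The two ways of inferring a neighborhood mentioned above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the input formats (a list of element and feature pairs, and a mapping from each node to its parent nodes) and the toy identifiers are hypothetical and do not come from the implementation described in this article.

```python
from collections import defaultdict


def sets_from_relation(pairs):
    """Many-to-many relation: one set of elements of S per related feature."""
    neighborhood = defaultdict(set)
    for element, feature in pairs:
        neighborhood[feature].add(element)
    return dict(neighborhood)


def sets_from_hierarchy(parents, attached):
    """Hierarchy: each element is propagated to its nodes and all their ancestors.

    parents: node -> iterable of parent nodes (roots may be absent from the map).
    attached: element of S -> iterable of nodes it is directly associated with.
    """
    neighborhood = defaultdict(set)
    for element, nodes in attached.items():
        stack, seen = list(nodes), set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            neighborhood[node].add(element)
            stack.extend(parents.get(node, ()))
    return dict(neighborhood)


# Hypothetical toy data: three elements, a flat relation and a small hierarchy.
by_feature = sets_from_relation([("e1", "f1"), ("e2", "f1"), ("e2", "f2")])
by_node = sets_from_hierarchy(
    parents={"leaf_a": ["mid"], "leaf_b": ["mid"], "mid": ["root"]},
    attached={"e1": ["leaf_a"], "e2": ["leaf_b"], "e3": ["mid"]},
)
```

In the hierarchical case the resulting sets are nested by construction (here the sets of `leaf_a` and `leaf_b` are included in that of `mid`), which is the situation in which redundancy among enriched target sets arises.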
The methods presented in this paper naturally lead to a new challenge: the identification of similar sets between two neighborhoods. Such a task is of utmost importance as it would allow large amounts of data to be analyzed nearly automatically. For example, for gene expression data (as in figure 2a), the clusters of co-expressed genes would be matched to pertinent Gene Ontology terms. A naive solution would generate all the sets of one neighborhood and submit them as independent query sets to identify similar sets in the other neighborhood. This approach has two significant drawbacks. First, it implies the generation of all the sets of a neighborhood, which is exactly what we sought to avoid, and, more importantly, the inclusions between query sets will cause redundancy both in the computations and in the results. Second, the pertinence definition is not symmetric: given two neighborhoods N1 and N2, the results obtained will differ depending on which neighborhood serves as the query neighborhood and which serves as the target neighborhood. This is because all the query sets are assumed to be pertinent, which is typically not the case. In the example of the hierarchical clustering of gene expression profiles and the Gene Ontology, a "two-sided" pertinence definition would make it possible to identify the pertinent clusters to be compared with the pertinent GO terms. Thus, the pertinence definition should be revisited in this context, ideally to permit the design of algorithms that search both neighborhoods simultaneously.
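The first drawback of the naive solution can be made concrete with a short Python sketch. The Jaccard index is used here only as a simple stand-in for a similarity measure, and the function names, the threshold and the toy sets are assumptions made for the example. Because the Jaccard index is symmetric, the sketch illustrates the redundancy caused by nested query sets, not the asymmetry of the pertinence definition.

```python
def jaccard(a, b):
    """Jaccard index of two sets (stand-in similarity, assumed for this sketch)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def naive_matches(query_neighborhood, target_neighborhood, threshold=0.5):
    """Compare every set of one neighborhood with every set of the other."""
    matches = []
    for q_name, q_set in query_neighborhood.items():
        for t_name, t_set in target_neighborhood.items():
            score = jaccard(q_set, t_set)
            if score >= threshold:
                matches.append((q_name, t_name, score))
    return matches


# Hypothetical nested clusters (as produced by a hierarchical clustering)
# and two nested annotation sets.
clusters = {
    "c1": {"a", "b"},
    "c2": {"a", "b", "c"},
    "c3": {"a", "b", "c", "d"},
}
terms = {
    "t1": {"a", "b", "c"},
    "t2": {"a", "b", "c", "d", "e"},
}
matches = naive_matches(clusters, terms)
```

With these toy sets, six comparisons are performed and five of them pass the threshold, although the three clusters are nested and largely describe the same genes. Every set of both neighborhoods also has to be materialized before the search can start.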
The authors thank Antoine de Daruvar and Pascal Durrens for useful discussions and remarks, and are grateful to the reviewers for their helpful comments. The BlastSets project is supported by funds allocated by the ACI IMPBio from the French Ministry of Research. RB was financed by Université Bordeaux I and is now financed by the Research Council of the Katholieke Universiteit Leuven, Center of Excellence EF/05/007 SymBioSys.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.