TAFFEL: Independent Enrichment Analysis of gene sets
© Kurki et al; licensee BioMed Central Ltd. 2011
Received: 19 January 2011
Accepted: 19 May 2011
Published: 19 May 2011
A major challenge in genomic research is identifying significant biological processes and generating new hypotheses from large gene sets. Gene sets often consist of multiple separate biological pathways, controlled by distinct regulatory mechanisms. Many of these pathways and the associated regulatory mechanisms might be obscured by a large number of other significant processes and thus not identified as significant by standard gene set enrichment analysis tools.
We present a novel method called Independent Enrichment Analysis (IEA) and the accompanying software TAFFEL, which ease this task by clustering genes into subgroups using Gene Ontology categories and transcription regulators. IEA indicates the transcriptional regulators that putatively control biological functions in the studied condition.
We demonstrate that the developed method and the TAFFEL tool give new insight into the analysis of differentially expressed genes and can generate novel hypotheses. Our comparison to other popular methods showed that the IEA method implemented in TAFFEL can find important biological phenomena that are not reported by other methods.
Gene expression studies often compare samples from two or more experimental conditions, the most typical outcome being a set of genes that differ in expression between the conditions. Several databases, computational methods and software programs have been recently published for analysis of such differentially expressed (DE) gene sets. Usually these tools are aimed at finding out associated (differentially active) biological mechanisms by searching associations of DE genes to various biological functions, processes and pathways reported in the biological databases such as Gene Ontology (GO) . The output of these tools is usually a list of biological terms (functions, processes, pathways etc.) that are more frequently associated to the gene set than expected by chance. Therefore, this analysis is often referred to as enrichment analysis (EA) (for an extensive review of these methods see ). This type of analysis is implemented in tools such as GENERATOR , DAVID , FatiGO , GOToolBox , GenMAPP , GoMiner , Gostat  and OntoTools .
Standard EA has some notable shortcomings that should be taken into account, especially in the case of DE genes. First, DE genes tend to be associated to multiple distinct biological phenomena rather than one or a few. This problem has been recently addressed by applying various clustering methods for finding gene subgroups with homogeneous functional annotations [3, 4, 6], and combining similar functional annotations together . Clustering can reveal interesting gene subgroups, but so far, there are no definitive methods available to verify them or obtain further interpretation about their biological significance in the studied cases, other than calculating the internal homogeneity of clusters. Secondly, the result of EA is largely dependent on statistical cut-off values used for selecting the list of DE genes. By choosing a loose cut-off value, many important processes may be obscured by false positive (FP) genes and thus are not observed. This is partly addressed by the aforementioned clustering methods, which can separate FP genes  and methods like Gene Set Enrichment Analysis (GSEA)  and Functional Class Scoring (FCS) , not based on fixed cut-off. Still, these methods do not show any further evidence about the importance of resulting genes or gene groups.
In order to demonstrate the utility of our method and the associated software, we applied TAFFEL to two datasets. Firstly, we analyzed differentially expressed genes in human HEK293T cell culture after treatment with forskolin, a cyclic AMP (cAMP) pathway inducer. Using TAFFEL we show that the list of differentially expressed genes comprises separate functional and regulatory gene subsets that relate to parts of cAMP-related pathways. The result correctly indicates that cAMP also launches other major mechanisms besides the CREB binding protein related pathway that is most commonly linked to cAMP in the literature.
Secondly, we analyzed differentially expressed genes between human ruptured and unruptured saccular intracranial artery aneurysm (sIA) walls obtained during surgery. Subarachnoid hemorrhage from ruptured sIA (aSAH) is a devastating form of stroke that affects the working-age population. The sIA disease is a complex trait that is poorly understood. In previous comparisons of ruptured and unruptured sIA walls, intimal hyperplasia, endothelial injury, luminal thrombosis, mitosis, apoptosis, T-cell and macrophage infiltration, expression of growth factor receptors, complement activation and MAPK-signalling were associated with the sIA wall rupture. In addition, in our whole genome mRNA profiling of 11 ruptured and 8 unruptured sIA walls, inflammation, response to turbulent blood flow, leukocyte migration, oxidative stress and vascular remodelling were associated with the rupture, and in silico transcription factor analyses identified enriched NF-κB, HIF1A and ETS transcription factor binding sites among up-regulated genes. This dataset was re-analyzed using TAFFEL in order to demonstrate the capability of TAFFEL to find novel phenomena overlooked in standard analysis and to identify factors that might be causing the reported phenomena. The results suggest novel molecular mechanisms and demonstrate the usefulness of TAFFEL in snapshot type research settings and in diseases of poorly characterized molecular pathogenesis.
We compared TAFFEL gene clustering results against results from five other methods or tools used for enrichment analysis: standard list of GO-terms sorted according to Fisher's Exact test p-values, a sorted list of GO-terms and transcription factors resulting from FatiGO+ tool , annotation sets resulting from the Functional Annotation Clustering tool available in DAVID , co-occurring sets of GO-terms and transcription factors resulting from apriori association rule discovery algorithm implemented in GeneCodis  and results from GSEA . The comparison shows that TAFFEL can discover important individual themes and relations between transcription factors and biological processes that are not reported at all by other methods.
Description of the method and tool
TAFFEL uses a non-nested hierarchical clustering scheme  for finding gene subgroups that are homogeneous in GO terms or TF annotations. The gene subgroups are a partition of the whole gene set i.e. they are disjoint sets that cover the whole gene set.
For each gene cluster, TAFFEL reports both the enriched GO terms and TF annotations, regardless of what information (GO or TF) was used for clustering. For the first level of the tree, representing the whole analyzed gene list, the enrichment is measured in the list versus the genome. This is analogous to the traditional enrichment analysis and can be used for observing the most interesting themes in general. This enrichment is also reported for the annotations in the clusters of subsequent tree levels (column "List p-value" in the software) as additional evidence of their biological significance. However, as a principal description for each cluster in the subsequent tree levels, TAFFEL reports annotations that are enriched in each cluster versus the whole gene list (column "Cluster p-value" in the software). This gives the user a compact overview of the different biological phenomena present in the analyzed list of genes.
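The two enrichment columns can be illustrated with a minimal Python sketch. It assumes a one-sided Fisher's exact test (the hypergeometric upper tail); the function name `enrichment_p` and all counts are hypothetical and chosen only for illustration, and the code is not taken from TAFFEL itself.

```python
from math import comb


def enrichment_p(hits, sample_size, annotated, population):
    """One-sided Fisher's exact test: P(X >= hits) under the hypergeometric null.

    hits        -- genes in the sample that carry the annotation
    sample_size -- number of genes in the sample (gene list or cluster)
    annotated   -- genes in the background that carry the annotation
    population  -- number of genes in the background (genome or gene list)
    """
    upper = min(sample_size, annotated)
    total = comb(population, sample_size)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample_size - k)
        for k in range(hits, upper + 1)
    )
    return tail / total


# Hypothetical counts for one GO term (illustrative numbers only).
GENOME_SIZE, GENOME_ANNOTATED = 20000, 200
LIST_SIZE, LIST_ANNOTATED = 100, 8
CLUSTER_SIZE, CLUSTER_ANNOTATED = 20, 6

# "List p-value": the whole gene list measured against the genome.
list_p = enrichment_p(LIST_ANNOTATED, LIST_SIZE, GENOME_ANNOTATED, GENOME_SIZE)
# "Cluster p-value": one cluster measured against the whole gene list.
cluster_p = enrichment_p(CLUSTER_ANNOTATED, CLUSTER_SIZE, LIST_ANNOTATED, LIST_SIZE)

print(f"list p-value = {list_p:.3g}, cluster p-value = {cluster_p:.3g}")
```

The only difference between the two calls is the background: the genome for the list-level test and the analyzed gene list for the cluster-level test.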
In order to gain more evidence about the biological meaningfulness of the resulting clusters, TAFFEL performs two types of extrinsic evaluation steps. Firstly, in the IEA evaluation, each functionally homogeneous gene cluster is evaluated in terms of enrichment of TFs, and each gene cluster homogeneous in TFs is evaluated in terms of enrichment in GO terms. Secondly, TAFFEL allows measuring correlations of gene memberships between all possible cluster pairs where one cluster comes from the GO tree and the other from the TF tree. This measure, referred to as inter-correlation, can be used to identify the gene clusters that share the same genes regardless of whether TFs or GO terms were used as the basis for clustering. Both IEA and inter-correlation can be used for validating the biological significance of gene subgroups and for interpreting relations between transcription regulators and the processes they regulate.
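As an illustration of the inter-correlation idea, the sketch below scores every GO-cluster/TF-cluster pair by the correlation of their gene memberships. It assumes the phi coefficient (the Pearson correlation of two 0/1 membership vectors) as the correlation measure, which is our choice for the example and may differ from the statistic used in TAFFEL; it also omits any significance testing. Cluster names and gene identifiers are hypothetical.

```python
from math import sqrt


def membership_correlation(cluster_a, cluster_b, n_genes):
    """Phi coefficient between the 0/1 membership vectors of two gene sets.

    Both sets are assumed to be subsets of the same list of n_genes genes.
    """
    a, b = len(cluster_a), len(cluster_b)
    both = len(cluster_a & cluster_b)
    denom = sqrt(a * (n_genes - a) * b * (n_genes - b))
    if denom == 0:
        return 0.0  # an empty cluster, or one covering every gene, carries no signal
    return (n_genes * both - a * b) / denom


def inter_correlation(go_clusters, tf_clusters, n_genes):
    """Score all GO/TF cluster pairs and sort them from highest to lowest."""
    pairs = [
        (go_name, tf_name, membership_correlation(go_genes, tf_genes, n_genes))
        for go_name, go_genes in go_clusters.items()
        for tf_name, tf_genes in tf_clusters.items()
    ]
    return sorted(pairs, key=lambda pair: pair[2], reverse=True)


# Hypothetical partitions of the same eight genes in the GO tree and the TF tree.
go_clusters = {"GO_1": {"g1", "g2", "g3", "g4"}, "GO_2": {"g5", "g6", "g7", "g8"}}
tf_clusters = {"TF_1": {"g1", "g2", "g3", "g5"}, "TF_2": {"g4", "g6", "g7", "g8"}}

ranked = inter_correlation(go_clusters, tf_clusters, n_genes=8)
for go_name, tf_name, score in ranked:
    print(go_name, tf_name, round(score, 2))
```

Pairs with a high positive score are clusters that contain largely the same genes, whichever annotation type was used to form them.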
Availability and running the program
TAFFEL is a Java Web Start application written using Java Standard Edition 6 with NetBeans integrated development environment (http://www.netbeans.org). MySQL (http://www.mysql.com) database is used to store all the persistent data. Running TAFFEL requires Java Runtime Environment version 6. TAFFEL program, help-pages and example data sets are freely available under LGPL license from http://www.oppi.uku.fi/bioinformatics/taffel.
Typical analysis flow
A typical analysis flow with TAFFEL is shown in Figure 1. Firstly, the gene list is imported to TAFFEL and clustered using GO terms and TF annotations as data. Secondly, the root levels of the GO and TF trees are observed to study the themes associated to the whole gene list in general. Thirdly, the clusters at the tree levels with the smallest dAIC scores in both the GO and TF trees are observed in order to find which separate themes are associated to the analyzed gene list and which respective gene subgroups constitute it. Fourthly, the coherency of these clusters is evaluated by observing their conservation throughout the tree. Finally, special focus is set on the clusters in the selected levels by using IEA and inter-correlation methods for cluster evaluation. The independently enriched themes in each cluster can be used to infer the TFs that drive a particular biological process or function in the analyzed condition.
The resulting clusters can be analyzed further in several ways, such as highlighting the clusters that include particular GO terms or TFs, finding correlations between clusters in different trees, and showing the list of genes associated with specific GO terms and/or TF annotations in each cluster. The results can be exported from the program in text form, and all results can be saved in one XML file.
Analysis of forskolin effect on HEK293T cells
In order to test the developed method, we applied TAFFEL to the analysis of differentially expressed genes in HEK293T cells incubated with forskolin for four hours. Forskolin increases the concentration of intracellular cyclic adenosine monophosphate (cAMP), a key mediator in several signalling pathways. The genes were clustered with TAFFEL using the GO biological process terms and the TF annotations separately, up to 15 clusters. The results were interpreted using the typical analysis flow described in Methods. Special attention was paid to the level at depth 11 in the GO tree and the level at depth 13 in the TF tree, both of which obtained the best dAIC scores.
As expected, the results from enrichment of the complete gene list indicated that forskolin had overtaken the cAMP pathway from the G-protein receptor controlled pathways at the 4 h check point, as a large number of genes were annotated with cAMP-related GO terms. However, the TAFFEL clustering was able to detect a more complex network of interactions between the MAPK and AhR pathways. The gene clusters at the dAIC-selected level of the TF tree were enriched with certain expected TFs, such as ATF and CREB, AhR and HIF, variable E2 family and Rb complexes and EF dimers, and EGR-1. In turn, the clusters from the GO tree were enriched with lipid metabolism, cell adhesion, macromolecule localization, DNA metabolism and apoptosis related terms. Most of these clusters were also conserved at several tree levels, suggesting their coherency.
Table 1. Statistically significant clusters in the forskolin (FSK) and up-regulated (sIA↑) and down-regulated (sIA↓) aneurysm datasets
transcription from RNA polymerase II promoter
positive regulation of macromolecule biosynthetic process
positive regulation of gene expression
establishment of protein localization
metal ion transport
amine biosynthetic process
nervous system development
generation of neurons
organic acid metabolic process
carboxylic acid metabolic process
lipid metabolic process
The gene clusters were compared between the GO and TF trees using the inter-correlation method (data not shown). This analysis again highlighted the GO cluster annotated with positive regulation of macromolecule biosynthetic process; its highest correlating counterpart in the TF tree was the HIF-1A related gene cluster. As HIF1A, one of the few hypoxia inducible factors, is the closest paralog of AhR in human, and both factors require ARNT or ARNT2 as a dimerization partner, this further suggests that basic-helix-loop-helix transcription factors such as AhR have a role in transcription activated by cAMP signalling.
Analysis of activated and deactivated genes in ruptured intracranial aneurysm wall
We also applied TAFFEL to the analysis of differentially expressed genes in ruptured human sIA walls as compared to unruptured sIA walls. The over-expressed genes (marked with sIA↑) and the under-expressed genes (marked with sIA↓) were clustered separately using TF annotations and GO terms up to 20 clusters, as this number far exceeded the best-scoring cluster number according to the dAIC measure. As this data set had already been analyzed using the standard enrichment method, we focused only on the IEA and inter-correlation methods at the best-scoring clustering levels: level 8 for the GO tree and level 11 for the TF tree.
A few of these clusters obtained significant independent enrichment in IEA after correction for multiple testing (Table 1). One was the protein phosphorylation (MAP kinases) and cation transport-related cluster among the over-expressed genes. MAPK-signalling (MAPKS) in the sIA wall has previously been shown to be associated with rupture. In IEA, MTF-1 (metal responsive transcription factor 1) was significantly independently enriched (FDR corrected p = 0.048). MTF-1 is a stress- and metal-activated (especially zinc-activated) TF that drives the expression of antioxidant and anti-inflammatory genes, e.g., in atherosclerosis. It controls, for example, metallothioneins (MT), which are zinc-transferring proteins. The cluster is enriched in ion-transferring proteins and contains MT2A, a primary target of MTF-1.
Another cluster found in IEA, from the analysis of under-expressed genes, was related to oxidation-reduction and was independently enriched with the NF-1 (nuclear factor 1 C, NF1C) transcription factor (FDR corrected p = 0.037). The activation capability of NF1C is repressed by oxidative stress, and NFIC knockout decreases the activity of the cytochrome P450 family gene CYP1A1. The cluster contains 2 CYP-family genes and many other lipid and amino acid metabolizing genes, as well as genes protecting against or controlling oxidative stress (NXN, OXR1).
The third cluster found in IEA was identified among the down-regulated genes. The cluster was enriched with neuron development related GO terms, cell development, cell motion (not visible among top three), cell projection and organization (not visible among top three) and independently enriched Tal-1 transcription factor (FDR corrected p = 0.031). Tal-1 protein is known to drive endothelial cell migration and morphogenesis in angiogenesis [24, 25]. Tal-1 regulates VE-cadherin expression in endothelial cells. VE-cadherin concentrates on cell-to-cell adherens junctions and maintains cell adhesion, controls vascular permeability and relays signals necessary for vascular stabilization. VE-cadherin is a positive controller of TGF-β signalling and deletion of various components of this signalling pathway leads to several vascular manifestations, often including hemorrhages .
In order to find out whether the clustering by GO terms and TF annotations would yield any clusters with common genes, the TAFFEL inter-correlation method was applied. A strong link was observed between apoptosis and the TFs MEF2A and Lhx3a (FDR corrected p = 7.5E-6). MEF2A is a myocyte enhancer factor, which controls many muscle-specific genes. A low number of smooth muscle cells with disorganized architecture has been associated with aneurysm rupture. MEF2A has also been implicated as a candidate gene for coronary artery disease, and our results suggest that MEF2A dysregulation might be involved in smooth muscle cell apoptosis in the ruptured sIA walls.
Comparison of TAFFEL to other methods
Several different approaches for analyzing differentially expressed gene sets exist, such as GENERATOR, DAVID, FatiGO, GOToolBox, GenMAPP, GoMiner, OntoTools and GSEA, which can report the enriched terms, e.g. the functional annotations or TF information, but no relation between these concepts. The main advance of TAFFEL is the IEA method, which allows statistically interpretable evaluation of the clusters found, helps focus attention on the most interesting gene clusters among many, and provides information about the regulator proteins controlling functionally homogeneous gene subgroups.
We performed an extensive comparison of TAFFEL with four other popular methods, DAVID, GeneCodis, GSEA and FatiGO+, which address challenges similar to those targeted by TAFFEL. For a full explanation of the comparison results from the forskolin dataset and of the methodology, see Additional file 1; for result tables from the sIA dataset, see Additional file 2.
Similar ways of clustering gene sets are implemented in GENERATOR , GOToolBox  and DAVID  tools. Our comparison between TAFFEL and DAVID clustering and standard EA indicates the advantage of clustering methods over standard EA: clustering can ease the interpretation of results by reducing the amount of resulting categories and may additionally highlight some potentially important categories not revealed as significant in the whole gene set. Furthermore, IEA implemented in TAFFEL presents two new improvements. First, pointing a few clusters out of many eases the interpretation of results. Secondly, TAFFEL IEA can point gene clusters or GO terms that are not statistically significantly enriched in the whole gene list, and thus not reported by standard EA or DAVID clustering, but are still potentially biologically meaningful due to enrichment of TF annotations.
The FatiGO+ tool provides information about enriched GO terms and TFs using the TRANSFAC and cisRED databases, similarly to TAFFEL. The main difference between FatiGO+ and TAFFEL is that FatiGO+ searches for enrichment in the complete list of DE genes and does not divide the genes into subsets as TAFFEL does. Also, it does not provide relations between TFs and the different enriched biological processes. The same results can be obtained from the root levels of the TAFFEL GO and TF trees, which analyze the enrichment in the complete gene set. Additionally, TAFFEL clustering and IEA can discover novel themes from the data and provide clues to the regulatory control of the identified biological processes.
Analysis with the GSEA method did not produce useful results with the tested data sets: no annotations were significant after multiple testing correction. Problems regarding the robustness of GSEA in various situations have been reported before [27, 28]. However, the strength of the GSEA method is that the analyst does not need to define a fixed statistical cut-off for producing the differentially expressed gene set. GSEA seeks the enrichment of terms (functional classifications, transcription factor binding sites, etc.) in the top or in the tail of the gene set, which is sorted according to e.g. fold change or p-values. Nevertheless, GSEA seeks the enrichment of each annotation separately and does not consider any relations between annotation terms as TAFFEL does.
We considered the comparison of TAFFEL to GeneCodis as the most important of all presented comparisons as these two tools address partly the same concern by seeking relations between different annotation systems within a set of genes. GeneCodis, however, does not perform any clustering and therefore may miss important biological phenomena, which are not enriched in the whole gene set but in a subset of genes. It should be noted that GeneCodis does not particularly aim at finding only relations between GO terms and TFs, but rather co-occurrences of any terms within one or several annotation systems. This can be important clustering in itself, and is provided also by TAFFEL in the form of enriched GO-terms or TFs in each resulting gene cluster. However, the results from GeneCodis for our data sets show a very large list of annotations with ambiguous repetition of the same GO terms and/or TFs coupling with each other in multiple various combinations (see additional file 2 for the 50 first ranks of the 4538 total ranks reported significant after FDR correction). The numbers of genes associated to such co-occurring annotations were also very low although reported significant. Using the forskolin data set, the maximum of associated genes was 6 with co-occurring annotations including terms from only one annotation system and 4 with co-occurring annotations including terms from both GO and TF annotation systems. In order to compare GeneCodis to TAFFEL IEA method, we paid special attention to the few co-occurrences including both GO terms and TFs (see Additional file 1 for forskolin dataset and Additional file 2 for full results from both datasets). Some of the themes such as transcription regulation were common with the results from other tools. However, the results contained ambiguous repetition of the same process with several different sets of TFs. 
As a comparison, TAFFEL clusters resulting from IEA in Table 1 include 22 - 58 genes and of these genes the independently enriched (statistically significant after FDR correction) GO or TF terms cover 20 - 60%. This suggests that it would be advantageous to perform such clustering analysis instead of associating individual GO terms and TF annotations. GeneCodis may however work better when dealing with two or more annotation systems with highly overlapping annotations, such as GO and KEGG.
We present a novel method for the analysis of differentially expressed (DE) genes for the discovery of co-functional and co-regulated subsets of genes, and for further analysis of such clusters with functional annotations and regulatory protein information. As information about gene regulatory elements, we have used TF predictions and annotations from cisRED database where putative binding sites are validated in terms of evolutionary conservation . Such validation has shown to be advantageous as it can significantly reduce the amount of false positives in predictions [29, 30]. Moreover, our clustering of genes using TF information as well as the further validation of discovered clusters using functional annotations should reveal relevant patterns from the data and reduce amount of noise.
A major limitation of our method and of many other methods employing GO and TF data is that the knowledge of gene functions (the GO annotations) and regulation (TFs) is incomplete. Furthermore, the GO annotations are biased towards well-studied biological phenomena, and the predicted TF binding sites (cisRED) often contain a large number of false positives. Still, the clustering method alleviates this problem in the sense that the clustering is not driven by randomly distributed annotations (false positives or negatives) but by stable annotations shared by many genes. The constantly improving quality of the annotations is also likely to improve the results obtained using our method. It should also be noted that gene expression is not necessarily functional, in the sense that co-expressed or similarly expressed genes do not necessarily share any GO annotations. Thus our clustering approach does not necessarily produce clusters of co-expressed genes, which likely results in fewer significant IEA clusters. Also, the AIC method used for cluster number selection is not necessarily optimal; rather, it strikes a good balance between accuracy and number of parameters. Cluster number selection is a very general problem, and usually there is no single best solution for every dataset (see for example reviews [33, 34]). In our method we use cluster number selection as a guide that points the analyst to a particular clustering level at which to start the analysis.
The result of TAFFEL analysis for the DE genes after forskolin treatment of human HEK293T cells in culture showed expected results at the first level of clustering tree, e.g., the enrichment of cAMP related GO terms and CREB TF. Interestingly, the clustering analysis was able to identify a piece of a complex MAPK-AP1-AhR related transcription network, related to proliferation and regulation of metabolism. The most prominent result was the independent enrichment of AhR and HES-1 TFs in the macromolecule localization related gene cluster. The AhR activation alone causes up-regulation of xenobiotic-metabolizing enzymes. MAP kinases are known to be involved in process in which AhR and the heat-shock chaperone complex are translocated to nucleus . AhR is also phosphorylated by calcium-controlled protein kinase C, and by several other kinases . JUN and ELK1, included in the cluster, are not typically considered as direct target for the AhR agonist, but are known to be phosphorylated after AhR activation . However, although the enrichment of AhR TF was not the most expected result, recent information shows that cAMP is indeed a direct mediator of AhR signaling . Hairy and Enhancer of Split homolog-1 (HES-1) is a transcriptional repressor with basic helix-loop-helix structure. It has been suggested that HES-1 and AhR factors have cross-talk [38, 39] although the cross-talk between AhR and other transcription factors is complex and poorly understood . Recent literature indicates that AhR has a role in regulation of the transcription in human HEK293T cells and in mouse kidney (Boutros et al., 2009), generally agreeing with results of TAFFEL analysis.
In the analysis of over and under-expressed genes in the ruptured saccular intracranial aneurysm (sIA) walls TAFFEL identified several interesting clusters, some in line with prior data [15–18], some providing basis for new hypotheses on mechanism behind the sIA wall rupture. In the TAFFEL analysis for the over-expressed genes, previously described phenomena of MAPK and apoptotic signalling related to the sIA wall rupture  were detected. MAPK is also known to control a wide spectrum of other biological processes, including the cell cycle, cellular metabolism, motility and survival. However, the anti-apoptotic and pro-apoptotic control of MAPKS is not presently well known . Secondly, TAFFEL results for the over-expressed genes support forming new hypotheses relating to the signalling that ensures endothelial integrity. Inflammatory cell infiltration has been reported to associate to the sIA wall rupture  but the etiology or mechanisms for this phenomenon remain unknown. Our data suggests the possibility that abnormal function of Tal-1 transcription factor, being in the centre of endothelial cell integrity preserving regulatory cascade of TGF-β and VE-cadherin signalling, might lead to excess vascular permeability and endothelial dysfunction, leading in turn to enhanced inflammatory cell infiltration and vascular wall instability.
Another significant IEA cluster for the under-expressed genes in the ruptured sIA walls was the regulation of oxidation-reduction and metabolism genes by NF1C. NF1C activity is repressed by oxidative stress, and thus the down-regulation of the genes in this cluster might be caused by inactivation of NF1C by oxidative stress possibly present in the ruptured aneurysms. The exact consequences of the down-regulation of these metabolic genes in ruptured aneurysms must be investigated in further studies.
The final observation from IEA for the under-expressed genes in the ruptured sIA walls was the regulation of metallothioneins (MT; genes associated with the GO term metal ion transport) by the MTF-1 transcription factor. MT activation and reduced zinc bioavailability are known to be associated with aging and cardiovascular diseases in the elderly. It is also known that the risk of sIA wall rupture and subarachnoid hemorrhage increases with age. Although MTF-1 is mainly vascular protective, chronic low grade inflammation can maintain long-term elevation of MTs, which in turn may lead to a pro-inflammatory response, plausibly due to decreased zinc bioavailability. Thus, the active regulation of MT genes by MTF-1, proposed by TAFFEL, suggests that long-term inflammation and zinc deficiency may play crucial roles in the rupture, caused by either a de-stabilization or reactive changes in the sIA wall tissue. Dysregulation of other metal ions such as calcium might be another outcome of MTF-1 signalling. In fact, the calcium channel blocker nimodipine is recommended as a standard treatment for patients with aneurysmal subarachnoid hemorrhage to prevent secondary vasospasm and ischemic brain injury.
In conclusion, we have demonstrated that the developed method and the TAFFEL tool give new insight into the analysis of differentially expressed genes and can generate novel hypotheses. Our comparison to other popular methods showed that the IEA method implemented in TAFFEL can find important biological phenomena that are not reported at all by other methods.
Firstly, the analysis of forskolin-treated HEK293T cells indicates that TAFFEL identifies well-known and expected phenomena, such as differential expression of CREB regulated genes, but can also lead to new hypotheses, e.g., on the role of AhR. Secondly, the results with the sIA wall rupture related data give confidence in the usefulness of TAFFEL in the analysis of complex and poorly characterized clinical conditions, affected by inherited and acquired risk factors. These findings suggest that TAFFEL is an efficient method for generating new hypotheses to be further tested in basic or applied molecular genetic research. The testing of such hypotheses is crucial for finding novel targets for new biological approaches, e.g., diagnostic tests for the identification of sIA carriers in the population, or non-invasive methods to close or stabilize the rupture-prone sIA wall.
Annotation data sources
For the functional grouping of genes, TAFFEL uses Gene Ontology  annotations (December 2008 release used in this study) from Ensembl database  (version 53 used in this study). The included species are human, mouse, rat and C. elegans. The current version of TAFFEL can use biological process and molecular function ontologies from GO, either separately or in parallel.
Secondly, TAFFEL uses information about predicted TFBSs available in the public cisRED database , containing genome wide collections of sequence motifs conserved in gene regulatory regions. The motifs have been annotated by transcription factors (TFs) found in TRANSFAC  and JASPAR  databases. In TAFFEL, we have included all TF annotations from both of these databases that have similarity p-value < 0.001 with the found sequence motif. We have included data for human (version 9), mouse (version 4) and C. elegans (version 4).
Gene clustering method
In order to perform gene clustering, associations between genes and annotations (GO terms and TFs) are represented as a binary matrix. Each row in the matrix represents a gene and each column represents an annotation. In the matrix, a cell value of one indicates an association and zero indicates no association between the row (gene) and the column (annotation) (Figure 1). For clustering, we apply an approach based on non-negative matrix factorization (NMF). This approach has been advantageous in clustering sparse binary data and in finding clusters that are defined in a (possibly small) subset of all data attributes. Both of these features are important for the data described here. Firstly, the data are sparse by nature. Secondly, one set of genes is often associated with numerous biological attributes (TFs and GO terms), many of which may not be relevant.
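The matrix representation and a basic factorization can be sketched in a few lines of Python. This is a simplified stand-in, not the TAFFEL implementation: it assumes the plain multiplicative-update NMF of Lee and Seung with a squared-error objective and assigns each gene to the factor with the largest weight, whereas TAFFEL builds a non-nested hierarchy of clusterings. Gene names are hypothetical and all function names are ours.

```python
import random


def build_binary_matrix(annotations):
    """Return (genes, terms, matrix) with matrix[i][j] = 1 if gene i has term j."""
    genes = sorted(annotations)
    terms = sorted({term for gene_terms in annotations.values() for term in gene_terms})
    matrix = [[1 if term in annotations[gene] else 0 for term in terms] for gene in genes]
    return genes, terms, matrix


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def nmf(v, k, iterations=200, seed=0, eps=1e-9):
    """Factorize v (genes x terms) into non-negative w (genes x k) and h (k x terms)."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + eps for _ in range(k)] for _ in range(n)]
    h = [[rng.random() + eps for _ in range(m)] for _ in range(k)]
    for _ in range(iterations):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)] for i in range(k)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(k)] for i in range(n)]
    return w, h


def squared_error(v, w, h):
    wh = matmul(w, h)
    return sum((x - y) ** 2 for v_row, wh_row in zip(v, wh) for x, y in zip(v_row, wh_row))


# Hypothetical genes annotated with GO terms (sparse, binary associations).
annotations = {
    "gene1": {"lipid metabolic process", "organic acid metabolic process"},
    "gene2": {"lipid metabolic process", "organic acid metabolic process"},
    "gene3": {"lipid metabolic process"},
    "gene4": {"metal ion transport", "nervous system development"},
    "gene5": {"metal ion transport"},
    "gene6": {"metal ion transport", "nervous system development"},
}

genes, terms, matrix = build_binary_matrix(annotations)
w, h = nmf(matrix, k=2)
# Assign each gene to the factor in which it has the largest weight.
clusters = {gene: max(range(2), key=lambda c: row[c]) for gene, row in zip(genes, w)}
print(clusters)
```

Because the factors are non-negative, each row of `h` can be read as a profile of annotations that tend to occur together, which is what makes the factorization usable for sparse binary annotation data.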
Selection of number of clusters
To choose a clustering solution with a suitable balance between goodness of fit to the data and complexity, TAFFEL uses the Akaike Information Criterion (AIC) for statistical model selection. AIC combines the number of parameters k of the statistical model representing the evaluated clustering solution with the maximized log-likelihood ln L of the data under the same model, so that each additional parameter is penalized; in its standard form, AIC = 2k − 2 ln L, and a smaller score is better. Due to the simplicity and robustness of the method, it has been widely used in similar clustering applications (see e.g. [49–51]).
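As an illustration, the following sketch scores a clustering solution with the standard form AIC = 2k − 2 ln L. The statistical model is an assumption made for this example only, since the text does not specify it: every annotation column is treated as an independent Bernoulli variable within each cluster, giving one parameter per cluster and column.

```python
import math

def bernoulli_log_likelihood(values):
    """Maximized log-likelihood of a 0/1 sequence under a single Bernoulli parameter."""
    n, ones = len(values), sum(values)
    if ones == 0 or ones == n:
        return 0.0  # the maximum-likelihood estimate (0 or 1) fits the data exactly
    p = ones / n
    return ones * math.log(p) + (n - ones) * math.log(1 - p)

def clustering_aic(matrix, labels):
    """AIC = 2k - 2 ln L for a clustering of the rows of a binary matrix (smaller is better)."""
    clusters = {}
    for row, label in zip(matrix, labels):
        clusters.setdefault(label, []).append(row)
    log_lik = sum(
        bernoulli_log_likelihood(column)
        for rows in clusters.values()
        for column in zip(*rows)
    )
    n_params = len(clusters) * len(matrix[0])  # one Bernoulli parameter per cluster and column
    return 2 * n_params - 2 * log_lik

# Hypothetical data: two groups of ten genes with complementary annotations.
matrix = [[1, 1, 0, 0]] * 10 + [[0, 0, 1, 1]] * 10
aic_one_cluster = clustering_aic(matrix, [0] * 20)
aic_two_clusters = clustering_aic(matrix, [0] * 10 + [1] * 10)
```

Under these assumptions the two-cluster solution obtains the smaller score, because the gain in likelihood outweighs the four extra parameters.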
As most of the dimensions (GO terms or TFs) in the gene annotation data are distributed randomly across the resulting clusters, the clusters tend to exist in a relatively small subset of all dimensions. Besides being problematic for clustering, this behaviour is also problematic for model selection: the selection tends to be overwhelmed by the randomly distributed dimensions and, across different data sets, systematically favours a result with only one or a few clusters. Thus, we also calculated a modified AIC, referred to as dAIC, for which we used only the dimensions that are distributed in a non-random fashion in at least one of the clustering solutions with >2 clusters in the whole TAFFEL tree. This was tested by comparing the AIC score of the dimension in the whole gene list with its AIC score in each clustering solution. If the AIC score was better (smaller) in any of the clustering solutions, the evaluated dimension was included in the calculation of dAIC. The same set of dimensions was then used for calculating dAIC for the different clustering solutions, including the whole gene list as one cluster. This feature selection filtered out at least 50% of the GO terms in our forskolin and sIA datasets (see Results section for a detailed description of the datasets). When the remaining dimensions were used for calculating the AIC score, the number of selected clusters was systematically higher than when using all dimensions.
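The dimension filtering behind dAIC can be sketched as follows. This is an illustrative reading of the procedure, not the TAFFEL code: it assumes the same per-cluster Bernoulli model for each dimension as a stand-in for the unspecified statistical model, and the function names are hypothetical.

```python
import math

def column_log_lik(values):
    """Maximized Bernoulli log-likelihood of a 0/1 sequence."""
    n, ones = len(values), sum(values)
    if ones == 0 or ones == n:
        return 0.0
    p = ones / n
    return ones * math.log(p) + (n - ones) * math.log(1 - p)

def dimension_aic(matrix, labels, j):
    """AIC of dimension j alone: one Bernoulli parameter per cluster."""
    groups = {}
    for row, label in zip(matrix, labels):
        groups.setdefault(label, []).append(row[j])
    return 2 * len(groups) - 2 * sum(column_log_lik(g) for g in groups.values())

def informative_dimensions(matrix, solutions):
    """Keep dimensions whose AIC improves on the whole gene list in any solution with >2 clusters."""
    whole_list = [0] * len(matrix)
    candidates = [s for s in solutions if len(set(s)) > 2]
    kept = []
    for j in range(len(matrix[0])):
        baseline = dimension_aic(matrix, whole_list, j)
        if any(dimension_aic(matrix, s, j) < baseline for s in candidates):
            kept.append(j)
    return kept

def daic(matrix, labels, dimensions):
    """dAIC: the AIC restricted to the selected dimensions."""
    return sum(dimension_aic(matrix, labels, j) for j in dimensions)

# Hypothetical data: dimension 0 marks the first cluster, dimension 1 is uninformative.
matrix = [[1, 1], [1, 0], [1, 1], [1, 0]] + [[0, 1], [0, 0], [0, 1], [0, 0]] * 2
three_clusters = [0] * 4 + [1] * 4 + [2] * 4
kept = informative_dimensions(matrix, [three_clusters])
```

Here only dimension 0 is retained, so the uninformative dimension no longer contributes a penalty of one parameter per cluster without any gain in likelihood.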
Statistical testing of enrichment in TAFFEL is performed using Fisher's exact test. Only annotations that occur in a cluster are tested. The resulting p-values are corrected for multiple testing using the Benjamini-Hochberg False Discovery Rate (FDR) procedure.
The interpretation of p-values reported by TAFFEL warrants a special note. In each cluster, enrichment is analyzed both for annotations from the same annotation system that was used in clustering (Dependent Enrichment Analysis, DEA) and for annotations from the other annotation system (Independent Enrichment Analysis, IEA). The p-values resulting from IEA have a reasonable statistical interpretation, as they test null hypotheses such as: "TF x is not dependent on the gene group y homogeneous in GO terms". Because variables x and y are statistically independent, these p-values can reasonably be used to assess the biological significance of an annotation and its dependence on each cluster. In contrast, the p-values from DEA would test null hypotheses such as: "GO term x is not dependent on gene group y homogeneous in GO terms". Here, variable y is statistically dependent on x, and thus treating the resulting values as standard p-values for statistical decision-making would lead to circular argumentation. Still, these values from DEA are suitable as relative enrichment scores representing the most characteristic annotations in each cluster.
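A minimal sketch of this testing step is given below. It assumes a one-sided Fisher's exact test for over-representation (the text does not state whether the test is one- or two-sided), and the function and variable names are hypothetical.

```python
from math import comb

def fisher_enrichment_p(hits, cluster_size, annotated, population):
    """One-sided Fisher's exact test: P(X >= hits) under the hypergeometric null."""
    upper = min(cluster_size, annotated)
    tail = sum(
        comb(annotated, i) * comb(population - annotated, cluster_size - i)
        for i in range(hits, upper + 1)
    )
    return tail / comb(population, cluster_size)

def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down to the smallest
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

def cluster_enrichment(cluster, background, annotation_genes):
    """Adjusted enrichment p-values for the annotations that occur in the cluster."""
    names, p_values = [], []
    for name, members in annotation_genes.items():
        members = members & background
        hits = len(members & cluster)
        if hits == 0:  # annotations absent from the cluster are not tested
            continue
        names.append(name)
        p_values.append(
            fisher_enrichment_p(hits, len(cluster), len(members), len(background))
        )
    return dict(zip(names, benjamini_hochberg(p_values)))

# Hypothetical example: TF:X is annotated to exactly the two genes of the cluster.
background = {"g1", "g2", "g3", "g4"}
result = cluster_enrichment(
    {"g1", "g2"}, background, {"TF:X": {"g1", "g2"}, "TF:Y": {"g3"}}
)
```

Because untested annotations are excluded before the correction, the number of hypotheses entering the Benjamini-Hochberg step differs from cluster to cluster.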
The inter-correlation measurements are also calculated using Fisher's exact test with Benjamini-Hochberg correction. As dependencies exist among the clusters between different inter-correlation comparisons, the correction tends to be highly conservative for this situation and should be interpreted with care.
The correlation between each pair of clusters in adjacent clustering solutions of the same clustering tree is calculated as the standard correlation between two binomial distributions representing the gene memberships in the clusters.
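One possible reading of this measure, shown here only as a sketch, is the Pearson correlation between the 0/1 membership vectors of the two clusters over the whole gene list. Whether TAFFEL computes exactly this quantity is an assumption, and the names below are hypothetical.

```python
import math

def membership_vector(cluster, all_genes):
    """0/1 indicator of cluster membership for every gene in the gene list."""
    return [1 if gene in cluster else 0 for gene in all_genes]

def pearson(x, y):
    """Pearson correlation; returns 0.0 when either vector is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)

def cluster_correlation(cluster_a, cluster_b, all_genes):
    """Correlation between the gene memberships of two clusters."""
    return pearson(
        membership_vector(cluster_a, all_genes),
        membership_vector(cluster_b, all_genes),
    )

all_genes = ["g1", "g2", "g3", "g4"]
same = cluster_correlation({"g1", "g2"}, {"g1", "g2"}, all_genes)
opposite = cluster_correlation({"g1", "g2"}, {"g3", "g4"}, all_genes)
```

With this definition, a child cluster that keeps all genes of its parent correlates perfectly with it, while clusters with complementary memberships are anticorrelated.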
Processing of demonstration microarray data sets
Gene expression microarray data (GSE2060, Affymetrix Human Genome U133A Array) concerning the effect of forskolin in human HEK293T cell culture were downloaded from Gene Expression Omnibus (GEO) and normalized using the RMA method. Forskolin-treated and control HEK293T cells (both in duplicate) in culture were compared at 4 hours to identify differentially expressed genes. Welch's t-test with Benjamini-Hochberg correction was used. Due to the low number of replicates, the fold change was used as an additional filtering criterion. A p-value < 0.05 and a fold change > 1.25 resulted in 691 differentially expressed genes.
Whole genome expression data of 11 ruptured and 8 unruptured sIA wall samples resected after microsurgical clipping of the sIA neck were compared using Affymetrix HG-U133 Plus 2.0 microarrays. The data were RMA normalized and compared using Welch's t-test with Benjamini-Hochberg correction for p-values. Genes with a p-value < 0.05 were regarded as differentially expressed. This resulted in 498 overexpressed and 491 underexpressed genes in the ruptured sIA wall group.
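The per-gene filtering used for the forskolin data can be sketched as follows. The sketch computes only the Welch statistic and its Welch-Satterthwaite degrees of freedom; converting them to a p-value needs the t-distribution, which is available for instance through `scipy.stats.ttest_ind` with `equal_var=False`. The expression values are hypothetical, and the fold change assumes log2-scale intensities, as produced by RMA.

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom for two samples."""
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

def fold_change(a, b):
    """Absolute fold change between two groups of log2-scale intensities."""
    return 2 ** abs(mean(a) - mean(b))

def is_differentially_expressed(adjusted_p, fc, p_cutoff=0.05, fc_cutoff=1.25):
    """Apply the thresholds reported for the forskolin data set."""
    return adjusted_p < p_cutoff and fc > fc_cutoff

# Hypothetical log2 intensities of one probe set, two replicates per condition.
treated, control = [8.0, 8.2], [7.0, 7.2]
t, df = welch_t(treated, control)
fc = fold_change(treated, control)
```

With only two replicates per group the degrees of freedom are very low, which is why an additional fold change criterion is a sensible safeguard in such a design.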
IEA: Independent Enrichment Analysis
FDR: False Discovery Rate
sIA: saccular intracranial aneurysm
cAMP: cyclic AMP
TFBS: transcription factor binding site
KEGG: Kyoto Encyclopedia of Genes and Genomes
GEO: Gene Expression Omnibus
RMA: Robust Multi-array Average
CREB: cyclic AMP response element-binding
GSEA: Gene Set Enrichment Analysis
AIC: Akaike Information Criterion
The authors would like to thank Petri Törönen for helpful comments during the development of the tool and data analysis. We are also grateful to Liisa Heikkinen for making constructive suggestions. This work was supported by the Emil Aaltonen Foundation and the Finnish Cultural Foundation to PP, the Finnish Graduate School of Molecular Medicine to MK, and the Saastamoinen Foundation to JP and GW.
- Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, Harris MA, Hill DP, Issel-Tarver L, Kasarskis A, Lewis S, Matese JC, Richardson JE, Ringwald M, Rubin GM, Sherlock G: Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet 2000, 25(1):25–29. 10.1038/75556
- Khatri P, Draghici S: Ontological analysis of gene expression data: current tools, limitations, and open problems. Bioinformatics 2005, 21(18):3587–3595. 10.1093/bioinformatics/bti565
- Pehkonen P, Wong G, Toronen P: Theme discovery from gene lists for identification and viewing of multiple functional groups. BMC Bioinformatics 2005, 6: 162. 10.1186/1471-2105-6-162
- Dennis G Jr, Sherman BT, Hosack DA, Yang J, Gao W, Lane HC, Lempicki RA: DAVID: Database for Annotation, Visualization, and Integrated Discovery. Genome Biol 2003, 4(5):P3. 10.1186/gb-2003-4-5-p3
- Al-Shahrour F, Diaz-Uriarte R, Dopazo J: FatiGO: a web tool for finding significant associations of Gene Ontology terms with groups of genes. Bioinformatics 2004, 20(4):578–580. 10.1093/bioinformatics/btg455
- Martin D, Brun C, Remy E, Mouren P, Thieffry D, Jacq B: GOToolBox: functional analysis of gene datasets based on Gene Ontology. Genome Biol 2004, 5(12):R101. 10.1186/gb-2004-5-12-r101
- Dahlquist KD, Salomonis N, Vranizan K, Lawlor SC, Conklin BR: GenMAPP, a new tool for viewing and analyzing microarray data on biological pathways. Nat Genet 2002, 31(1):19–20. 10.1038/ng0502-19
- Zeeberg BR, Feng W, Wang G, Wang MD, Fojo AT, Sunshine M, Narasimhan S, Kane DW, Reinhold WC, Lababidi S, Bussey KJ, Riss J, Barrett JC, Weinstein JN: GoMiner: a resource for biological interpretation of genomic and proteomic data. Genome Biol 2003, 4(4):R28. 10.1186/gb-2003-4-4-r28
- Beissbarth T, Speed TP: GOstat: find statistically overrepresented Gene Ontologies within a group of genes. Bioinformatics 2004, 20(9):1464–1465. 10.1093/bioinformatics/bth088
- Draghici S, Khatri P, Bhavsar P, Shah A, Krawetz SA, Tainsky MA: Onto-Tools, the toolkit of the modern biologist: Onto-Express, Onto-Compare, Onto-Design and Onto-Translate. Nucleic Acids Res 2003, 31(13):3775–3781. 10.1093/nar/gkg624
- Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, Paulovich A, Pomeroy SL, Golub TR, Lander ES, Mesirov JP: Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci USA 2005, 102(43):15545–15550. 10.1073/pnas.0506580102
- Pavlidis P, Qin J, Arango V, Mann JJ, Sibille E: Using the gene ontology for microarray data mining: a comparison of methods and application to age effects in human prefrontal cortex. Neurochem Res 2004, 29(6):1213–1222.
- Robertson G, Bilenky M, Lin K, He A, Yuen W, Dagpinar M, Varhol R, Teague K, Griffith OL, Zhang X, Pan Y, Hassel M, Sleumer MC, Pan W, Pleasance ED, Chuang M, Hao H, Li YY, Robertson N, Fjell C, Li B, Montgomery SB, Astakhova T, Zhou J, Sander J, Siddiqui AS, Jones SJ: cisRED: a database system for genome-scale computational discovery of regulatory elements. Nucleic Acids Res 2006, 34(Database issue):D68–73.
- Van Gijn J, Kerr RS, Rinkel GJ: Subarachnoid haemorrhage. Lancet 2007, 369(9558):306–318. 10.1016/S0140-6736(07)60153-6
- Frösen J, Piippo A, Paetau A, Kangasniemi M, Niemelä M, Hernesniemi J, Jääskelainen J: Remodeling of saccular cerebral artery aneurysm wall is associated with rupture: histological analysis of 24 unruptured and 42 ruptured cases. Stroke 2004, 35(10):2287–2293. 10.1161/01.STR.0000140636.30204.da
- Frösen J, Piippo A, Paetau A, Kangasniemi M, Niemela M, Hernesniemi J, Jaaskelainen J: Growth factor receptor expression and remodeling of saccular cerebral artery aneurysm walls: implications for biological therapy preventing rupture. Neurosurgery 2006, 58(3):534–41. discussion 534–41
- Tulamo R, Frosen J, Junnikkala S, Paetau A, Pitkaniemi J, Kangasniemi M, Niemela M, Jaaskelainen J, Jokitalo E, Karatas A, Hernesniemi J, Meri S: Complement activation associates with saccular cerebral artery aneurysm wall degeneration and rupture. Neurosurgery 2006, 59(5):1069–76. discussion 1076–7
- Laaksamo E, Tulamo R, Baumann M, Dashti R, Hernesniemi J, Juvela S, Niemela M, Laakso A: Involvement of mitogen-activated protein kinase signaling in growth and rupture of human intracranial aneurysms. Stroke 2008, 39(3):886–892. 10.1161/STROKEAHA.107.497875
- Kurki MI, Häkkinen S, Frösen J, Tulamo R, Fraunberg M, Wong G, Tromp G, Niemelä M, Hernesniemi J, Jääskeläinen JE, Ylä-Herttuala S: Upregulated signaling pathways in ruptured human saccular intracranial aneurysm wall: an emerging regulative role of Toll like receptor signaling and NF-κB, HIF1A and ETS transcription factors. Neurosurgery 2011, in press.
- Carmona-Saez P, Chagoyen M, Tirado F, Carazo JM, Pascual-Montano A: GENECODIS: a web-based tool for finding significant concurrent annotations in gene lists. Genome Biol 2007, 8(1):R3. 10.1186/gb-2007-8-1-r3
- Zhang X, Odom DT, Koo SH, Conkright MD, Canettieri G, Best J, Chen H, Jenner R, Herbolsheimer E, Jacobsen E, Kadam S, Ecker JR, Emerson B, Hogenesch JB, Unterman T, Young RA, Montminy M: Genome-wide analysis of cAMP-response element binding protein occupancy, phosphorylation, and target gene activation in human tissues. Proc Natl Acad Sci USA 2005, 102(12):4459–4464. 10.1073/pnas.0501076102
- Giacconi R, Caruso C, Malavolta M, Lio D, Balistreri CR, Scola L, Candore G, Muti E, Mocchegiani E: Pro-inflammatory genetic background and zinc status in old atherosclerotic subjects. Ageing Res Rev 2008, 7(4):306–318. 10.1016/j.arr.2008.06.001
- Barouki R, Morel Y: Repression of cytochrome P450 1A1 gene expression by oxidative stress: mechanisms and biological implications. Biochem Pharmacol 2001, 61(5):511–516. 10.1016/S0006-2952(00)00543-8
- Chetty R, Dada MA, Boshoff CH, Comley MA, Biddolph SC, Schneider JW, Mason DY, Pulford KA, Gatter KC: TAL-1 protein expression in vascular lesions. J Pathol 1997, 181(3):311–315. 10.1002/(SICI)1096-9896(199703)181:3<311::AID-PATH775>3.0.CO;2-B
- Lazrak M, Deleuze V, Noel D, Haouzi D, Chalhoub E, Dohet C, Robbins I, Mathieu D: The bHLH TAL-1/SCL regulates endothelial cell migration and morphogenesis. J Cell Sci 2004, 117(Pt 7):1161–1171.
- Rudini N, Felici A, Giampietro C, Lampugnani M, Corada M, Swirsding K, Garre M, Liebner S, Letarte M, ten Dijke P, Dejana E: VE-cadherin is a critical endothelial regulator of TGF-beta signalling. EMBO J 2008, 27(7):993–1004. 10.1038/emboj.2008.46
- Dinu I, Potter JD, Mueller T, Liu Q, Adewale AJ, Jhangri GS, Einecke G, Famulski KS, Halloran P, Yasui Y: Improving gene set analysis of microarray data by SAM-GS. BMC Bioinformatics 2007, 8: 242. 10.1186/1471-2105-8-242
- Damian D, Gorfine M: Statistical concerns about the GSEA procedure. Nat Genet 2004, 36(7):663. author reply 663
- Kankainen M, Pehkonen P, Rosenstom P, Toronen P, Wong G, Holm L: POXO: a web-enabled tool series to discover transcription factor binding sites. Nucleic Acids Res 2006, 34(Web Server issue):W534–40.
- Ho Sui SJ, Mortimer JR, Arenillas DJ, Brumm J, Walsh CJ, Kennedy BP, Wasserman WW: oPOSSUM: identification of over-represented transcription factor binding sites in co-expressed genes. Nucl Acids Res 2005, 33(10):3154–3164. 10.1093/nar/gki624
- Rhee SY, Wood V, Dolinski K, Draghici S: Use and misuse of the gene ontology annotations. Nat Rev Genet 2008, 9(7):509–515. 10.1038/nrg2363
- Hannenhalli S: Eukaryotic transcription factor binding sites--modeling and integrative search methods. Bioinformatics 2008, 24(11):1325–1331. 10.1093/bioinformatics/btn198
- Halkidi M, Batistakis Y, Vazirgiannis M: Clustering validity checking methods: Part I. ACM SIGMOD Record 2002, 31(2):40–45. 10.1145/565117.565124
- Halkidi M, Batistakis Y, Vazirgiannis M: Clustering validity checking methods: part II. ACM SIGMOD Rec 2002, 31(3):19–27. 10.1145/601858.601862
- Puga A, Ma C, Marlowe JL: The aryl hydrocarbon receptor cross-talks with multiple signal transduction pathways. Biochem Pharmacol 2009, 77(4):713–722. 10.1016/j.bcp.2008.08.031
- Tan Z, Chang X, Puga A, Xia Y: Activation of mitogen-activated protein kinases (MAPKs) by aromatic hydrocarbons: role in the regulation of aryl hydrocarbon receptor (AHR) function. Biochem Pharmacol 2002, 64(5–6):771–780. 10.1016/S0006-2952(02)01138-3
- Oesch-Bartlomowicz B, Oesch F: Role of cAMP in mediating AHR signaling. Biochem Pharmacol 2009, 77(4):627–641. 10.1016/j.bcp.2008.10.017
- Thomsen JS, Kietz S, Strom A, Gustafsson JA: HES-1, a novel target gene for the aryl hydrocarbon receptor. Mol Pharmacol 2004, 65(1):165–171. 10.1124/mol.65.1.165
- Rowlands JC, Gustafsson JA: Aryl hydrocarbon receptor-mediated signal transduction. Crit Rev Toxicol 1997, 27(2):109–134. 10.3109/10408449709021615
- Krishna M, Narang H: The complexity of mitogen-activated protein kinases (MAPKs) made simple. Cell Mol Life Sci 2008, 65(22):3525–3544. 10.1007/s00018-008-8170-7
- de Rooij NK, Linn FH, van der Plas JA, Algra A, Rinkel GJ: Incidence of subarachnoid haemorrhage: a systematic review with emphasis on region, age, gender and time trends. J Neurol Neurosurg Psychiatry 2007, 78(12):1365–1372. 10.1136/jnnp.2007.117655
- Dorhout Mees SM, Rinkel GJ, Feigin VL, Algra A, van den Bergh WM, Vermeulen M, van Gijn J: Calcium antagonists for aneurysmal subarachnoid haemorrhage. Cochrane Database Syst Rev 2007, (3):CD000277.
- Flicek P, Aken BL, Beal K, Ballester B, Caccamo M, Chen Y, Clarke L, Coates G, Cunningham F, Cutts T, Down T, Dyer SC, Eyre T, Fitzgerald S, Fernandez-Banet J, Graf S, Haider S, Hammond M, Holland R, Howe KL, Howe K, Johnson N, Jenkinson A, Kahari A, Keefe D, Kokocinski F, Kulesha E, Lawson D, Longden I, Megy K, et al.: Ensembl 2008. Nucleic Acids Res 2008, 36(Database issue):D707–14.
- Wingender E, Dietze P, Karas H, Knuppel R: TRANSFAC: a database on transcription factors and their DNA binding sites. Nucleic Acids Res 1996, 24(1):238–241. 10.1093/nar/24.1.238
- Bryne JC, Valen E, Tang MH, Marstrand T, Winther O, da Piedade I, Krogh A, Lenhard B, Sandelin A: JASPAR, the open access database of transcription factor-binding profiles: new content and tools in the 2008 update. Nucleic Acids Res 2008, 36(Database issue):D102–6.
- Lee DD, Seung HS: Learning the parts of objects by non-negative matrix factorization. Nature 1999, 401(6755):788–791. 10.1038/44565
- Lavrac N, Gamberger D, Todorovski L, Blockeel H: Proceedings of the Knowledge Discovery in Databases: PKDD 2003: 7th European Conference on Principles and Practice of Knowledge Discovery in Databases. Springer-Verlag 2003.
- Akaike H: A new look at the statistical model identification. IEEE Transactions on Automatic Control 1974, 19(6):716–723. 10.1109/TAC.1974.1100705
- Chen X, Murphy RF: Objective clustering of proteins based on subcellular location patterns. J Biomed Biotechnol 2005, 2005(2):87–95. 10.1155/JBB.2005.87
- Liu T, Lin N, Shi N, Zhang B: Information criterion-based clustering with order-restricted candidate profiles in short time-course microarray experiments. BMC Bioinformatics 2009, 10: 146. 10.1186/1471-2105-10-146
- Huang J, Shimizu H, Shioya S: Clustering gene expression pattern and extracting relationship in gene network based on artificial neural networks. J Biosci Bioeng 2003, 96(5):421–428.
- Benjamini Y, Hochberg Y: Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. Journal of the Royal Statistical Society. Series B (Methodological) 1995, 57(1):289–300.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.