FACT – a framework for the functional interpretation of high-throughput experiments
© Kokocinski et al. 2005
Received: 14 March 2005
Accepted: 28 June 2005
Published: 28 June 2005
Interpreting the results of high-throughput experiments, such as those obtained from DNA microarrays, is often a time-consuming task because of the large number of data points that need to be analyzed in parallel. Which of the possible approaches to functional analysis will be the most informative is usually not known beforehand and has to be established through extensive testing.
To address this problem, we have developed the Flexible Annotation and Correlation Tool (FACT). FACT allows for detection of important patterns in large data sets by simplifying the integration of heterogeneous data sources and the subsequent application of different algorithms for statistical evaluation or visualization of the annotated data. The system is constantly extended to include additional annotation data and comparison methods.
FACT serves as a highly flexible framework for the explorative analysis of large genomic and proteomic result sets. The program can be used online; open source code and supplementary information are available at http://www.factweb.de.
A variety of algorithms and programs have been introduced for the processing of raw data and for the statistical analysis of data from high-throughput experiments. Besides the mathematical complexity that needs to be handled, there is also a biological complexity inherent to the data sets. Current means of analyzing large-scale data sets usually target very specific questions and often fail to provide solutions that can be adapted to different types of data. Nevertheless, common and generalized questions for the interpretation of such data can be formulated as follows: i) What information is known about the analyzed features (e.g. clones, genes)? ii) Are there correlations between the experimental outcomes and the additional information (shared pathways, etc.)? iii) Is the outcome comparable with results of other experiments (genomic or gene expression data sets, publications, etc.)?
The program Flexible Annotation and Correlation Tool (FACT) was developed to address these questions by integrating data sources, tools and algorithms in a single open framework. First, FACT allows merging information from various data sources into one comprehensive annotation for an experimental data set. It then provides functional analysis tools to inspect and correlate this heterogeneous information. The functionality of FACT can be extended through the inclusion of new data sources, algorithms and programs by defining additional modules from a prototype. This flexibility is achieved by a strong level of abstraction from the actual data, by the design of the underlying database and by the modular architecture of the software itself. The task to identify relevant biological interconnections reflected by the experimental results (e.g. participation of the analyzed genes in shared pathways) is what we are targeting with the software introduced here.
The integration of bio-molecular data from diverse sources such as public databases or clinical parameters (annotation) is a key challenge in the process of the analysis of high-throughput experiments. While the interpretation of the outcome of a standard experiment used to depend on the knowledge of one human expert from the field, today's screening tools produce data quantities not manageable by human inspection. After receiving a list of differentially regulated genes from a microarray gene expression experiment comprising several hundreds or thousands of entries, it is not efficient to start the interpretation of these results by manually searching through publications. As a first step, broad biological themes should be identified and followed into a more detailed inspection.
With network technologies, the availability of data sources is no longer the limiting factor, but the obstacles to integrating them manually are numerous. Data are often made available in different formats (HTML pages, flat files, direct database access) and in very heterogeneous layouts. In addition, the nomenclature (e.g. gene names) and the relationships between different systems are often inconsistent and require many manual selection and modification steps. At the same time, as much knowledge as possible about the analyzed data features should be integrated, since interesting unknown pathways and interconnections might be hidden behind the biological complexity.
Differences in nomenclature and the problem of relating one type of experiment to another, the second obstacle to data integration, have been addressed for gene- and protein-centered research by the development of the GeneOntology system (GO). This hierarchical framework, a directed acyclic graph of annotations for gene attributes, has become a de facto standard that can be employed in the functional analysis of experiments. Similar projects have been initiated, for example to organize the classification of molecular interactions in pathways and molecular complexes (Genome Knowledgebase / Reactome). Using these resources, FACT is able to compare experimental results from technically distant applications.
However, most experiments differ in focus and design and usually no standard solution for their interpretation can be applied. The third aspect for an integrating approach therefore is high flexibility concerning the application of diverse analysis methods that have already been developed for the interpretation of results or might be used in future.
Different types of experimental data can be loaded using dedicated parser functions. This can be a simple function that reads tab-delimited data, defined e.g. as a gene list with associated expression values. It can also be a more complex solution that handles case descriptions from comparative genomic hybridization (CGH), a method that is employed to monitor copy number changes of all regions of a genome simultaneously on chromosome spreads. The individual modules read the specific file format, transform the content into the generalized format (i.e. convert it into data features) and use core library functions to store the data.
Varying data sources can be utilized for the annotation of experimental data sets by different data-access functions (e.g. GO terms for gene names). Modules achieve this for instance through access to an online database or to a local copy of such database. Data of interest are then gathered and stored.
Different functions can be used to inspect the annotated information and highlight underlying patterns (e.g. overrepresented GO-terms). The modules typically produce textual and graphical output to draw the researcher's attention to the most promising features of his data.
Currently available functions of FACT are described below. Further flexibility is achieved by the concept of experimental and annotational data being reduced to the basic model of one data set with several data features, as described above.
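The basic model of one data set with several data features can be pictured roughly as follows. This is a minimal Python sketch for illustration only: FACT itself is written in Perl, and the class names, field names and type labels used here (DataSet, DataFeature, "gene_symbol", "pathway_name") are hypothetical and not taken from the FACT code base.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class DataFeature:
    """One entry of a data set: a term of a given type with an optional value."""
    term: str                      # e.g. a gene symbol, clone ID or chromosomal band
    data_type: str                 # hypothetical type label, e.g. "gene_symbol"
    value: Optional[float] = None  # e.g. an expression value; None for plain lists


@dataclass
class DataSet:
    """A named collection of data features (experimental or annotational)."""
    name: str
    features: List[DataFeature] = field(default_factory=list)

    def add(self, term: str, data_type: str, value: Optional[float] = None) -> None:
        """Append one feature to the data set."""
        self.features.append(DataFeature(term, data_type, value))

    def terms(self, data_type: str) -> List[str]:
        """Return all terms of one data type, in insertion order."""
        return [f.term for f in self.features if f.data_type == data_type]


# Toy data: experimental values and an annotation term share the same model.
experiment = DataSet("toy expression list")
experiment.add("GENE1", "gene_symbol", 2.5)
experiment.add("GENE2", "gene_symbol", -1.2)
experiment.add("pathway_x", "pathway_name")
```

Because experimental and annotational entries share one structure in this sketch, an analysis function only needs to know which data type to select, not where the data came from.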
All these modules use the FACT API. It offers a defined interface for the effortless extension to new sources and functions. Prototype modules for each category implementing this API are supplied. For the integration of annotation sources, available data can either be transferred to the local system (data warehousing) or linked to the original source (database federation); the FACT system allows both options to be used by the module functions. Currently, remote databases are accessed by the EnsEMBL, BBID and Reactome modules; the CpG and CGAP functions use locally stored information (see below). The local data are updated semi-automatically by invoking the respective update function in the separate modules.
Finally, as software for the annotation and analysis of gene expression data is actively developed in the language R (Bioconductor project), and large data matrices are handled faster in R, we used the Perl/R interface RSPerl in different modules to encapsulate analysis functions written in R. Other modules employ functionality from BioPerl and the Ensembl Perl API.
Data Types and Sources accessible by current annotation modules.
Data source, access method | Data provider, data location | Type of annotation used by FACT
Ensembl, Perl API access to local or remote database | - | Ensembl ID, Gene Symbol, Gene Name, Chromosomal Location, Homologous Genes, Interpro Domains, RefSeq Accession Number, Affymetrix ID
euGenes, local database | - | euGene ID, Gene Symbol, Gene Name, GDB ID, OMIM ID, Genomic Localization, GeneOntology Terms, Protein Accession Numbers
Image Consortium, local database | - | Clone Image ID
Biological Biochemical Image Database, HTTP parser and HTTP request | National Institute on Aging, NIH (USA), http://bbid.grc.nia.nih.gov/cgi-bin/pathwaysearch.pl | Pathway Name and Image-link
GeneOntology, local database | GeneOntology Consortium, http://www.geneontology.org/GO.current.annotations.shtml | ID and Name of GO-Term (Biological Process, Molecular Function, Cellular Localization)
Cancer Genome Anatomy Project, local database | - | Biocarta name, Biocarta short name, KEGG Pathway Name, KEGG Pathway ID, PFAM ID
LocusLink / EntrezGene, local database | - | A. LocusLink ID, Gene Symbol, Gene Name, Genomic Localization, GeneOntology Terms, OMIM ID B. Key references (PubMed links)
Mouse Genome Database, local database | - | MGI ID / Gene Symbol
Internal CloneBase, local database | Deutsches Krebsforschungszentrum, Div. Molecular Genetics (D) | General Information on available Clones
CpG, local database | University of California Santa Cruz (USA), ftp://hgdownload.cse.ucsc.edu/goldenPath/currentGenomes/Homo_sapiens/database/cpgIsland.txt.gz | Calculated relative CpG content of genomic region
STRING, local database | - | Protein interaction data (computed and imported from other databases)
Affymetrix CEL files | Affymetrix Inc. / FACT, http://www.affymetrix.com | Use of Affymetrix probe IDs
Reactome, local database and HTTP request | - | -
Current data analysis and display modules
Count and display of occurrences of annotation terms
In part from GO::TermFinder | Detection of significantly overrepresented GO terms in Gene List, based upon hypergeometric tail probability
Bio::Biblio (M. Senger, EBI) | List Publications with co-occurrences of terms
Deutsches Krebsforschungszentrum, Div. Molecular Genetics (D) | Compare CGH results to archived data
goCluster, G. Wrobel, available at http://www.bioconductor.org | Detection of significantly overrepresented GO terms (based upon Fisher's exact test) in Clusters built with k-means algorithm
In part from GeneMerge | Detection of significantly overrepresented terms of any kind, based upon hypergeometric tail probability
CGH – Expression Comparison | Detect correlation between genomic and expression data sets, based on two-sided T-Tests
Display values or occurrences in genomic context
To read in experimental results, a simple list (with terms or term-value pairs) or table can be used, or more specialized parser functions can be employed to read and decipher the notations of different types of results. One parser (2_Colums) reads tab- or semicolon-separated lists and stores each entry as a term of the data type passed as a parameter (e.g. gene symbol or clone ID) together with the respective value. Other functions expect terms only (LongList, for lists of genes that are to be annotated) or Affymetrix probe IDs (AffyCelFile). While reading the data file, the parser for CGH results translates ISCN notations of cytogenetic alterations into the distinct chromosomal bands that are affected. The bands are stored with the alteration -1 (loss of genomic material), +1 (gain) or +2 (high-level amplification).
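A parser for two-column lists of this kind could look roughly like the following. This is a minimal Python sketch under stated assumptions, not the FACT parser itself (which is a Perl module): the function name parse_two_column, the tuple layout and the toy gene names are hypothetical, and terms without a value are simply stored with None.

```python
import re
from typing import List, Optional, Tuple

# (term, data type, value); a hypothetical stand-in for a stored data feature
Feature = Tuple[str, str, Optional[float]]


def parse_two_column(text: str, data_type: str) -> List[Feature]:
    """Read tab- or semicolon-separated 'term value' lines into features."""
    features: List[Feature] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip empty lines
        # Split on the first tab or semicolon only.
        parts = re.split(r"[\t;]", line, maxsplit=1)
        term = parts[0].strip()
        has_value = len(parts) > 1 and parts[1].strip() != ""
        value = float(parts[1]) if has_value else None
        features.append((term, data_type, value))
    return features


toy_input = "GENE1\t2.5\nGENE2;-1.2\n\nGENE3\n"
toy_features = parse_two_column(toy_input, "gene_symbol")
```

Passing the data type as a parameter, as the text describes, keeps the parser independent of whether the first column holds gene symbols, clone IDs or other identifiers.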
The data sources that can then be used to annotate these experimental results include, among others, the EnsEMBL databases, with functions providing numerous gene annotations for the human and murine genomes (gene name, accession numbers, genomic localization, GO terms and other IDs). The annotation data are fetched using the EnsEMBL API or direct SQL queries. Chromosomal locations expressed as cytogenetic bands can be translated into megabase-pair positions, which permits the direct comparison of results from genomic and expression experiments. Most common identifiers (IMAGE IDs, DDBJ/EMBL/GenBank accession numbers, international clone names, MGD (Mouse Genome Database) IDs or probe IDs as used on expression microarrays in the Affymetrix system) are recognized and used by the different annotation modules. Additionally, homologous genes, sequence features, InterPro protein domains, CpG content and PubMed references can be acquired. The euGenes database module delivers an additional set of broad annotations (Gene Symbol and Name, GDB ID, OMIM ID, Genomic Localization, GO terms, etc.). The BBID module searches for representations of affected pathways in the Biological Biochemical Image Database. We store links to the images, which sometimes clarify interactions better than a textual description alone. Additionally, data from STRING and Reactome are used to point out protein interactions and involvement in molecular complexes.
Annotation steps can be concatenated, which allows more general terms (e.g. gene symbols) to be derived from specific ones (e.g. Affymetrix probe IDs). The broad annotation facilitated by FACT is crucial for researchers to acquire a complete picture of their data. The user can export the combined lists of annotated data in HTML, XML or text format. If desired, the system can send them by email.
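The concatenation of annotation steps can be pictured as a chain of lookups in which the output terms of one step become the input terms of the next. The following Python sketch is an illustration under stated assumptions only: the function name annotate_chain is hypothetical, and the two lookup tables contain made-up identifiers rather than real probe IDs, gene symbols or GO terms.

```python
from typing import Dict, Iterable, List

Mapping = Dict[str, List[str]]


def annotate_chain(terms: Iterable[str], steps: List[Mapping]) -> Dict[str, List[str]]:
    """Apply several annotation lookups in sequence to each input term."""
    result: Dict[str, List[str]] = {}
    for term in terms:
        current = {term}
        for mapping in steps:
            following = set()
            for item in current:
                # Terms without an entry in this step are dropped.
                following.update(mapping.get(item, []))
            current = following
        result[term] = sorted(current)
    return result


# Made-up lookup tables standing in for two annotation modules.
probe_to_symbol = {"probe_A": ["GENE1"], "probe_B": ["GENE2", "GENE3"]}
symbol_to_term = {"GENE1": ["term_x"], "GENE2": ["term_x", "term_y"]}

chained = annotate_chain(
    ["probe_A", "probe_B", "probe_C"],
    [probe_to_symbol, symbol_to_term],
)
```

In this toy example probe_C has no entry in the first table and therefore ends with an empty annotation list, which mirrors the practical situation that not every identifier can be resolved by every source.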
The Flexible Annotation and Correlation Tool has proven especially helpful with the interpretation of results from genomic and expression microarray experiments, but most functions can be applied to a broad variety of experimental data.
We demonstrate the benefits of FACT for the functional interpretation of results from a comprehensive analysis of gene expression patterns in the development of non-melanoma skin cancer conducted within our group [L. Hummerich et al.: Identification of novel tumor-associated genes in the process of squamous cell cancer development; submitted]. Using two different sets of microarrays with 15,000 and 20,000 cDNA fragments, respectively, the chemically induced multi-step development of squamous cell carcinoma was monitored. We used the dorsal skin of mice as a well-studied system for the development of epithelial cancer, with the carcinogen 7,12-dimethylbenz-[a]-anthracene and the tumor promoter 12-O-tetradecanoylphorbol-13-acetate as inducing agents. Expression values were measured at different time points of tumor formation. Genes with differential expression patterns are expected to play a role in the development of human epidermal tumors as well.
There are several large-scale database projects that incorporate an immense spectrum of information about genes and gene products (Ensembl, euGenes, LocusLink/EntrezGene, etc.). The Ensembl project, for example, also allows users to display their own selected data sources in the context of the full genome annotation through the Distributed Annotation System, and it offers the possibility of mining the data of several genomes using the EnsMart software. FACT uses Ensembl, LocusLink/EntrezGene and euGenes data and complements them with other annotation resources; it allows the user to apply different analysis functions to the combined data.
Recently, a variety of computational tools have been introduced to aid in the interpretation of results, some of which are of interest with respect to FACT. The majority of these programs use GO annotations to interpret gene expression data. OntoExpress was introduced in 2002 and offers hypergeometric, chi-square, binomial and Fisher's exact tests to score annotation terms derived from gene lists. It also allows different methods (False Discovery Rate (FDR), Bonferroni, Holm, Sidak) to be applied for the correction for multiple testing. The program can also include KEGG pathway information and chromosomal localization and is now part of the Onto-Tools collection, which offers further functionality. EASE / DAVID offer a broad variety of annotation options in their latest version, including all major database identifiers, protein domain and pathway information. Fisher's exact test is used for the detection of enriched terms. GoMiner and numerous other tools listed at the GO website can be used for the annotation of gene lists with GO terms. GeneMerge uses the hypergeometric tail probability with Bonferroni correction to test GO terms, genomic localizations and KEGG pathway information, and is used in parts within the FACT system. FACT uses the available Perl code of GO::TermFinder for the GO annotation and for the detection of significantly overrepresented terms using the FDR. It also includes a function combining k-means clustering and Fisher's exact test. GEPAS and GECKO are two recently introduced large software packages that include functional analysis and visualization steps. In contrast to FACT, they are focused on the initial statistical evaluation and on the analysis of gene expression microarray data. GFINDer is a system that offers annotations on GO, pathway information, protein domains and genetic disorders. It analyses the data with count functions and appropriate tests (hypergeometric, binomial, Fisher's, Z or Poisson test).
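The hypergeometric tail probability with Bonferroni correction mentioned above for GeneMerge can be written out as a short calculation. The following Python sketch shows the general statistic only and is not the GeneMerge, GO::TermFinder or FACT implementation; the function names and the toy numbers are assumptions made for illustration. For a population of N genes of which K carry a given annotation term, it computes the probability that a gene list of size n contains at least k genes with that term.

```python
from math import comb
from typing import List


def hypergeom_tail(k: int, n: int, K: int, N: int) -> float:
    """P(X >= k) for X ~ Hypergeometric(N, K, n).

    N: genes in the population, K: population genes carrying the term,
    n: genes in the list, k: list genes carrying the term.
    """
    upper = min(n, K)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / comb(N, n)


def bonferroni(p_values: List[float]) -> List[float]:
    """Multiply each p-value by the number of tested terms, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Toy example: 10 genes in total, 4 of them annotated with the term,
# a list of 3 genes of which 2 carry the term.
p_term = hypergeom_tail(k=2, n=3, K=4, N=10)

# Correct the p-values of two tested terms.
corrected = bonferroni([p_term, 0.5])
```

For the toy numbers, the tail sums the cases of exactly 2 and exactly 3 annotated genes, (6 * 6 + 4 * 1) / 120 = 1/3. The same tail probability is what a one-sided Fisher's exact test on the corresponding 2x2 table yields.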
This list of available tools is far from complete and not all aspects are covered. The FACT system was developed with the focus to include and extend the functionality of tools like these. To our knowledge, the individual programs do not offer the same degree of flexibility and openness to different data sources and analysis methods. New functions can be added to FACT by simply uploading the respective module. The system is designed as an open framework for the explorative analysis using a variety of methods on annotational data. It is not restricted to or focused on Gene Ontology-based interpretation or the analysis of gene expression data alone and should facilitate the development and application of new analysis approaches. The system is constantly being extended to include additional aspects. With the submission as an open-source project we want to encourage other researchers to participate in this development.
To gain a more complete picture of results obtained from high-throughput experiments such as DNA microarrays, automated procedures are required for annotation and analysis. At the same time, which of the possible approaches to functional analysis will be the most informative or appropriate is usually not known beforehand and has to be established by testing. The Flexible Annotation and Correlation Tool offers the flexibility to integrate and compare annotation data and different algorithms in one environment by using a unified data basis. Data sets of different nature and format can be incorporated, diverse analytical algorithms can be applied, and users can add their own data integration and analysis functions. As a flexible framework for the explorative meta-analysis of genomic, proteomic or other experiments, FACT can help with the task of analyzing biological complexity, allowing researchers to bridge gaps between different kinds of experiments and to acquire a more complete interpretation of large-scale experiments.
- Project name: Flexible Annotation and Correlation Tool (FACT).
- Project home page: http://www.factweb.de
- Operating system: tested on Linux SUSE 9.1
- Programming language: Perl (5.8.1)
- Other requirements: MySQL database (4.0.15); for specific modules: R (1.8.0 with RSPerl) and Bioconductor (1.4.0); for full installation: Apache web-server (apache2-prefork-2.0.48); additional Perl modules: BioPerl (1.2.1), Ensembl (currently 28). Please refer to website for full listing.
- License: Open Source GNU GPL (see licence document)
- Any restrictions to use by non-academics: written licence needed
API: application programming interface
CGH: comparative genomic hybridization
FACT: Flexible Annotation and Correlation Tool
ISCN: International System for Human Cytogenetic Nomenclature
We are grateful for the contributions of Regina Mueller, Jochen Hess and Peter Angel within the non-melanoma skin cancer project and the helpful comments from Anja Kolb-Kokocinski and Imre Vastrik. The FACT project was supported by grants from the German Ministry for Education and Research (NGFN 01 GR 0101, NGFN 01GR 0417 and NGFN 01GR 0418).
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.