AffyMiner: mining differentially expressed genes and biological knowledge in GeneChip microarray data
© Lu et al; licensee BioMed Central Ltd 2006
Published: 12 December 2006
DNA microarrays are a powerful tool for monitoring the expression of tens of thousands of genes simultaneously. As microarray technology advances, the challenge becomes how to analyze large amounts of microarray data and make biological sense of them. Affymetrix GeneChips are widely used microarrays, for which a variety of statistical algorithms have been explored and used to detect significant genes in an experiment. These methods rely solely on the quantitative data, i.e., signal intensity; however, qualitative data are also important parameters in detecting differentially expressed genes.
AffyMiner is a tool developed for detecting differentially expressed genes in Affymetrix GeneChip microarray data and for associating gene annotation and gene ontology information with the genes detected. AffyMiner consists of two functional modules: GeneFinder, for detecting significant genes in a treatment versus control experiment, and GOTree, for mapping genes of interest onto the Gene Ontology (GO) space. It also provides interfaces to run Cluster, a program for clustering analysis, and GenMAPP, a program for pathway analysis. AffyMiner has been used to analyze GeneChip data, and the results have been presented in several publications.
AffyMiner fills an important gap in finding differentially expressed genes in Affymetrix GeneChip microarray data. AffyMiner effectively deals with multiple replicates in the experiment and takes into account both quantitative and qualitative data in identifying significant genes. AffyMiner reduces the time and effort needed to compare data from multiple arrays and to interpret the possible biological implications associated with significant changes in a gene's expression.
DNA microarrays are a powerful tool for monitoring the expression of tens of thousands of genes simultaneously. Affymetrix GeneChips are widely used microarrays in which a collection of 11–20 probe pairs, called a probe set, measures the expression of each transcript. Each probe pair comprises a perfect match (PM) probe and a single-base mismatch (MM) probe to the target mRNA region.
GeneChip microarrays use a statistical algorithm in the Microarray Suite 5.0 (MAS 5.0; Affymetrix) to estimate the variance among probe pairs within a probe set and to compute an expression index that represents transcript abundance. The MAS 5.0 algorithm uses the One-Step Tukey's Biweight Estimate to compute the Signal intensity for each probe set, and employs the Wilcoxon signed-rank test to assess the Detection calls and p-values for a single array analysis [3, 4]. The algorithm uses normalization and scaling techniques to correct for variations between two arrays. The comparison analysis of two arrays yields data metrics such as Change p-value, Change, and Signal Log Ratio. For replicate sample analysis, two-sample statistical tests such as the Student t-test or the Mann-Whitney test can be used to test whether the signal intensity values for each probe set differ significantly between the treatment group and the control group. Such statistical tests are not ideal for finding significant genes, because only a few replicate samples (< 4) are usually used in microarray experiments. Determining the most appropriate statistical method for detecting differentially expressed genes in GeneChip replicate data remains a challenging issue.
Several methods have been developed to improve the sensitivity and selectivity for detecting significant genes in GeneChip microarray experiments. The widely used algorithms include the robust multiarray average (RMA), the model based expression index/intensity (MBEI) implemented in dCHIP software, and the position-dependent nearest-neighbor model (PDNN). These algorithms effectively deal with the 'probe effect', that is, the tendency of some probes in a probe set to give higher values than others, by re-computing the signal intensity for each probe set from the processed image data exported from Affymetrix Microarray Suite or GeneChip Operating Software (GCOS). These methods rely solely on the quantitative data, i.e., signal intensity, for the comparison analysis. However, qualitative data such as the Detection call are also important parameters in detecting significant genes. Using a threshold fraction of Present detection calls can eliminate the unreliable probe sets while preserving the most significant ones. A combination of a qualitative parameter (change call) and two quantitative parameters (fold change and signal mean ratios) greatly reduces the false positives, whereas using a single parameter gives a false positive rate greater than 30%.
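As an illustration of the One-Step Tukey's Biweight Estimate mentioned above, the following Python sketch computes a robust average that down-weights outlying probe values. The function name is hypothetical, and the tuning constant c = 5 and offset epsilon = 0.0001 are assumptions based on the commonly cited Affymetrix description rather than values stated in this article. The background adjustment and log transformation that precede this step in MAS 5.0 are omitted.

```python
from statistics import median


def tukey_biweight(values, c=5.0, epsilon=1e-4):
    """One-step Tukey biweight average of a list of numbers (illustrative)."""
    center = median(values)
    # The median absolute deviation (MAD) measures the spread robustly.
    mad = median([abs(x - center) for x in values])
    weighted_sum = 0.0
    weight_total = 0.0
    for x in values:
        # Scaled distance from the median; epsilon avoids division by zero.
        u = (x - center) / (c * mad + epsilon)
        # Bisquare weight: values with |u| >= 1 receive zero weight.
        weight = (1.0 - u * u) ** 2 if abs(u) < 1.0 else 0.0
        weighted_sum += weight * x
        weight_total += weight
    return weighted_sum / weight_total


probe_values = [10.0, 11.0, 9.0, 10.0, 1000.0]  # the last value is an outlier
print(tukey_biweight(probe_values))
```

For these made-up values the estimate stays close to 10, whereas the arithmetic mean is 208, which shows why a robust estimator is preferred when a few probes in a probe set misbehave.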
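The Present-call filter described above can be sketched in a few lines of Python. This is a minimal illustration only: the function name, the probe set identifiers, and the 0.5 threshold are hypothetical, and "P", "M", and "A" stand for the Present, Marginal, and Absent detection calls.

```python
def passes_present_filter(detection_calls, min_fraction=0.5):
    """Return True if enough arrays report a Present ("P") call."""
    present = sum(1 for call in detection_calls if call == "P")
    return present / len(detection_calls) >= min_fraction


# Hypothetical detection calls across four arrays.
calls_by_probe_set = {
    "probe_set_1": ["P", "P", "A", "P"],
    "probe_set_2": ["A", "A", "M", "P"],
}
kept = [
    probe_id
    for probe_id, calls in calls_by_probe_set.items()
    if passes_present_filter(calls)
]
print(kept)
```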
Here we present a software tool called AffyMiner that uses both the quantitative and the qualitative data metrics for detecting differentially expressed genes in GeneChip data. In addition, AffyMiner has functions for connecting gene annotation information and Gene Ontology (GO) descriptions to the detected significant genes for better biological interpretation of the results.
The following requirements were established from discussions with the users of our Microarray Core Facility over the past three years.
Compatibility with the data formats exported from Affymetrix MAS or GCOS. The exported data contain Probe sets, Signal detection, Signal value, Signal log ratio, Change, Change p-value, etc.
Provide the user the flexibility to choose different data metrics and different threshold values when filtering for differentially expressed genes.
Incorporate statistical analysis for the selection of significant genes.
Facilitate exploratory analyses such as clustering analysis.
Incorporate information from Gene Ontology and metabolic pathways.
Have easy-to-use graphical interfaces and provide ready-to-publish charts and tables.
The algorithm for detecting significantly down-regulated genes is as follows: 1) eliminate the probe sets with signal Detection calls of "Absent" in the control samples; 2) select the probe sets with signal Change calls of "Decrease"; 3) eliminate the probe sets with a Signal Log Ratio above a threshold defined by the user; and 4) remove the probe sets with the p-value above a threshold defined by the user.
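The four filtering steps above can be sketched in Python as follows. This is not the AffyMiner source code, which was written in VB .Net. The class, field, and function names are hypothetical, and so are the default thresholds. How the calls from several replicates are combined is also an assumption here: the sketch requires every control array to be non-Absent, every comparison to be called Decrease, and the mean Signal Log Ratio to satisfy the threshold.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ProbeSet:
    probe_id: str
    control_detection: List[str]    # "P", "M", or "A" for each control array
    change_calls: List[str]         # "I", "MI", "NC", "MD", or "D" per comparison
    signal_log_ratios: List[float]  # log2(treatment / control) per comparison
    p_value: float                  # from the chosen two-sample test


def find_down_regulated(probe_sets, slr_threshold=-1.0, p_threshold=0.05):
    selected = []
    for ps in probe_sets:
        # Step 1: drop probe sets called Absent in any control sample.
        if "A" in ps.control_detection:
            continue
        # Step 2: keep only probe sets called Decrease in every comparison.
        if any(call != "D" for call in ps.change_calls):
            continue
        # Step 3: drop probe sets whose mean Signal Log Ratio is above the threshold.
        mean_slr = sum(ps.signal_log_ratios) / len(ps.signal_log_ratios)
        if mean_slr > slr_threshold:
            continue
        # Step 4: drop probe sets whose p-value is above the threshold.
        if ps.p_value > p_threshold:
            continue
        selected.append(ps.probe_id)
    return selected


# Made-up example data: only "ps_a" passes all four steps.
example = [
    ProbeSet("ps_a", ["P", "P"], ["D", "D"], [-1.5, -2.0], 0.01),
    ProbeSet("ps_b", ["A", "P"], ["D", "D"], [-1.5, -2.0], 0.01),
    ProbeSet("ps_c", ["P", "P"], ["D", "NC"], [-1.5, -2.0], 0.01),
    ProbeSet("ps_d", ["P", "P"], ["D", "D"], [-0.3, -0.5], 0.01),
    ProbeSet("ps_e", ["P", "P"], ["D", "D"], [-1.5, -1.2], 0.20),
]
print(find_down_regulated(example))
```

Because a Signal Log Ratio is a base-2 logarithm, a threshold of -1 corresponds to at least a two-fold decrease.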
The Gene Ontology (GO) Consortium produces structures of biological knowledge using a controlled vocabulary consisting of GO terms. GO terms are organized into three general categories: biological process, molecular function, and cellular component. The terms within each category are linked in defined parent-child relationships that reflect current biological knowledge. Genes from different organisms are systematically associated with the GO terms, and these associations continue to grow in complexity and detail as sequence databases and experimental knowledge grow. GO provides a useful tool to look for common features shared within a list of genes.
The high-level description of the algorithm for building the GO tree is as follows: 1) read the output file generated by GeneFinder; 2) write in an array the GO IDs and their corresponding Affymetrix probe set IDs; 3) find the GO Path IDs for each GO ID in the array and add the GO Path IDs to each element in the array; 4) sort by the GO Path IDs and compute the sum of the probe sets associated with each node; 5) build the entire tree based on the GO Path IDs and write in each node the GO term, GO ID, and the number of probe sets.
AffyMiner was programmed in Visual Basic (VB) .Net on the Microsoft .Net platform. VB .Net is the latest version of the Microsoft Visual Basic language. It has many attractive features, such as ease of use, full object orientation, and true visual development.
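The tree-building steps above can be illustrated with a short Python sketch. It is an approximation under stated assumptions, not the GOTree implementation: a GO Path ID is represented as a tuple of GO IDs from the root to the node, the count at each node is taken to be the number of distinct probe sets annotated at or below that node, and the probe set IDs, the paths, and the function names are hypothetical.

```python
from collections import defaultdict


def build_go_tree(annotations, go_paths, go_terms):
    # Steps 2-4: collect the probe sets found at or below every node on each path.
    probes_by_path = defaultdict(set)
    for probe_id, go_id in annotations:
        for path in go_paths[go_id]:
            for depth in range(1, len(path) + 1):
                probes_by_path[path[:depth]].add(probe_id)

    # Step 5: walk the sorted paths and create the nested nodes.
    tree = {}
    for path in sorted(probes_by_path):
        children = tree
        for depth, go_id in enumerate(path, start=1):
            node = children.setdefault(go_id, {
                "term": go_terms[go_id],
                "count": len(probes_by_path[path[:depth]]),
                "children": {},
            })
            children = node["children"]
    return tree


def print_tree(nodes, indent=0):
    for go_id, node in nodes.items():
        print(" " * indent + f"{node['term']} ({go_id}): {node['count']} probe sets")
        print_tree(node["children"], indent + 2)


# Illustrative inputs; a real run would read them from the GeneFinder output.
go_terms = {
    "GO:0008150": "biological_process",
    "GO:0008152": "metabolic process",
    "GO:0006629": "lipid metabolic process",
    "GO:0009987": "cellular process",
}
go_paths = {
    "GO:0008152": [("GO:0008150", "GO:0008152")],
    "GO:0006629": [("GO:0008150", "GO:0008152", "GO:0006629")],
    "GO:0009987": [("GO:0008150", "GO:0009987")],
}
annotations = [
    ("probe_set_1", "GO:0006629"),
    ("probe_set_2", "GO:0006629"),
    ("probe_set_2", "GO:0009987"),
    ("probe_set_3", "GO:0008152"),
]
tree = build_go_tree(annotations, go_paths, go_terms)
print_tree(tree)
```

Storing each path as a tuple keeps the sort order aligned with the tree hierarchy, so every parent node is created before its children.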
GeneFinder has two programs: Significant Genes for finding differentially expressed genes satisfying the user defined criteria, and Annotation for linking gene annotation information with the gene list.
The Significant Genes program has interactive interfaces to set up parameters, upload input files, and define the output. The parameter-setting window contains three frames for setting up the number of replicates, the direction of a robust change, and the data metrics for detecting differentially expressed genes. AffyMiner limits the maximum number of replicates to five. This is a reasonable limit because the reproducibility of Affymetrix GeneChip array data is high and most publications use two to three replicates in their experiments. The data metrics consist of Signal Detection, Signal Change, Signal Log Ratio and Statistical Test. The user can choose the data metrics and threshold values for each analysis.
The Annotation program links the annotation information with gene lists, and generates a user-defined table with quantitative data such as signal log ratio and qualitative data such as annotation information. The NetAffx annotation file, which can be downloaded from the Affymetrix website, needs to be in the CSV (Comma Separated Value) format.
The input file for the gene list can be the result generated by Significant Genes or any text file with a column corresponding to Affymetrix probe set IDs. Once these two files are uploaded, the data items in the output table can be chosen from the left list box. The user can remove unwanted items from the right list box; removed items will not be shown in the output table.
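The linking step performed by the Annotation program amounts to a join on the probe set ID. The Python sketch below illustrates the idea with in-memory files. The column names ("Probe Set ID", "Gene Symbol", "Gene Title"), the tab-delimited gene list layout, and all values are assumptions made for this example and may differ from the actual NetAffx and GeneFinder files.

```python
import csv
import io

# Stand-in for a NetAffx annotation file in CSV format (hypothetical content).
annotation_csv = io.StringIO(
    '"Probe Set ID","Gene Symbol","Gene Title"\n'
    '"probe_set_1","GeneA","hypothetical gene A"\n'
    '"probe_set_2","GeneB","hypothetical gene B"\n'
)
# Stand-in for a tab-delimited gene list with a probe set ID column.
gene_list_txt = io.StringIO(
    "Probe Set ID\tSignal Log Ratio\n"
    "probe_set_2\t-1.8\n"
    "probe_set_9\t-1.2\n"
)


def annotate(gene_list_file, annotation_file, columns):
    # Index the annotation rows by probe set ID for constant-time lookup.
    annotations = {
        row["Probe Set ID"]: row for row in csv.DictReader(annotation_file)
    }
    table = []
    for gene in csv.DictReader(gene_list_file, delimiter="\t"):
        extra = annotations.get(gene["Probe Set ID"], {})
        merged = dict(gene)
        # Append only the annotation columns the user selected.
        for column in columns:
            merged[column] = extra.get(column, "")
        table.append(merged)
    return table


rows = annotate(gene_list_txt, annotation_csv, ["Gene Symbol", "Gene Title"])
for row in rows:
    print(row)
```

Probe sets without a matching annotation row are kept with empty annotation fields, so the output table always has one row per input gene.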
Interfaces to Cluster and GenMAPP
Both Cluster and GenMAPP programs need to be downloaded and installed on the local computer (see below for system requirements of the computer). Go to the websites, http://rana.lbl.gov/EisenSoftware.htm and http://www.genmapp.org/download.asp to download Cluster and GenMAPP, respectively. In the main window, clicking the button "Set Path ..." will set up the path to the corresponding program file (Figure 1). Clicking the button Cluster or GenMAPP will run the program for analysis.
AffyMiner has been tested by multiple users and their feedback has been incorporated into its current version. Results analyzed by AffyMiner have been presented in several publications [18, 19]. In the following example, we describe a case study using AffyMiner to compare the lists of differentially expressed genes detected by AffyMiner and the RMA method.
Our group (M. Fromm and Y. Xia) studied the gene expression changes in the retroperitoneal white adipose tissue (RP-WAT) in mice fed trans-10, cis-12 conjugated linoleic acid (t10c12 CLA). The Affymetrix Mouse Genome 430 2.0 microarrays were used to detect the expression changes of about 34,000 transcripts. Mice were sampled 1, 2, 3, 4, 7, 10, or 17 days after being fed control or 0.5% t10c12 CLA diets, generating 7 time points in total. At each time point, the RP-WAT tissues of ten control and ten t10c12 CLA-fed mice were harvested in groups of five mice each to provide two control and two treatment samples for microarray analysis.
Table: Differentially expressed genes detected by the AffyMiner and RMA approaches (table body not recovered; the one surviving column header is "Common in both").
Microarray technology has revolutionized the analysis of gene expression. The challenge associated with this high throughput technology is the statistical analysis and biological interpretation of microarray data. AffyMiner was developed to address these issues through finding genes with significant changes in gene expression, and linking these genes with the annotation and Gene Ontology information. Functionally, AffyMiner has overlap with other existing programs, but has the distinguishing features discussed below.
Affymetrix Data Mining Tool (DMT) can filter genes of interest based on the thresholds of certain quantitative and qualitative parameters, but it is not as powerful as AffyMiner in this respect. AffyMiner takes full advantage of the range of the different data metrics available from MAS 5.0. AffyMiner provides the flexibility to choose different data metrics (Signal Detection, Signal Change, Signal Log Ratio, and Statistical Test) and to set threshold values for analyzing differentially expressed genes. This flexibility is very important since no single analysis method outperforms all other methods of analyzing microarray data [23, 24]. This is evident from the different gene lists generated by AffyMiner and the RMA based approach. Incorporating qualitative data metrics such as Detection and Signal Change would increase the selectivity of detecting differentially expressed genes [24, 25].
GenePicker has certain functions similar to those in AffyMiner. GenePicker was developed for the analysis of replicates of Affymetrix gene expression microarrays. The GenePicker analysis is done through defining analysis schemes, data normalization, t-test/ANOVA, and Change-fold Change-analysis, and the use of Change Call, Fold Change, and Signal mean ratios. GenePicker provides a comparison of noise and signal analysis schemes for determining a signal-to-noise ratio in a given experiment, which is not available in GeneFinder. However, GeneFinder uses one more data metric, i.e., Detection. As mentioned earlier, GeneFinder also has the function of incorporating gene annotation information with expression data, which is not available in GenePicker.
The Affymetrix NetAffx Gene Ontology Mining Tool can create a graph of GO terms associated with the input probe sets. However, the graph is very difficult to read compared with the one generated by AffyMiner (Figure 7). AffyMiner has the flexibility of displaying the GO tree at different levels, and the probe sets associated with the GO terms can be viewed easily. Another GO tool called GoSurfer was developed for the GO analysis of Affymetrix GeneChip data [7, 14, 17]. GoSurfer associates user input gene lists with GO terms and visualizes such GO terms as a hierarchical tree. GoSurfer compares two lists of genes in order to find which GO terms are enriched in one list of genes but relatively depleted in another. GoSurfer cannot map genes from a single list onto the GO descriptions. In this regard, GOTree and GoSurfer complement each other in the analysis of Gene Ontology.
As a whole, AffyMiner fills an important gap in finding differentially expressed genes from Affymetrix GeneChip microarray data. AffyMiner effectively deals with multiple replicates in the experiment, gives users flexibility in choosing different data metrics for detecting significant genes, and is capable of incorporating various gene annotations. AffyMiner has been used to analyze the GeneChip data for several publications, where it has reduced the time and effort needed to compare data from multiple arrays and to interpret the possible biological implications associated with significant changes in a gene's expression.
Availability and requirements
Project name: AffyMiner project
Project home page: http://bioinfo-srv1.awh.unomaha.edu/affyminer/
Operating system(s): Microsoft Windows 2000 or later
Programming language: Visual Basic .Net.
Installation: To install AffyMiner, double click on AffyMinerInstaller.msi and follow the instructions.
Any restrictions to use by non-academics: yes, contact the author GL for details.
This publication was made possible by NSF Grant Number EPS-0346476 from the NSF EPSCoR program and by NIH Grant Number P20 RR16469 from the INBRE Program of the National Center for Research Resources. GL acknowledges the Pre-tenure Award from the University of Nebraska at Omaha. The authors are grateful to Dr. L. Harshman for allowing us to use the Drosophila microarray data and to a number of users for providing feedback on AffyMiner.
This article has been published as part of BMC Bioinformatics Volume 7, Supplement 4, 2006: Symposium of Computations in Bioinformatics and Bioscience (SCBB06). The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/7?issue=S4.
- Lockhart DJ, Dong H, Byrne MC, Follettie MT, Gallo MV, Chee MS, Mittmann M, Wang C, Kobayashi M, Horton H, et al.: Expression monitoring by hybridization to high-density oligonucleotide arrays. Nat Biotechnol 1996, 14(13):1675–1680. 10.1038/nbt1296-1675
- Clarke JD, Zhu T: Microarray analysis of the transcriptome as a stepping stone towards understanding biological systems: practical considerations and perspectives. Plant J 2006, 45(4):630–650. 10.1111/j.1365-313X.2006.02668.x
- Hubbell E, Liu WM, Mei R: Robust estimators for expression analysis. Bioinformatics 2002, 18(12):1585–1592. 10.1093/bioinformatics/18.12.1585
- Liu WM, Mei R, Di X, Ryder TB, Hubbell E, Dee S, Webster TA, Harrington CA, Ho MH, Baid J, et al.: Analysis of high density expression microarrays with signed-rank call algorithms. Bioinformatics 2002, 18(12):1593–1599. 10.1093/bioinformatics/18.12.1593
- Affymetrix: GeneChip expression analysis – data analysis fundamentals. 2006.
- Irizarry RA, Bolstad BM, Collin F, Cope LM, Hobbs B, Speed TP: Summaries of Affymetrix GeneChip probe level data. Nucleic Acids Res 2003, 31(4):e15. 10.1093/nar/gng015
- Li C, Wong WH: Model-based analysis of oligonucleotide arrays: expression index computation and outlier detection. Proc Natl Acad Sci U S A 2001, 98(1):31–36. 10.1073/pnas.011404098
- Zhang L, Miles MF, Aldape KD: A model of molecular interactions on short oligonucleotide microarrays. Nat Biotechnol 2003, 21(7):818–821. 10.1038/nbt836
- McClintick JN, Edenberg HJ: Effects of filtering by Present call on analysis of microarray experiments. BMC Bioinformatics 2006, 7: 49. 10.1186/1471-2105-7-49
- Finocchiaro G, Parise P, Minardi SP, Alcalay M, Muller H: GenePicker: replicate analysis of Affymetrix gene expression microarrays. Bioinformatics 2004, 20(18):3670–3672. 10.1093/bioinformatics/bth416
- GeneSpring Analysis Platform[http://www.agilent.com/chem/genespring]
- Dudoit S, Gentleman RC, Quackenbush J: Open source software for the analysis of microarray data. Biotechniques 2003, (Suppl):45–51.
- Harris MA, Clark J, Ireland A, Lomax J, Ashburner M, Foulger R, Eilbeck K, Lewis S, Marshall B, Mungall C, et al.: The Gene Ontology (GO) database and informatics resource. Nucleic Acids Res 2004, 32(Database):D258–261.
- Zhong S, Tian L, Li C, Storch F, Wong W: Comparative Analysis of Gene Sets in the Gene Ontology Space under the Multiple Hypothesis Testing Framework. Proc IEEE Comp Systems Bioinformatics 2004, 425–435.
- Smiley J: Learn to Program with Visual Basic.NET. Osborne McGraw-Hill 2002.
- Affymetrix, Inc[http://www.affymetrix.com/]
- Zhong S, Li C, Wong WH: ChipInfo: Software for extracting gene annotation and gene ontology information for microarray analysis. Nucleic Acids Res 2003, 31(13):3483–3486. 10.1093/nar/gkg598
- Alvarez-Venegas R, Sadder M, Hlavacka A, Baluska F, Xia Y, Lu G, Firsov A, Sarath G, Moriyama H, Dubrovsky JG, et al.: The Arabidopsis homolog of trithorax, ATX1, binds phosphatidylinositol 5-phosphate, and the two regulate a common set of target genes. Proc Natl Acad Sci U S A 2006, 103(15):6049–6054. 10.1073/pnas.0600944103
- Alvarez-Venegas R, Xia Y, Lu G, Avramova Z: Phosphoinositide 5-Phosphate and Phosphoinositide 4-Phosphate Trigger Distinct Specific Responses of Arabidopsis Genes; Genome-Wide Expression Analyses. Plant Signaling & Behavior 2006, 1(3):140–151.
- Larosa PC, Miner J, Xia Y, Zhou Y, Kachman S, Fromm ME: Trans-10, Cis-12 Conjugated Linoleic Acid Causes Inflammation And Delipidation Of White Adipose Tissue In Mice: A Microarray And Histological Analysis. Physiol Genomics 2006.
- Smyth GK, Michaud J, Scott H: The use of within-array replicate spots for assessing differential expression in microarray experiments. Bioinformatics 2005, 21(9):2067–2075. 10.1093/bioinformatics/bti270
- Smyth GK: Linear Models and Empirical Bayes Methods for Assessing Differential Expression in Microarray Experiments. Statistical Applications in Genetics and Molecular Biology 2004, 3(1): Article 3.
- Breitling R, Armengaud P, Amtmann A, Herzyk P: Rank products: a simple, yet powerful, new method to detect differentially regulated genes in replicated microarray experiments. FEBS Lett 2004, 573(1–3):83–92. 10.1016/j.febslet.2004.07.055
- Millenaar FF, Okyere J, May ST, van Zanten M, Voesenek LA, Peeters AJ: How to decide? Different methods of calculating gene expression from short oligonucleotide array data will give different results. BMC Bioinformatics 2006, 7: 137. 10.1186/1471-2105-7-137
- Affymetrix Inc: GeneChip expression analysis – data analysis fundamentals. 2006.
- AffyMiner project[http://bioinfo-srv1.awh.unomaha.edu/affyminer/]
- Microsoft .Net Framework Developer Center[http://msdn.microsoft.com/netframework/]
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.