MAAMD: a workflow to standardize meta-analyses and comparison of affymetrix microarray data
© Gan et al.; licensee BioMed Central Ltd. 2014
Received: 9 October 2013
Accepted: 27 February 2014
Published: 12 March 2014
Skip to main content
© Gan et al.; licensee BioMed Central Ltd. 2014
Received: 9 October 2013
Accepted: 27 February 2014
Published: 12 March 2014
Mandatory deposit of raw microarray data files for public access, prior to study publication, provides significant opportunities to conduct new bioinformatics analyses within and across multiple datasets. Analysis of raw microarray data files (e.g. Affymetrix CEL files) can be time consuming, complex, and requires fundamental computational and bioinformatics skills. The development of analytical workflows to automate these tasks simplifies the processing of, improves the efficiency of, and serves to standardize multiple and sequential analyses. Once installed, workflows facilitate the tedious steps required to run rapid intra- and inter-dataset comparisons.
We developed a workflow to facilitate and standardize Meta-Analysis of Affymetrix Microarray Data analysis (MAAMD) in Kepler. Two freely available stand-alone software tools, R and AltAnalyze were embedded in MAAMD. The inputs of MAAMD are user-editable csv files, which contain sample information and parameters describing the locations of input files and required tools. MAAMD was tested by analyzing 4 different GEO datasets from mice and drosophila.
MAAMD automates data downloading, data organization, data quality control assesment, differential gene expression analysis, clustering analysis, pathway visualization, gene-set enrichment analysis, and cross-species orthologous-gene comparisons. MAAMD was utilized to identify gene orthologues responding to hypoxia or hyperoxia in both mice and drosophila. The entire set of analyses for 4 datasets (34 total microarrays) finished in ~ one hour.
MAAMD saves time, minimizes the required computer skills, and offers a standardized procedure for users to analyze microarray datasets and make new intra- and inter-dataset comparisons.
DNA microarrays were first developed to simultaneously quantify the expression of large numbers of genes. They are now commonly used for a variety of genetic analyses such as alternative splicing, microRNA regulation, and SNP detection. Many DNA microarray platforms have emerged but their fundamental design has remained relatively standard; they consist of arrayed DNA oligonucleotides (known as probes) that are complementary to specific DNA sequences . Since the majority of biomedical journals require that raw and normalized microarray data be accessible to the public at the time of publication, a significant number of datasets are publically available. Turning this huge trove of accessible data into useful conclusions is becoming increasingly problematic for scientists.
Currently, several types of microarray instrument platforms and their associated data formats are available, including: Illumina, Affymetrix, and Agilent. These datasets, once quality controlled and normalized, provide significant opportunities for conducting novel intra- and inter-dataset comparisons. Though only 872 of the roughly 11,000 microarray platforms currently found in the NCBI gene expression omnibus (GEO), are Affymetrix gene chips. Affymetrix platforms currently comprise ~17,300 among total ~38,700 datasets (series). This provides oppportunities to conduct meta-analyses minimizing platform biases from combining data from different array platforms, which can complicate meta-analyses .
Gene-set enrichment analysis (GSE), such as gene ontology and pathway enrichment, is a common method for microarray data analysis [3, 4]. Improved functional annotations of multiple genomes, increasing numbers of experimentally determined protein-protein interactions and improved pathway level relationships have dramatically increased the quality and scope of results from GSE analyses. Reanalysis of old datasets with the updated gene-set enrichment tools such as Gene Ontology and the Molecular Signatures Database (MSigDB) is likely to identify new signaling networks associated with the analyzed dataset . However, the analysis of microarray data involves multiple steps: data organization, quality control, normalization, differential gene expression calling, clustering and pathway analysis. Therefore, significant bioinformatics skills are required to properly conduct such microarray analyses. The development of a software-based workflow to automate these procedures would improve the efficiency and serve to standardize the multiple inter- and intra-dataset analyses.
Advances in computational techniques have also increased the number of open-source bioinformatics analysis tools for genomic analysis. Gricia et al. developed a workflow generator for microarray and sequence data . Pelizzola et al. developed an R-based package to automate microarray analysis . More R-based bioconductor packages such as GEOquery are now available, however running packages in R requires, for the most part, comfort in command-line interface since GUIs are not available for most R packages. Salomonis et al. developed an open source Python-based software called AltAnalyze for microarray and RNA sequencing analysis with the integrated gene-set enrichment software GO-Elite . AltAnalyze can also be run from a user-friendly GUI or compiled binaries can be run with simple command lines. These features combined with the increasing number of analysis options available in AltAnalyze facilitate its use by both general bench biologists and bioinformaticists wishing to incorporate its many I/O functions into analysis pipelines.
Scientific workflow systems provide an environment to aid the scientific discovery process through the combination of available tools for scientific data management, analysis, simulation, and visualization. Moreover, they provide a comfortable and intuitive graphical interface for designing and modifying workflows . Kepler is open-source software that supports user-customized scientific workflows. Stropps et al. has developed an automated workflow in Kepler utilizing Pelizzola’s package for Affymetrix microarray data analysis . Stropps’ workflow conducts the normalization, differentially expressed gene analysis (DEG), clustering analysis, and gene ontology statistics for one Affymetrix data set. In this study, we sought to develop a scientific workflow to facilitate and standardize the meta-analysis of multiple Affymetrix microarray datasets, regardless of species, utilizing state-of-the-art open-source bioinformatics programs: AltAnalyze and R/bioconductor packages. MAAMD is designed to encapsulate all steps involved in analyzing raw Affymetrix data files available in the GEO repository. MAAMD automates multiple dataset downloading, data organization, data quality control as well as normalization, several statistical methods for differential gene expression determination, mutltiple testing adjustments, clustering, and GO-Elite pathway and gene set enrichment analysis into a single, easy to use workflow. MAAMD is then expanded by enabling across-experiment/species comparisons extending the traditional intra-dataset differential analysis to rapid inter-dataset comparisons.
Kepler is a free, open-source software system for designing, executing, reusing, evolving, archiving, and sharing scientific workflows [10, 11]. It has been widely used in many areas of science that require sequential and parallel manipulation of large and complex datasets . The free availability, the capacity of multiplatform development, and the capability of utilizing existing packages or tools developed by R and Python make Kepler ideal for academic applications. In addition, Kepler presents workflows in a graphical format which promotes rapid comprehension of inputs, data flow and data processing . All components in Kepler are easily customizable, which leads to a fast and efficient extension and modification of the Kepler workflow. A new platform, “bioKepler”, developed for bioinformatics, will be available soon (bioKepler.org). Decades of available bioinformatics tools will be embedded in bioKepler and hence simplify the development and scalability of scientific workflows for bioinformatics in Kepler.
AltAnalyze is a Python-based open-source, cross-platform, free application for microarray and RNA sequencing analysis . It is packaged with an intuitive easy to use GUI, but has also been compiled on various platforms to enable users to run it in command line. AltAnalyze is a multi-functional software package that can summarize, organize and filter exon and junction-level data, calculate statistical scores for alternative splicing, alternative promoter selection or alternative 3′end processing, annotate regulated alternative exon events, and assess downstream predicted functional consequences at the level of protein domains, microRNA binding sites, and biological pathways (http://altanalyze.org). AltAnalyze has its own up-to-date relational database structure that stores the information of platforms, probesets, and species. Affymetrix library and annotation files are automatically identified and downloaded by AltAnalyze for a large variety of existing Affymetrix array platforms and species. AltAnalyze can also conduct data normalization and pathway analyses for a variety of other platforms (e.g. high content RNA sequencing, Illumina and Agilent). This software is also compatible with user-provided Affymetrix library files and supports the processing of non-Affymetrix normalized or raw tab-delimited expression files, including proteomics and metabolomics data.
The data resulting from running AltAnalyze include a series of text files that can be easily opened and further analyzed with standard spreadsheet programs. Graphical QC plots, hierarchical clustering heat maps, Principle Component Analysis (PCA) and pathway diagrams can be produced using the operating system-specific compiled versions of the source code. A large set of ontologies and other gene sets, including transcription factor targets, are available for GSE from AltAnalyze via the integrated GO-Elite analysis tool . The latest version of AltAnalyze incorporates a multi-threading technique, which improves the processing speed. With these rich features, AltAnalyze can conduct an analysis of microarray data comprehensively and efficiently [12, 13], even for those new to such analyses.
Briefly, this workflow consists of one DDF director, five customized actors, one loop control, and two display actors. Five string inputs are required: the number of targeted datasets, the location of the MAAMD workflow file, the location of the AltAnalyze executable file, the working directory, and the location of the file containing information for targeted datasets. The output is a folder with a fixed structure containing original data files, meta-analyses results by AltAnalyze, and the comparison results across the datasets. Once initiated, minimal user-interactions are required to complete the workflow.
Figure 2 indicates the detailed implementation of MAAMD workflow. The input information for the targeted datasets was parsed by module A, followed by the downloading and uncompressing of targeted data in module B. Then the information of individual datasets was parsed one by one in module C, controlled by an index control loop. The parsed information of an individual dataset was delivered to a composite actor module D, which conducts data quality estimation, data selection, normalization, and analyses. When the analyses of all datasets finished, a comparison among the results of these datasets was then executed in module E.
Among these modules, module D (Figure 2C), which analyzes individual datasets, is a central module. This module is a composite Kepler actor containing five basic actors. Briefly, the sample information in a single dataset was parsed in module A. The data was re-organized to facilitate the data identification and operation in module B. The data quality was then estimated by a composite actor, module C, whose structure was represented in Figure 2D. Based on the quality report, the user selects the qualified data in module D. The selected data was then normalized and analyzed by AltAnalyze in module E. When module E finishes its processing, it sends out a “Done” signal to trigger the next step.
Here are the brief introductions of each module in MAAMD.
The module “ReadSets” is the first module of MAAMD that is used to parse the dataset information contained in “datasets.csv” file. This module is an R-based Kepler actor. The input of this module is a string parameter that describes the location of csv file which is the parameter “DataFile” in Figure 2A. The output is the parsed information.
The module “DownloadData” is also an R-based Kepler actor that downloads and decompresses targeted datasets. The inputs of DownloadData are the parsed information from Module A and the path of work directory. Its output is the uncompressed.CEL files that are stored in a folder named as the assigned “Dataset” value in the work directory assigned by the parameter “WorkPath”.
The module “ReadSet” is designed to get the information for one dataset record, controlled by an index loop. The inputs of ReadSet are the parsed total information for all datasets listed in “datasets.csv” from Module A and a trigger signal from Module B. The output of ReadSet is the parsed information for one dataset.
The module “AAMD” is a composite actor containing 5 child actors. The inputs of AAMD are the parsed information from Module B, the location of AltAnalyze, and the working directory. The output of AAMD is a “Done” signal and a folder containing analyzed results. The following are the child actors involved in AAMD.
The actor “ReadCSV” reads information of samples in the targeted dataset including the original sample names and the corresponding customized names as well as group information. An AltAnalyze command is generated based on the input information. The inputs of ReadCSV are all inputs of AAMD module. The outputs are the parsed sample information and an AltAnalyze command.
The actor “ReorganizeCelFiles” renames the CEL files with its corresponding customized names. The inputs are the old file names, the new file names, and the location of files. The output of ReorganizeCelFiles is a “Done” signal.
“AnalysisPrepare” allows the users to select CEL files for further analysis based on the quality control report. The actor creates a “groups.txt” file describing group information and a “comps.txt” file describing the selected comparisons for further AltAnalyze analysis. A web page listing all possible comparisons between groups is created to allow users to select specific group comparisons. The inputs of AnalysisPrepare are parsed sample information and a trigger signal. The output is a “Done” signal.
“RunAltAnalyze” calls AltAnalyze by a command line and then analyzes the data in AltAnalyze. The inputs of RunAltAnalyze are the command line and the location of selected data. A “result” folder is created to store the analysis results in the folder named by its corresponding dataset name. The output is a “Done” signal.
The module “MultiComparison” is a composite actor to compare the differential gene expression among datasets across experimental conditions or species. The detailed structure of MultiComparison is shown in Figure 2B. This module is independent and can run without other modules as long as AltAnalyze has performed the analyses of targeted datasets. MultiComparison collects the analyzed results of targeted datasets and converts them into orthologous genes. The orthologous genes existing in all targeted datasets are summarized in a “ComparisonSets.txt” file. The inputs of MultiComparison are “datasets.csv”, the location of analyzed results, and the location of MAAMD workflows. A Kepler display pops up at the end of the analyses to show the summary of the comparison results or to remind the user if no gene orthologue can be found across multiple datasets.
An example of the csv file to describe targeted datasets
An example of the csv file to describe the samples in one dataset
The following software programs are required to run MAAMD: Kepler 2.4 or higher; Java run-time environment (which is a Kepler dependency); R 3.0.0 to support the R actors in Kepler; AltAnalyze 2.0.8 or higher. When running the source version of AltAnalyze, rather than compiled, additional dependencies are required as described here: http://code.google.com/p/altanalyze/wiki/StandAloneDependencies. Several Bioconductor R packages including GEOquery, AffyQCReport, and arrayQualityMetrics must be installed in advance. Detailed instructions with illustrations for how to set up a computer environment to run MAAMD are described in Supplement A.
A minimum of 1GB of RAM and at least a 2GHz CPU are required to run this workflow. The required disk space for the workflow outputs depends on the size of selected targeted datasets. Additional RAM (up to 8GB) and hard-drive space (up to 3GB free) are recommended for large microarray datasets or splicing-sensitive platform studies.
We conducted a case study of MAAMD to test the speed at which it was able to download, qualtity control, analyze and compare multiple expermiments across diverse species (mouse and Drosophila). Hypoxia, exposure of cells and tissues to O2 levels below ~5%, is widely associated with many diseases and its effects contribute to millions of clinical cases every year in the United States [14, 15]. The underlying conserved cellular responses to hypoxia are poorly understood. A study of orthologues responding to different hypoxic conditions or in different species could identify potential conserved genes and help further understand conserved cellular adaptations to hypoxia. Thus, we used MAAMD to identify transcriptional responses to hypoxia that are conserved in multiple hypoxia experiments and between mice and Drosophila.
Affymetrix Microarray data was downloaded from the NCBI Gene Expression Omnibus (GEO), a publicly available functional genomics data repository. We selected the Affymetrix datasets GSE9400, GSE12160, GSE14981, and GSE15879 consisting of profiling data from flies or mice exposed to a range of hypoxic or hyperoxic conditions. Specifics about each of these datasets can be found at the GEO website.
A summary csv file “datasets.csv” and four csv files for individual datasets “datainfo-gse9400.csv”, “datainfo-gse14981.csv”, “datainfo-gse15879.csv”, and “datainfo-gse12160.csv” were prepared based on the experimental description by the user as the input files of MAAMD. The input parameters were set as the locations of input files, AltAnalyze, MAAMD, and the total number of targeted datasets. Then, MAAMD was executed using a computer configured with Intel i7-3517U CPU 1.9 GHz, 8 GB RAM, 120 G disk space, and a Windows 8 64 bit OS.
The list of common regulated genes between datasets
Common regulated genes
GSE15879 v.s. GSE14981
A comparison between chronic and acute hypoxia in flies
Hsp26, HSPA1A, GstD1, CG14120
CG3734, CG13607, Cyp4d2
GSE15879 v.s. GSE12160
A comparison between chronic hypoxia and chronic hyperoxia in flies
Hsp26, CG14120, RGN, CG31300, LOC423786
Lsp1beta, CG15766, Cyp4d2, CG5897, HPGD, Lsd-1, si:dkey-7814.10
GSE15879 v.s. GSE9400
A comparison of chronic hypoxia in mice and flies
GSE15879 v.s. GSE14981 v.s. GSE12160
A comparison of chronic hypoxia, chronic hyperoxia and acute hypoxia in flies
MAAMD generated about 850 Mb of results including the downloaded data for above four datasets. Visualizations such as heatmaps for clustering analysis, pathway maps for pathway analysis were generated based on the resulting values automatically. The detailed values along with statistical evaluation were stored in files in “ExpressionOutput” folder and “GO-Elite” folder of each dataset. Moreover, we were able to make multiple comparisions across all datasets quickly and automatically.
The results indicate highly conserved gene responses to whole-animal exposure to hypoxia or hyperoxia in all datasets. GSE9400 contained profiles of soleus muscles from C57BL/10 mice exposed to gradient 8% O2 for 2 weeks revealed 2293 differentially expressed probesets (>2 fold change up or down, p < 0.05). GSE15879 contained profiles of flies exposed to a decreasing levels of hypoxia for 16 days displayed 201 probesets using the same fold and P-value cutoffs. In the two drosophila datasets, 28 probesets were signficantly changed in GSE14981 (flies exposed to 1% O2 for 2.5 hours) and 109 probesets were changed in GSE12160 (hyperoxia-selected flies).
Given the related yet significantly different experimental conditions, inter-set comparisons were still able to identify conserved gene expression responses between different inter-set comparisons. Table 3 summarizes the comparisons made and corresponding findings. Based on the results, genes Hsp26 and CG14120 are conserved in responding to chronic hypoxia, acute hypoxia and chronic hyperoxia in drosophila. This result is consistent with the knowledge that heat shock proteins encoding Hsp genes are conserved proteins responding to diverse environmental stresses . Little is known about CG14120, suggesting that using this meta-analyis in MAAMD can identify novel conserved functionally-uncharacterized genes that are regulated by divergent yet related (hypoxia vs hyperoxia) stress conditions. Depending on the comparisons made, MAAMD was able to rapidly identify subsets of genes that were regulated both in the same directions and in opposite directions across different hypoxic stresses, tissues and species very quickly (Table 3).
In summary, these results highlight the orthologue comparison function of MAAMD, the dramatic savings in time to properly conduct a cross-species meta-analysis and identify highly conserved gene responses.
The “MAAMD” workflow was developed and demonstrated to work efficiently for Affymetrix microarray data, the predominant plaftorm currently represented in GEO. Because of its design in Kepler, it can be easily adapted to analyze other platforms. This workflow improves the effeciency of microarray analysis and significantly minimizes the required computational and bioinformatic knowledge for the user, especially once installed. The extended function, which compares the analyzed results from multiple microarray datasets across species, further simplifies the work required by the user and significantly broadens its utility.
MAAMD improves the analysis efficiency by linking together previously separate tasks of data flow and avoiding the repeated memory or CPU resource allocation since the resources are assigned once it starts to run. Furthermore, the embedded tools such as AltAnalyze, R-based comparison codes are efficient in data processing. For example, the cross-species analysis between the datasets took only 3 minutes, which could be a very complicated and time-consuming process. The factors which affect MAAMD running time include the size of dataset, the type of species being analyzed and internet speed. The size of the dataset being downloaded and analyzed has a large effect on workflow execution time at almost every step from data downloading to data quality control and data analysis. The speed of the internet connection can significantly impact the time required for data downloading considering a single Affymetrix CEL file is on average ~10 MB in size. The species analyzed will also have an impact on the performance time, as a result of genome complexity and degree of annotation.
As an automated workflow, MAAMD avoids random operational mistakes such as a misclassfication of data files or erroroneous group comparisons due to simple human error. MAAMD minimizes the required user interaction by only occassionally pausing to recieve user inputs. To improve the flexibility and feasbility of MAAMD, there are several user interactions included in this workflow, such as the selection of data files before and after quality control. Some microarray datasets contain CEL files acquired by different platforms. This usually causes errors if a user analyzes them as a single batch. Data selection for further analysis after sample QC and proper selection of similar control types between datasets should avoid these kinds of errors. All of the required user interactions appear to the users as a web page with listed options which are direct and easy to understand and control.
The inputs of MAAMD are csv files summarizing the information of targeted datasets. Since the inputs are text-based tables and selectable options in lists, use of MAAMD does not require advanced computational skills. Additionally, MAAMD provides a standard and easily modifyable workflow for genetic data analysis such as RMA normalization, differential expression analysis, and clustering analysis with its built-in tools, Bioconductor and AltAnalyze.
Another important feature of MAAMD is that it can identify and download the required library files and annotation files automatically. This feature was inherited from its embedded tool, AltAnalyze. There are many microarray platforms which have different probeset libraries. Thus, matched library and annotation files are required to analyze a microarray dataset when starting with raw data files. It takes time for the users to find the proper resources, to understand the library and annotation files, and to interpret the raw microarray data with the matched library and annotation files. MAAMD automatically downloads library files and generates results with gene symbol and pathway information which can be easily understood. Moreover AltAnalyze is able to automatically download the most current gene ID information, pathway and ontology databses with just a few clicks and supports every species in these analyses for which Affymetrix provides a microarray for.
As a Kepler-based workflow, MAAMD can be easily extended or modified in Kepler. For example, MAAMD can be easily separated into several independent workflows which are available in the website http://www.biokepler.org/use_cases/maamd-workflow-standardize-meta-analyses-affymetrix-microarray-data. The workflow “MAAMD-Comparison” separates “DataComparison” module from MAAMD and can run as an independent workflow as long as the analyzed data are ready. The workflow “MAAMD-Download” allows users to download all targeted datasets independently from the data analysis. The workflow “MAAMD-AltAnalyze” can run the data analysis for the downloaded data independenty. These separated workflows allow users to use MAAMD in a more flexible way, as long as the users ensure the correct inputs for each workflow. Users who have programming skills can readily add more modules into the workflow or modify the workflow according to their own requirements.
Currently, MAAMD works for Affymetrix microarray datasets. It can be expanded to support more microarray datasets such as Roche and Illumia as both AltAnalyze and arrayQualityMetrics are able to analyze such data platforms. Since AltAnalyze supports the meta-analyses of aligned junction and exon sequences, this makes it possible to extend the workflow for the meta-analyses of RNA sequencing data, integrating it with the proper pre-processing tools for sequencing data. However, due to the many differences in analysis steps between microarray data and sequencing data, significant additional modifications would be required to extend MAAMD for sequencing data, e.g. base-level quality control, adapter trimming, alignment, alignment quality control and exon bed file generation.
There are many important caveats that must be considered when drawing conclusions from meta-analyses and inter-dataset comparisons that can compound false-discovery rates . A number of factors to consider are biologically related (e.g. intra-group variability), technically related (platform specific biases, sample preparation biases, missing probe IDs ect…) and those that arise from increasing the numbers of repeated measures and sample sizes. While it is impossible to correct for many of these, we have designed the workflow in such a way to enable the user to at least be aware of potential issues related to some of these. The quality control measures were designed to flag issues with samples that would cause major intra- and inter-group variability. AltAnalyze can also conduct intra-dataset COMBAT analysis, which employs a both parametric and non-parametric empirical Bayes frameworks to remove batch effects . Missing values for orthologues can be expected when conducting meta-analyses and approaches for imputing these values, if necessary, could be explored . Moreover, one must always be wary of drawing extensive conclusions from experiments that do not contain experimental replicates, however there are strategies for dealing with these issues . One advantage of our approach is that MAAMD defines differentially expressed genes within each experiment first before making inter-dataset comparisons. In our experience this serves to control well for many technically related biases such as biases introduced by array type and sample preparation.
MAAMD standardizes, simplifies, and dramatically decreases the amount of time required to analyze Affymetrix microarray datasets and to make intra- and inter-dataset comparisons. The minimized computer skills and bioinformatics knowledge required to run MAAMD makes it an attractive tool for biologists with limited programming skills and bioinformatics background. The extensibility of MAAMD means it could be a solid starting point for those researchers who have advanced programming skills and want to expand and/or modify the workflow for additional purposes.
Operating system: Windows and Mac OSX.
Requirements: R and AltAnalyze are required. The installation instructions and application instance are available in above website.
Other requirements: Internet connection.
License: Free for non-commercial and academic use.
Thanks to San Diego Supercomputer Center for their support of this research. Thanks to Jeff V. Dorn for testing MAAMD on MAC. This project was funded in part by the following NIH grants: 8 P41 GM103426, P01 HL098053, 1R01HL105242, iDASH-U54 HL108460, UL1TR000100 and American Heart Association 10SDG2630130. BioKepler is funded by NSF DBI-1062565 under CI Reuse and Advances in Bioinformatics programs.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited.