The IronChip evaluation package: a package of Perl modules for robust analysis of custom microarrays

Background Gene expression studies greatly contribute to our understanding of complex relationships in gene regulatory networks. However, the complexity of array design, production and manipulation is a limiting factor that affects data quality. The use of customized DNA microarrays improves overall data quality in many situations, but only if analysis tools are available for these specifically designed microarrays. Results The IronChip Evaluation Package (ICEP) is a collection of Perl utilities and an easy-to-use data evaluation pipeline for the analysis of microarray data, with a focus on the data quality of custom-designed microarrays. The package has been developed for the statistical and bioinformatical analysis of the custom cDNA microarray IronChip but can be easily adapted to other cDNA or oligonucleotide-based microarray platforms. ICEP uses decision tree-based algorithms to assign quality flags and performs robust analysis based on chip design properties regarding multiple repetitions, ratio cut-off, background and negative controls. Conclusions ICEP is a stand-alone Windows application to obtain optimal data quality from custom-designed microarrays and is freely available here (see "Additional Files" section) and at: http://www.alice-dsl.net/evgeniy.vainshtein/ICEP/

Background DNA microarrays are a popular high throughput technique to perform genome-wide molecular and genetic experiments. The method requires computational postprocessing of primary data, which can be a challenging task due to high data variability. Variability results from each of the many steps involved in such an experiment [1]. Computational data processing is required for image processing, extraction of raw data, storage and normalization of the raw data, feature extraction, final data analysis and biological interpretation of the results. Several software packages are available to perform the described tasks but it is frequently necessary to develop custom-made solutions to fulfil individual requirements and appropriate statistical evaluation of gene array data [2,3].
Experiments on whole-genome arrays, such as Affymetrix GeneChips, are highly reproducible but are still unable to provide reliable data for all genes, especially for those with low expression levels. Moreover, there are technical limitations on array design. For example, standard glass slide arrays cannot accommodate more than 60,000 spots, including all controls and replicates. This is often not sufficient for complex genomes such as the human genome. The relationship between fluorescent signal intensity and gene expression levels is linear only for a certain range of concentrations of spotted material. Thus differences in the linearity range can become more pronounced for larger whole-genome arrays [4]. In contrast, customized microarrays contain a selection of genes. This allows the inclusion of more replicates and controls. Smaller gene numbers and higher numbers of repetitions increase the reliability of the data obtained from each microarray experiment, especially for those genes that are expressed at low levels.
IronChip is a cDNA microarray platform specifically designed to analyze genes related to iron metabolism. We have developed two types of this platform to analyze both human and mouse genes [5,6]. The design of this microarray enables detection of small but physiologically significant changes in gene expression, due to the high number of repetitive features. The current version of the mouse IronChip contains 520 genes involved in iron homeostasis and related pathways. To improve array sensitivity and data robustness, each gene on the array is represented by several ESTs. Each EST, in turn, is represented by a minimum of six spots. Some of the most relevant iron-related genes are represented by up to 24 spots. This microarray further contains a collection of negative controls, specificity controls and positive (spike-in) controls [6]. Custom microarray platforms, such as the IronChip, provide more data than required or exploited in standard statistical analysis. To incorporate all the advantages such a chip design offers for data analysis we developed the IronChip Evaluation Package (ICEP). ICEP makes use of the high number of repetitions to improve data quality. The comparison of different ESTs enables reliable detection of transcript-specific regulation (e.g. alternative splicing variants of the same gene). Analysis of the positive and negative controls allows precise calculation of a reliable ratio cut-off and estimation of background noise, respectively.

Implementation
ICEP consists of a collection of Perl programs and utilities with a Perl Tk GUI (graphical user interface). The Perl routines were all newly written for the purpose of rapid and solid microarray data analysis. ICEP features a decision tree-based algorithm that optimizes spot selection and exploits, in particular, the multiple repetitions of ESTs.
ICEP applies grouping rules in its decision tree algorithm to calculate signal intensity ratios for each individual group of ESTs representing the same transcript (this is explained step by step in the application of the analysis pipeline together with supporting online material http://www.alice-dsl.net/evgeniy.vainshtein/ICEP/ICEP_manual.html). The pipeline is summarized in Figure 1A. ICEP does not use or require any existing software libraries and it can directly process simple tab-delimited tables of array data of any type. It adds its optimized spot selection, filtering and normalization procedure to standard software such as Bioconductor [3] and can be used in combination with such software or equally well alone. ICEP uses the output files of such software (generic tab-delimited text tables containing the normalized signal intensity and background data) for further analysis.
ICEP is packaged as a Windows executable with a PDK (Perl developer kit). It can be operated in a command line mode or in a batch mode for the analysis of multiple arrays. A simple editor allows the user to specify all microarray data files for batch analysis. In command line mode the ICEP analysis core itself can be executed under any operating system supporting Perl. Some Windows-specific features, such as the export of output data to an Excel table or the Perl Tk GUI, require adaptation to the specific operating system. In general, the modular structure of ICEP allows porting it to any other operating system supporting Perl.

Data formats
ICEP recognizes any generic tab-delimited text table from any type of gene microarray containing the normalized signal intensities and background data (e.g. from the ChipSkipper application; [7]). ChipSkipper generates a tab-delimited text table containing raw and normalized signal intensity values, background signal intensity, physical coordinates of a spot relative to the upper left corner of the glass slide, cDNA sample position on the original PCR spotting plate (row, column, plate number), some flags and statistical values related to the spot geometry, and other internal values. ICEP uses only a few of those columns: spot and clone coordinates, comments, and the background-compensated and normalized signal intensity values from both channels. The built-in utility recognizes different formats of the ChipSkipper output file, and any generic tab-delimited text file of gene expression array data can be processed by ICEP; alternative input file formats are added using the provided flexible configuration tool. Results are saved in tab-delimited or Microsoft Office Excel format.
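To make the input format concrete, the following minimal Python sketch reads a generic tab-delimited table and keeps only a few columns. It is an illustration only and not ICEP's Perl code; the column names `spot_id`, `clone_id`, `cy5_norm` and `cy3_norm` are hypothetical and would in practice be mapped to the actual export through a configuration step.

```python
import csv
import io


def read_spots(handle,
               id_cols=("spot_id", "clone_id"),
               value_cols=("cy5_norm", "cy3_norm")):
    """Read a tab-delimited table and keep only the requested columns.

    All column names are hypothetical placeholders, not ChipSkipper's
    real headers.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    header = reader.fieldnames or []
    wanted = list(id_cols) + list(value_cols)
    missing = [name for name in wanted if name not in header]
    if missing:
        raise ValueError("missing columns: " + ", ".join(missing))

    spots = []
    for row in reader:
        spot = {name: row[name] for name in id_cols}
        # Intensity columns are converted to numbers for later analysis.
        spot.update({name: float(row[name]) for name in value_cols})
        spots.append(spot)
    return spots


# Small in-memory example with one extra column that is ignored.
example = (
    "spot_id\tclone_id\tcy5_norm\tcy3_norm\tcomment\n"
    "1\tEST_A\t1200.5\t600.0\tok\n"
    "2\tEST_A\t1100.0\t580.0\tok\n"
)
spots = read_spots(io.StringIO(example))
```

Columns that are not requested (here `comment`) are dropped, which mirrors the idea that only a small subset of the exported columns is needed for the evaluation.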

Performance
We tested ICEP performance by measuring the time required to analyze microarray data from different mouse IronChip versions (version 2.0 contains 559 transcripts, while version 7.0 contains 932 transcripts) (Figure 2A) or to analyze a set of virtual arrays (1000 to 9000 features, in steps of 1000 features). On average, ICEP evaluated 208 features per second. The time per run increases linearly with the number of analyzed features (Figure 2B).

Results and Discussion
User interface
ICEP has been developed as a stand-alone application and does not require any special environment. It runs under any Windows operating system (it was tested on Windows 2000, XP and Vista). The interface has been designed to be highly user-friendly and interactive. It helps the user to apply all analyses while hiding the complexity of the underlying statistical methods. The interface provides easy access to different layers of microarray analysis, including single array analysis, dye-swap analysis and the generation of a final report. A step-by-step user manual is available in additional file 1 and at: http://www.alice-dsl.net/evgeniy.vainshtein/ICEP/ICEP_download.html

Application of the analysis pipeline
The ICEP data analysis package was designed to be both highly flexible and user-friendly. Data analysis involves an analysis pipeline that is composed of three elements: single array analysis, dye-swap experiment analysis and final report generation (Figure 1B). In addition, the pipeline can be executed either directly or in a batch analysis mode.
In our application example, we utilized the mouse IronChip microarray that contains 520 genes as well as positive and negative controls. The controls are represented by 1400 spike-in control spots (positive controls) and 2400 background control spots (negative and specificity controls). The 520 genes are represented by 880 ESTs, with 2 or more ESTs per gene. Each EST was spotted on the array at least 6 times. In total, 5400 spots are located on the array.
At the first level of analysis (single feature level), ICEP performs a logarithmic transformation of the data and separates background and control spots from the rest of the data. ICEP then calculates a background cut-off value based on the median signal intensities of all background and negative control spots, and an intensity ratio cut-off value based on signals from the spike-in controls. Intensity ratios of all remaining genes are calculated as well. At the same time, ICEP performs a feature extraction procedure, whereby all repetitive features representing the same EST are grouped together. After calculating the background and ratio cut-off values ICEP assigns the following flags: (1) the P-call flag (true positive call), which is based on a comparison of the signal intensity of each channel with the background cut-off value (Table 1); (2) the regulation flag (significant difference in gene expression between two channels), which is based on a comparison of the signal ratio to the ratio cut-off value (Table 2).
At the second level of analysis (EST level) ICEP assigns further flags to ESTs and estimates the data quality based on the flags recorded at the single feature level. EST P-call flags are calculated by ICEP according to the rules given in Table 3. The definition of a P-call flag at the EST level is based on the P-calls of the individual features. The corresponding threshold is set to 60%, because a control microarray experiment (hybridization of hemin- and desferrioxamine-treated HeLa cells) shows results similar to published data only when the EST P-call threshold is about 60%. A significant increase of the EST P-call threshold causes additional false negative results, while a decrease of this value causes additional false positive results. The comparison of average or median signal intensity ratios to the previously calculated ratio cut-off value yields UP/DOWN/NONE flags, similar to the flag calculations described above.
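The flagging steps at the feature and EST level can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not ICEP's Perl implementation: the exact rules are given in Tables 1 to 3 and are not reproduced here, so the form of both cut-off values, the "M" call for a single channel above background, the "A" label and all function names are assumptions made for this example. Only the 60% EST threshold and the UP/DOWN/NONE flags are taken from the text.

```python
import math
from statistics import median


def background_cutoff(background_intensities):
    """Assumed form: median log2 intensity of background/negative controls."""
    return median(math.log2(v) for v in background_intensities)


def ratio_cutoff(spike_cy5, spike_cy3):
    """Assumed form: largest absolute log2 ratio among spike-in controls."""
    return max(abs(math.log2(a / b)) for a, b in zip(spike_cy5, spike_cy3))


def feature_p_call(cy5, cy3, bg_cut):
    """Assumed rule: 'P' if both channels exceed the background cut-off,
    'M' if exactly one does, 'A' (hypothetical label) if neither does."""
    above = sum(math.log2(v) > bg_cut for v in (cy5, cy3))
    return {2: "P", 1: "M", 0: "A"}[above]


def regulation_flag(cy5, cy3, ratio_cut):
    """Compare the log2 ratio of the two channels with the ratio cut-off."""
    log_ratio = math.log2(cy5 / cy3)
    if log_ratio > ratio_cut:
        return "UP"
    if log_ratio < -ratio_cut:
        return "DOWN"
    return "NONE"


def est_p_call(feature_calls, threshold=0.60):
    """EST-level call: 'P' when at least `threshold` of the features are 'P'."""
    share = feature_calls.count("P") / len(feature_calls)
    return "P" if share >= threshold else "A"


# Invented example values for one EST spotted six times.
bg = background_cutoff([100, 120, 90, 110, 105])
rc = ratio_cutoff([1000, 1100, 950], [1000, 1000, 1000])
features = [(800, 200), (760, 210), (820, 190),
            (95, 180), (790, 205), (810, 60)]
calls = [feature_p_call(cy5, cy3, bg) for cy5, cy3 in features]
est_call = est_p_call(calls)
```

In this invented example four of the six features receive a "P" call, which is above the 60% threshold, so the EST as a whole is called "P".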
At the EST level ICEP calculates the relative error: the ratio between the standard deviation and the average of the signal intensity ratios of all features representing a single EST. At the transcript level ICEP uses the relative error as a measure of the reliability of technical and biological replicates.
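The relative error defined above is the standard deviation divided by the mean (the coefficient of variation). A minimal Python sketch follows; whether ICEP uses the sample or the population standard deviation is not stated, so the sample form used here is an assumption, and the function name is hypothetical.

```python
from statistics import mean, stdev


def relative_error(ratios):
    """Standard deviation divided by the mean of the signal intensity ratios.

    Uses the sample standard deviation (an assumption; the text does not
    say which estimator is used).
    """
    if len(ratios) < 2:
        raise ValueError("at least two replicate ratios are required")
    average = mean(ratios)
    if average == 0:
        raise ValueError("mean ratio is zero; relative error is undefined")
    return stdev(ratios) / average


# Invented ratios for the six replicate spots of one EST.
rel_err = relative_error([2.0, 2.2, 1.8, 2.1, 1.9, 2.0])
```

For these invented values the relative error is about 0.07, meaning the replicate ratios scatter by roughly 7% around their mean.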
Before transcript-level analysis, ICEP analyzes whether any bias has occurred as a consequence of the dye (Cy5- or Cy3-labelled nucleotides) incorporated into the hybridization probes. To avoid such dye bias in two-colour microarray hybridizations, the experimental and the control sample are routinely labelled with Cy5- and Cy3-labelled nucleotides, respectively, and vice versa (dye swap). Such analysis avoids inconsistent signal intensity ratios that are artefacts of the dye incorporated into the hybridization probe. Depending on whether an EST shows a similar average signal intensity ratio within the dye-swap data set, ICEP defines a dye-swap reliability flag (Table 4).
At the transcript level ICEP applies grouping rules to calculate signal intensity ratios for a group of ESTs representing the same transcript. For this purpose ICEP uses all quality flags described before. At this level ICEP decides whether to average signal intensity ratios from different ESTs to a single value, to treat each EST as a separate transcript, or to mark the complete set of ESTs as non-reliable. ICEP is able to analyze and group values from up to six similar ESTs representing a single gene (six is the maximum value in the current version of the IronChip microarray). To illustrate the grouping procedure, a scheme based on 2 ESTs representing a single gene is presented in Figure 3. Table 5 lists the possible flag combinations.
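The two decisions described here can be sketched in Python. This is a deliberately reduced illustration and not ICEP's rule set: the dye-swap check covers only the clearest case (a detected EST whose regulation flags are mirrored between the two hybridizations), and the grouping function uses an assumed rule (average when all reliable ESTs agree in direction, otherwise keep them separate). The actual flag combinations are those of Tables 4 and 5; all function names and dictionary keys are hypothetical.

```python
from statistics import mean


def dye_swap_reliable(p_call, flag_first, flag_swapped):
    """True when the EST is detected ('P' or 'M') and is UP in one
    hybridization and DOWN in the dye-swapped one."""
    detected = p_call in ("P", "M")
    mirrored = {flag_first, flag_swapped} == {"UP", "DOWN"}
    return detected and mirrored


def group_ests(ests):
    """Assumed grouping rule for the ESTs of one gene.

    Each EST is a dict with the hypothetical keys 'ratio', 'flag' and
    'reliable'. Returns a (decision, ratios) tuple.
    """
    reliable = [est for est in ests if est["reliable"]]
    if not reliable:
        # No trustworthy EST: mark the complete set as non-reliable.
        return ("UNRELIABLE", [])
    if len({est["flag"] for est in reliable}) == 1:
        # All reliable ESTs agree: average them to a single value.
        return ("AVERAGED", [mean(est["ratio"] for est in reliable)])
    # Disagreeing ESTs are kept apart, e.g. possible transcript variants.
    return ("SEPARATE", [est["ratio"] for est in reliable])


# Invented example: two ESTs of one gene that agree in direction.
decision = group_ests([
    {"ratio": 2.0, "flag": "UP", "reliable": True},
    {"ratio": 3.0, "flag": "UP", "reliable": True},
])
```

Keeping disagreeing ESTs separate rather than averaging them preserves possible transcript-specific regulation, which is the motivation the text gives for comparing different ESTs of the same gene.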

Validation
We applied the ICEP software to analyze IronChip microarray data. We selected a previously reported experiment that analyzes iron-loaded (hemin-treated) and iron-deficient (Desferrioxamine-treated) HeLa cells. Cellular iron overload or deficiency caused the expected changes in gene expression [6,8]. Table 6 represents the data of a dye-swap experiment. Application of ICEP reveals the previously reported and experimentally validated changes in mRNA expression of genes such as Tfrc (Transferrin receptor 1), Slc11a2 (NRAMP2; DMT1: Natural resistance-associated macrophage protein 2; Divalent metal transporter 1) and Ftl (Ferritin light chain; Ferritin L subunit).
The complete data set (additional file 2) and ICEP installation package (additional file 3) are available in the "Additional Files" section and on-line at the web page.

Comparison to other microarray analysis software
We compared the performance of ICEP to other genome array tools such as Bioconductor [3], GeneSpring (Agilent, Santa Clara, CA) [9] and a web-based tool called "Expression Profiler" [10]. While GeneSpring contains a large collection of array analysis tools, it is inferior in processing speed. Bioconductor is faster than ICEP (about 20% in running time) but, like GeneSpring, by default has limitations in processing multiple repetition features compared to ICEP. Using Bioconductor and the R programming language it is possible to implement any feature recognition and statistical algorithm, but this requires additional coding work and tests. Expression Profiler provides a very convenient interface and world-wide access to data analysis, but has a limited number of statistical instruments due to its web-based nature. In general, most of the available software does not offer appropriate feature extraction from different multiple spotting regimes or recognition of feature groups. GeneSpring and Bioconductor support averaging of replicated features but do not support grouping rules for the analysis of repetitive ESTs.

Table: Reliability flag. ICEP determines a reliability flag by evaluating the P-call and regulation flags of ESTs. The reliability flag is used by ICEP to distinguish reliable from unreliable expression changes.
Reliability flag TRUE: EST shows "P" or "M" p-calls and is UP-regulated in the Cy5 and DOWN-regulated in the Cy3 experiment, or vice versa.
Reliability flag TRUE: EST shows "P" p-call and is UP-regulated or DOWN-regulated in the Cy5 experiment, while the Cy3 experiment shows a tendency towards the correct direction based on the ratio cut-off value, or vice versa.
Reliability flag (value not given): EST shows "P" or "M" p-call, but both experiments show NONE-regulated, or one is UP-regulated or DOWN-regulated and the other is NONE-regulated, with a tendency of regulation towards the correct direction based on a ratio cut-off.

Conclusion
We introduce a new flexible microarray analysis tool named ICEP, optimized for robust statistical analysis of specialized custom cDNA or oligonucleotide microarrays. Our analysis yields values for ratio cut-off, background noise, multiple repetitions and detailed feature extraction as well as grouping rules. ICEP is easily extended to support further input and output data formats, and different data transformation steps can be added. ICEP allows rapid post-processing of microarray data on a user-friendly platform. Software, example data and a tutorial are open source, free and available for download.