A main objective of functional genomics is to understand the relationship between genes and molecular functions. Researchers try to elucidate these relations through the systematic investigation of gene activity, for instance by using the microarray technique, which enables the monitoring of all the genes during the cell phases or during the response to an external stimulus [1]. The central dogma of molecular biology links DNA, RNA, and proteins in a close relation, so the investigation of RNA helps elucidate the functions of DNA. DNA microarrays enable the investigation of the activity of genes in different conditions (e.g. at different time points or under different drug concentrations). Recent microarray chips, such as the Affymetrix [2] Human Gene 1.0 ST, enable the simultaneous investigation of more than 33,000 genes. Such technology uses a single chip to monitor the activity of a set of genes through the investigation of the mRNA. Each chip uses a large number of probes to bind the mRNA of the biological sample under investigation. Each probe is marked with a fluorescent colour, so the fluorescence intensity reflects the amount of mRNA present in the sample. Usually, microarrays employ a redundant number of probes in order to minimise the experimental error in intensity measurement. Thus the value of fluorescence intensity associated with each gene has to be deduced from the values of all the associated probes.
The first output of a microarray experiment is an image in which pixel intensities are related to gene expression values. Images are then encoded into numerical data by using proprietary tools that extract the regions corresponding to probes and convert their pixel intensities into numerical values. The resulting files usually use a proprietary format defined by the microarray vendor and are not automatically readable by existing analysis platforms; thus the analysis of microarray data requires a preprocessing step before any further analysis [3–7]. Preprocessing involves several phases, including denoising, background correction, normalization, and summarization, i.e. the computation of a single gene expression value obtained by combining the values of the corresponding probes [8, 9].
For instance, let us consider the analysis of microarray data derived from Affymetrix chips with the TM4 [10] platform. This platform is not able to directly manage the Affymetrix CEL files, so the user has to perform some steps manually, employing external tools such as the Affymetrix Expression Console [2] or the Affymetrix Power Tools (APT) [2], or third-party software such as RMAExpress [11]. The process starts with the summarization phase, i.e. converting images into numerical data. This phase combines multiple probe intensities into a single expression value. As introduced above, all arrays employ more than one probe for each gene. Summarization takes into account all of the probes for the same gene and averages them, enhancing the signal-to-noise ratio. Summarization requires the use of proper libraries that store the association between pixels and probesets. Such libraries are provided by Affymetrix as Chip Description Files (CDF).
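The idea of combining several probe intensities into one value can be illustrated with a minimal sketch. It assumes that probe intensities are already available as (probeset identifier, intensity) pairs; the function name, identifiers, and values are hypothetical. The plain median of log2 intensities used here is a deliberate simplification and is not the RMA or PLIER algorithm.

```python
import math
from collections import defaultdict
from statistics import median


def summarize_probes(probe_values):
    """Combine the intensities of probes in the same probeset into one value.

    probe_values: iterable of (probeset_id, intensity) pairs, intensity > 0.
    Returns a dict mapping probeset_id to the median of the log2 intensities.
    """
    grouped = defaultdict(list)
    for probeset_id, intensity in probe_values:
        grouped[probeset_id].append(math.log2(intensity))
    return {pid: median(values) for pid, values in grouped.items()}


# Hypothetical probe intensities for two probesets.
probes = [
    ("ps1", 64.0), ("ps1", 128.0), ("ps1", 4096.0),
    ("ps2", 256.0), ("ps2", 1024.0),
]
summary = summarize_probes(probes)
```

In this toy input the third probe of ps1 is an outlier, and the median keeps it from dominating the summarized value, which is the motivation for using robust estimators over a plain mean.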
For Affymetrix arrays, summarization is usually done together with normalization, using the same Affymetrix libraries, which store the topological information about probes. Different summarization algorithms exist for expression arrays, such as the Robust Multi-array Average (RMA) [11] and the Probe Logarithmic Intensity Error (PLIER) [12], both included in the Affymetrix tools. Moreover, for exon arrays, the Detection Above BackGround (DABG) [13] method is also used to generate a detection metric to enhance the detection of background noise.
The simplest approach for normalizing microarray data is to re-scale each expression value of a dataset. There exist two main approaches for normalization: the quantile algorithm [14] and the sketch-quantile algorithm, which requires fewer computational resources.
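The basic quantile idea can be sketched as follows: every chip is forced to share the same distribution of values, obtained by averaging, rank by rank, the sorted values of all chips. This is an illustrative sketch only; it is neither the APT implementation nor the sketch-quantile variant, the function name and data are hypothetical, and ties are broken by position for simplicity.

```python
def quantile_normalize(arrays):
    """Quantile-normalize a list of equally long arrays (one list per chip).

    Each value is replaced by the mean, across chips, of the values that
    hold the same rank. Ties are broken by position (a simplification).
    """
    n = len(arrays[0])
    if any(len(a) != n for a in arrays):
        raise ValueError("all arrays must have the same length")
    sorted_arrays = [sorted(a) for a in arrays]
    # Reference distribution: mean of the k-th smallest value of every chip.
    reference = [
        sum(s[k] for s in sorted_arrays) / len(arrays) for k in range(n)
    ]
    normalized = []
    for a in arrays:
        # Indices of the chip's values, ordered by increasing value.
        order = sorted(range(n), key=lambda i: a[i])
        out = [0.0] * n
        for rank, idx in enumerate(order):
            out[idx] = reference[rank]
        normalized.append(out)
    return normalized


# Two hypothetical chips with three expression values each.
chips = [[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]]
result = quantile_normalize(chips)
```

After normalization both chips contain exactly the same set of values, assigned according to the original ranking within each chip.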
For instance, the following command line shows the use of APT to normalize and summarize an Affymetrix dataset:
apt-probeset-summarize -a rma -d HuEx-1_0-st-v2.cdf -o /home/output --cel-files /home/list.txt
In particular, the -a rma option selects the RMA summarization algorithm, the -d HuEx-1_0-st-v2.cdf option specifies the Human Exon 1.0 ST library, while the -o and --cel-files options specify, respectively, the output folder and the input file list (where list.txt contains the list of input CEL files).
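A tool that wraps APT could assemble and launch this command programmatically. The following is a minimal sketch under stated assumptions: the helper names build_apt_command and run_apt are hypothetical, the double-dash spelling --cel-files is an assumption about the intended flag, and the paths are the ones used in the example above.

```python
import subprocess


def build_apt_command(cdf_file, out_dir, cel_list, algorithm="rma"):
    """Return the argument list for an apt-probeset-summarize call."""
    return [
        "apt-probeset-summarize",
        "-a", algorithm,          # summarization algorithm
        "-d", cdf_file,           # chip library (CDF)
        "-o", out_dir,            # output folder
        "--cel-files", cel_list,  # text file listing the input CEL files
    ]


def run_apt(cmd):
    """Run the command; raises CalledProcessError on a non-zero exit code."""
    return subprocess.run(cmd, check=True, capture_output=True, text=True)


cmd = build_apt_command("HuEx-1_0-st-v2.cdf", "/home/output", "/home/list.txt")
# run_apt(cmd) would launch APT, assuming it is installed and on the PATH.
```

Passing the arguments as a list rather than as a single shell string avoids quoting problems when file paths contain spaces.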
After summarization and normalization, the user has to associate each expression value with the related gene and possibly with some further biological annotation, such as the gene symbol or information extracted from Gene Ontology [15]. Annotation files are often provided by the chip manufacturer and contain different levels of annotation, e.g. database identifier, description of molecular function, associated protein domains. It should be noted that not all preprocessing tools allow the annotation of gene expression values. Finally, the preprocessed data, organized in a suitable data structure (e.g. a comma-separated value table), can be read and analyzed by the TM4 platform. The main drawbacks of such an approach are: (i) the need to generate and store intermediate files manually, which prevents the automation of the process; (ii) the need to know the details of the chips used and of the related preprocessing tools and libraries; (iii) the need to manually download the most up-to-date summarization and annotation libraries from the vendor website; and (iv) the need to manually import preprocessed files into the analysis platforms, which may introduce errors. Automating the preprocessing pipeline could speed up the entire analysis process and reduce possible errors, allowing the user to concentrate on biological aspects. Starting from this scenario, and in order to enable the sharing and cross-comparison of microarray results from different platforms and laboratories, we propose a software tool to automate the preprocessing of raw microarray data.
The proposed tool, named μ-CS, preprocesses Affymetrix microarray data in order to simplify and automate their summarization, normalization, and annotation. μ-CS is based on a distributed architecture and uses the client/server model. The μ-CS server is in charge of tracking the versions of the libraries that are used to preprocess array data and made available by microarray vendors, allowing transparent access to the correct and most up-to-date versions of the preprocessing libraries. The μ-CS client implements the preprocessing methods by wrapping the Affymetrix APT preprocessing tools and implements the annotation process by joining summarized data with annotation libraries. The main idea of μ-CS is to reduce the amount of information the user must provide during data preprocessing, by embedding chip and software details into the μ-CS databases. Examples of such details are the chip type and version, the software version, the link from which to download the latest version of the Affymetrix Power Tools and annotation libraries, and which chip library and annotation library need to be used for a given chip.
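The annotation step, i.e. joining summarized data with an annotation library, can be pictured as a left join on the probeset identifier. The sketch below is an illustration under stated assumptions and not the μ-CS implementation: the column names (probeset_id, expression, gene_symbol), the function name annotate, and the tiny in-memory tables are all hypothetical stand-ins for real summary and annotation files.

```python
import csv
import io


def annotate(expression_rows, annotation_rows, key="probeset_id"):
    """Left-join summarized expression rows with annotation rows on `key`.

    Probesets without an annotation keep their expression values and get
    empty strings in the annotation columns.
    """
    annotation_rows = list(annotation_rows)
    by_id = {row[key]: row for row in annotation_rows}
    first = annotation_rows[0] if annotation_rows else {}
    extra_cols = [col for col in first if col != key]
    joined = []
    for row in expression_rows:
        ann = by_id.get(row[key], {})
        merged = dict(row)
        for col in extra_cols:
            merged[col] = ann.get(col, "")
        joined.append(merged)
    return joined


# Hypothetical, tiny stand-ins for a summary file and an annotation library.
summary_csv = "probeset_id,expression\n1001,7.2\n1002,9.8\n"
annotation_csv = "probeset_id,gene_symbol\n1001,GENE_A\n"

expr = list(csv.DictReader(io.StringIO(summary_csv)))
ann = list(csv.DictReader(io.StringIO(annotation_csv)))
table = annotate(expr, ann)
```

A left join is chosen here so that probesets missing from the annotation library are retained rather than silently dropped from the output table.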