A Web-based and Grid-enabled dChip version for the analysis of large sets of gene expression data

Background: Microarray techniques are among the main methods used to investigate thousands of gene expression profiles in order to elucidate the complex biological processes responsible for serious diseases, and they have a great scientific impact and a wide application area. Several standalone applications have been developed to analyze microarray data. Two of the best-known free analysis software packages are the R-based Bioconductor and dChip. The part of the dChip software concerning the calculation and the analysis of gene expression has been modified to permit its execution on both cluster environments (supercomputers) and Grid infrastructures (distributed computing). This work is not aimed at replacing existing tools, but it provides researchers with a method to analyze large datasets without any hardware or software constraints.
Results: An application able to perform the computation and the analysis of gene expression on large datasets has been developed using algorithms provided by dChip. Different tests have been carried out in order to validate the results and to compare the performances obtained on different infrastructures. Validation tests have been performed using a small dataset related to the comparison of HUVEC (Human Umbilical Vein Endothelial Cells) and Fibroblasts, derived from the same donors, treated with IFN-α. Performance tests, intended only to compare performances on different environments, have been executed using a large dataset including about 1000 samples related to Breast Cancer patients.
Conclusion: A Grid-enabled software application for the analysis of large microarray datasets has been proposed. The dChip software has been ported to the Linux platform and modified, using appropriate parallelization strategies, to permit its execution on both cluster environments and Grid infrastructures. The added value provided by the use of Grid technologies is the possibility to exploit both computational and data Grid infrastructures to analyze large datasets of distributed data. The software has been validated, and performances on cluster and Grid environments have been compared, obtaining quite good scalability results.


Background
In recent years, genomics and proteomics have deeply changed the scientific approach to the study of the molecular basis of cell and tissue behavior in both physiological and pathological conditions, giving the research community a new, comprehensive view.
As interest in these fields has grown, innovative and more suitable technologies have been developed. At present, one of the most promising and active fields is certainly microarray technology, which has so far had a great scientific impact and a wide application area. Several types of microarrays have been developed and proposed, each focused on a specific type of analysis, from genetic screening to proteomics and from biological research to diagnostics.
Through the comparison of genomic profiles it is possible to study gene expression differences among cross-correlated conditions and thus to understand their meaning. Thanks to microarray technology, a large number of genes can be investigated at the same time to find which are differentially expressed in a certain cell type. Quantitative researchers have proposed a variety of methods for handling probe-level data from Affymetrix® oligonucleotide arrays. Such methods employ different procedures for adjusting background fluorescence, normalizing data, incorporating information from "mismatch" probes, and summarizing probe sets.
Even though microarrays are a powerful instrument, studies on these data are often constrained by technological limits that reduce their potential. The most relevant limitation concerns the analysis of large datasets: this kind of analysis requires long computational times in addition to the availability of specific hardware. Analyzing microarrays requires a large amount of memory and computational power, and researchers often cannot carry out their studies because they lack access to suitable resources.
Several tools and algorithms have been developed to analyze microarray data, all of them standalone applications. Two of the best-known free analysis software packages are the R-based Bioconductor and dChip [1,2]. This work is not aimed at replacing those systems; rather, it provides researchers with a new method to analyze large datasets without any hardware or software constraints, simply by using a common web browser. To this end, the dChip software has been modified, using appropriate parallelization strategies, to permit its execution on both cluster environments and Grid infrastructures, exploiting existing computational and storage capabilities. Since dChip is a large application with many functionalities, this work is limited to the computation and the analysis of gene expression.

Implementation
The goal of this work is the design and development of a tool for the analysis of gene expression, to be included in a more general Grid-enabled software application for the analysis of microarray data. As an added value, the use of Grid technologies makes it possible to exploit both computational and data Grid infrastructures to analyze large datasets of distributed data. Execution on 64-bit computers is also supported.
With regard to the user interface, the first release of the application was implemented as a command line version to permit execution on remote computing elements. Two input files are used: the first contains specific options for the execution; the second contains the list of the microarray files used for the analysis. As a second step, in order to simplify the use of the above-mentioned dChip versions, the executables have been integrated within a biomedical portal [3,4], which provides a simple graphical user interface to run the application. Such a portal integration allows inexperienced users to store their experimental data on a complex storage system and to access distributed data and services in a transparent way. Furthermore, users can easily run the application from any computer or location with only an Internet connection, without losing time on installation and maintenance procedures. Moreover, users can use the software through a simple web interface and launch their analyses taking advantage of the possibility to orchestrate different portal services in a workflow strategy. Thanks to the simplicity of the web interface, users are not required to know the technical details of dChip.
The new software version has been designed to be modular, i.e. the original software has been divided into several independent modules, each performing a different part of the analysis. This approach has improved (i) optimization, by allowing the most appropriate parallelization strategy to be implemented for each part of the analysis, and (ii) scalability, by allowing one or more modules to be replaced transparently with more powerful ones or with modules providing different functionalities. The application has been structured in three different modules that have to be executed sequentially (Figure 1):
• module 1: opening, reading and normalization of CEL files
• module 2: computation of expression levels
• module 3: filtering, extraction and clustering of differentially expressed genes
Each of them has been designed as a standalone program working independently. Data and overall information are passed between the modules using the CSV (Comma Separated Values) file format. The final output is composed of three main files containing respectively: the expression values in an R-compatible format, the list of the differentially expressed genes and the cluster tree.
With large datasets, long execution times and great computational efforts are required. Parallelization strategies are necessary to improve performances and to allow the analysis of a large number of arrays in a short time. A first careful analysis of the dChip algorithms revealed the possibility to parallelize both the first and the second module, which implement the most computationally intensive algorithms. The applied parallelization does not affect the original algorithms of dChip; it only concerns the data access strategy. Since the algorithms for normalization and expression calculation work in different ways, two different parallelization approaches have been adopted.
The normalization algorithm is based on the invariant set method. It works by processing each array separately against a baseline chosen as the array with median intensity. Module 1 has therefore been parallelized over the microarrays. Each parallel execution opens a restricted number of files, normalizes them against the baseline and writes the related CSV output files.
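The per-array partitioning used for Module 1 can be pictured with a short sketch. The fragment below is only an illustration and not the dChip code: the function name `split_arrays` and the file names are hypothetical, and we assume that the arrays are dealt out to the jobs in contiguous, nearly equal blocks.

```python
def split_arrays(cel_files, n_jobs):
    """Deal a list of CEL file names into n_jobs contiguous, nearly equal blocks.

    Hypothetical helper: the actual assignment policy of the application
    is not described in the text.
    """
    if n_jobs < 1:
        raise ValueError("n_jobs must be at least 1")
    base, extra = divmod(len(cel_files), n_jobs)
    blocks, start = [], 0
    for job in range(n_jobs):
        # The first `extra` jobs receive one additional file.
        size = base + (1 if job < extra else 0)
        blocks.append(cel_files[start:start + size])
        start += size
    return blocks


# Illustrative file names only.
cel_files = [f"array_{i:03d}.CEL" for i in range(10)]
blocks = split_arrays(cel_files, 4)
for job, block in enumerate(blocks):
    # Each job would normalize only its own block against the shared baseline.
    print(job, block)
```

Because every array is normalized independently against the same baseline, the blocks can be processed without any communication between jobs.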
The dChip algorithms for the calculation of gene expression (PM-only and PM/MM methods) work in a different way, since they need the data of each gene from all microarrays. Module 2 has therefore been parallelized into a number of executions, each reading all CSV files of all normalized arrays but performing the calculation only on a restricted group of genes. The results of each subset are then merged to produce the output containing the expression levels of all genes.
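The per-gene partitioning of Module 2 can be sketched in the same spirit. In the fragment below every job sees all arrays but summarizes only its own range of genes, and the partial results are concatenated at the end. All names are hypothetical, the data are made up, and the per-gene mean is a deliberately simple placeholder: it is not the model-based expression computed by dChip.

```python
def gene_slices(n_genes, n_jobs):
    """Return (start, stop) index ranges that cover every gene exactly once."""
    base, extra = divmod(n_genes, n_jobs)
    slices, start = [], 0
    for job in range(n_jobs):
        stop = start + base + (1 if job < extra else 0)
        slices.append((start, stop))
        start = stop
    return slices


def run_job(arrays, start, stop):
    """One job: reads every array, but summarizes only genes[start:stop].

    The mean over probes is a stand-in for the real expression model.
    """
    rows = []
    for gene in range(start, stop):
        values = [sum(array[gene]) / len(array[gene]) for array in arrays]
        rows.append((gene, values))
    return rows


# arrays[a][g] holds the probe intensities of gene g on array a (made-up data).
arrays = [
    [[1.0, 3.0], [2.0, 2.0], [5.0, 7.0]],  # array 0
    [[2.0, 4.0], [4.0, 4.0], [6.0, 8.0]],  # array 1
]
n_genes = 3

partial = [run_job(arrays, start, stop) for start, stop in gene_slices(n_genes, 2)]
merged = [row for part in partial for row in part]  # merge step
print(merged)
```

The sketch also makes the cost of this strategy visible: the input is read in full by every job, so only the calculation itself is divided among them.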
The third module reads the CSV file containing the gene expression values and performs the filtering of genes, the extraction of differentially expressed genes and some clustering operations, using the unmodified dChip algorithms.
Two modes of parallel execution are available: with or without the MPI (Message Passing Interface) libraries. The non-MPI mode allows execution on environments that do not support MPI, but requires specific scripts for the management, submission and monitoring of parallel jobs.
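As a rough illustration of what such management scripts have to do in the non-MPI mode, the sketch below starts several independent processes, waits for them and collects their exit codes. It is an assumption-laden example: the commands are trivial stand-ins for the real module executables, the function names are hypothetical, and jobs are launched locally rather than through any Grid middleware.

```python
import subprocess
import sys


def launch_jobs(commands):
    """Start every command as an independent process (no MPI involved)."""
    return [
        subprocess.Popen(cmd, stdout=subprocess.PIPE, text=True)
        for cmd in commands
    ]


def wait_for_jobs(procs):
    """Block until each job ends; return (exit_code, stdout) for each job."""
    results = []
    for proc in procs:
        out, _ = proc.communicate()
        results.append((proc.returncode, out.strip()))
    return results


# Stand-in commands: each "job" only reports which block it would process.
commands = [
    [sys.executable, "-c", f"print('job {i} done')"] for i in range(3)
]

results = wait_for_jobs(launch_jobs(commands))
failed = [i for i, (code, _) in enumerate(results) if code != 0]
print(results, failed)
```

A real script would additionally resubmit the jobs listed in `failed` and start the merge step only once all partial outputs are available.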
Finally, the code has been modified to enable submission to the Grid infrastructure. For this purpose the gLite [5] middleware has been considered. To read and write files on remote and distributed storage elements, the GFAL API [6] has been used. In this way it is possible to access data, reading whole files or parts of them, directly where they are stored, without moving them to the Grid elements that actually run the calculation. Thanks to a Public Key Infrastructure (PKI) [7], which provides X.509 certificate-based authentication, this solution preserves user privacy and data security.

Results and discussion
An application able to perform the computation and the analysis of gene expression on large microarray datasets has been developed using dChip algorithms. In detail, concerning pre-analysis, the invariant-set method is used for normalization, and either the PM-MM difference model or the PM-only model can be chosen for the calculation of gene expression. Original dChip functionalities such as filtering, discovery of differentially expressed genes (compare samples) and clustering are provided by Module 3. Customized analyses can be performed by setting specific parameters inside the input file.
By modifying the Makefile with the appropriate options it is very easy to obtain different versions of the application depending on the kind of infrastructure chosen for the analysis: standalone Linux, MPI or Grid-enabled versions.
Figure 1 Organization of dChip in different modules. A graphical representation of the developed dChip modules is shown. The original software was divided into three different modules concerning respectively (i) normalization, (ii) expression values computation, (iii) filtering, differentially expressed genes extraction and clustering.
Starting from the developed application, different tests have been performed in order to validate the results and to compare the performances obtained on different infrastructures. To this end, the tests have been divided into two categories:
• Validation Tests
• Performance Tests

Validation Tests
In order to validate the results obtained with the developed software, a small dataset coming from a published study [8] has been used for the analysis. The case study concerns the comparison of results obtained from separate analyses of HUVEC (Human Umbilical Vein Endothelial Cells) and Fibroblasts, derived from the same donors and treated with Interferon-α (IFN-α), with the purpose of identifying the effects of interferon on the transcriptome of endothelial cells.
The dataset is divided in two parts with the following features:
The datasets are both composed of baseline and experiment arrays (respectively untreated and treated with IFN-α), and for each of them the following steps, according to the original analysis, have been performed:
• Normalization: Invariant-set method
• Model-based expression: PM Only model [9,10]
• Extraction of differentially expressed genes: fold change with threshold 2.
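The last step of the list above, the fold-change extraction, can be illustrated with a minimal sketch. The fragment assumes positive, non-log expression values and treats the threshold as a strict bound in both directions; the function name, gene names and values are hypothetical, and the actual dChip comparison may apply additional criteria.

```python
def fold_change_filter(baseline, experiment, threshold=2.0):
    """Select genes whose experiment/baseline ratio exceeds the threshold
    (up-regulated) or falls below its reciprocal (down-regulated).

    Assumes positive expression values on the original (non-log) scale.
    """
    selected = {}
    for gene, base_value in baseline.items():
        ratio = experiment[gene] / base_value
        if ratio > threshold or ratio < 1.0 / threshold:
            selected[gene] = ratio
    return selected


# Made-up mean expression values for three placeholder genes.
baseline = {"geneA": 100.0, "geneB": 200.0, "geneC": 400.0}
experiment = {"geneA": 450.0, "geneB": 220.0, "geneC": 100.0}

hits = fold_change_filter(baseline, experiment)
print(hits)
```

With these illustrative values geneA is reported as up-regulated (ratio 4.5) and geneC as down-regulated (ratio 0.25), while geneB is discarded.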
As a first test, the microarrays have been analyzed using both the original and the modified versions of dChip (standalone Linux, parallel and Grid-enabled). The same options have been set in all tests so that the final results can be compared. Tables 1 and 2 and Figures 2 and 3 report the mean gene expression values, computed on the baseline and experiment arrays of the HUVEC data respectively, obtained with the developed and the original dChip versions. All the new dChip versions give the same results. There is a very small difference between the Windows and Linux versions of dChip, probably due to differences in numerical approximation between compilers on the two platforms. However, these small differences do not affect the overall final result, which can be considered essentially the same.
As a second test, the same analysis has been performed with the R/Bioconductor software, using both the GCRMA [11] and RMA [12] algorithms, and the results have been compared with those previously obtained with dChip.
This comparison is mainly for completeness, since the dataset was published with results coming from an R/Bioconductor analysis.
Although there are currently many different methods for processing and summarizing probe-level data from Affymetrix oligonucleotide arrays, R/Bioconductor and dChip are two of the most popular, and they consistently produce the best agreement between oligo array and RT-PCR data for medium and high intensity genes [13,14]. It is known that expression values computed with the dChip and RMA algorithms are often similar, while GCRMA gives different results. Tables 3 and 4 and Figures 2 and 3 show the comparison between the dChip Linux and R/Bioconductor results obtained on the same data. dChip and RMA show similar trends, unlike GCRMA.
Finally, the results of the entire analysis of the differences between HUVEC and FB are illustrated. Using all the developed dChip versions, we found that in HUVEC 239 genes were up-regulated (> 2-fold increase) by IFN, including genes involved in the host response to RNA viruses, inflammation, and apoptosis. Interestingly, 35 genes showed a > 4-fold higher induction compared with human fibroblasts. Because the results of the published study were obtained using the GCRMA algorithm, they are not exactly the same as those of dChip: the published study reports 175 genes up-regulated by IFNs in HUVEC and 41 genes with a > 5-fold higher induction compared with human fibroblasts. Nevertheless, the two sets of results are quite similar.
In particular (Table 5) we have found that CXCL11 (chemokine (C-X-C motif) ligand 11) is selectively induced by IFN-α along with other genes associated with angiogenesis regulation, including CXCL10, TRAIL, and guanylate-binding protein 1.

Performance Tests
These tests have no biological meaning; their only purpose is to compare performances on different environments using a large dataset. In detail, several tests of the application have been performed on both cluster and Grid environments using different degrees of parallelization, and the final results have been compared.
In order to create a large dataset for testing purposes, the on-line public repository GEO [15] has been used. A dataset of 1000 HG-U133A Breast Cancer microarrays has been made available. It shows the following features:
• ChipType: Affymetrix GeneChip HG-U133A
Figure 2 Trend of (mean) expression values of baseline HUVEC arrays using R/Bioconductor and dChip algorithms. A graphical representation of the results presented in Tables 1 and 3 is shown. It is worth noting that the results of the dChip versions overlap and have a trend similar to that of the RMA algorithm.
bioinformatics applications requiring great computational efforts.
Grid tests have been performed using the gLite middleware on the BIOMED Virtual Organization [17] of the EGEE (Enabling Grids for E-sciencE) infrastructure [18]. In this case the data had previously been uploaded to several remote and distributed storage elements and were analyzed by submitting several parallel jobs through suitable strategies.
For the Grid tests, a non-MPI parallel implementation has been used, because MPI jobs are at present unstable on the gLite middleware. Several parallel jobs can be submitted and monitored using ad hoc scripts.
Two kinds of test have been carried out: (i) scalability with the number of parallel jobs, (ii) scalability with the number of microarrays.
As a first test, a subset of 100 microarrays has been analyzed to compare the performances of the two parallel modules as the degree of parallelization changes, on both the cluster and the Grid implementation. In detail, four tests have been run using respectively 5, 10, 15 and 20 parallel jobs. By comparing the results, we can observe that Module 1 scales better (Figures 4A and 5A), owing to the different parallelization strategy adopted. In fact, while in Module 1 there is a reduction in time for all three execution steps (file opening, normalization and output writing), in Module 2 (Figures 4B and 5B) only the gene expression calculation is reduced, whereas file opening and output writing remain constant, since every job still reads all the normalized arrays. The comparison of cluster and Grid executions shows that this trend is approximately the same in both conditions. Finally, the speedup ratio (S(N) = T(1)/T(N), where T(1) is the execution time on a single processor and T(N) the execution time on N processors) has been calculated for the cluster tests in order to estimate the parallelization efficiency. As shown in Table 6, for 100 microarrays the parallelization gives quite good results up to 15 parallel jobs.
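The speedup ratio defined above is straightforward to compute from measured execution times. The sketch below also derives the parallel efficiency E(N) = S(N)/N, which is the usual textbook definition and is our assumption here, since the text does not give a formula for it. The timings are invented for illustration and are not the values of Table 6.

```python
def speedup(t1, tn):
    """S(N) = T(1) / T(N)."""
    return t1 / tn


def efficiency(t1, tn, n):
    """E(N) = S(N) / N; equals 1.0 for ideal linear scaling."""
    return speedup(t1, tn) / n


# Hypothetical execution times in seconds, keyed by number of parallel jobs.
timings = {1: 1200.0, 5: 300.0, 10: 160.0, 20: 120.0}

t1 = timings[1]
report = {
    n: (speedup(t1, tn), efficiency(t1, tn, n)) for n, tn in timings.items()
}
for n, (s, e) in report.items():
    print(f"N={n:2d}  speedup={s:.2f}  efficiency={e:.2f}")
```

In this made-up example the efficiency falls from 0.8 with 5 jobs to 0.5 with 20 jobs, which is the kind of flattening one would expect once the non-parallel steps start to dominate the execution time.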
Figure 3 Trend of (mean) expression values of experiment HUVEC arrays using R/Bioconductor and dChip algorithms. A graphical representation of the results presented in Tables 2 and 4 is shown. It is worth noting that the results of all dChip versions overlap and have a trend similar to that of the RMA algorithm.
The induction of IFN-α on HUVEC with respect to human fibroblasts is shown. In particular, the up-regulated HUVEC genes with a fold induction more than 3 times higher than in fibroblasts are shown. The genes associated with angiogenesis induction are marked in bold text. As expected, the results show that the CXCL11 gene is the most discriminating.
As a second test, the whole 1000-microarray dataset has been analyzed on the Grid, by running the parallel dChip version with 10 parallel jobs, in order to investigate how performance varies with the dataset size. Figure 6 shows the results for the Module 1 and Module 2 executions, respectively. Cluster performances show better results than the Grid ones.
The great advantage of the Grid for researchers is, in fact, the possibility to store a large amount of data, to run complex algorithms and to access all data shared by Grid virtual communities, using remote resources.
This feature is especially relevant for small laboratories and for researchers who, because of the high cost of producing microarray data, cannot perform analyses on large datasets (a necessary condition for obtaining better results). Access to a Grid environment makes it possible to use all the datasets made available by the community and to perform more accurate analyses. As previously said, the Grid provides several advantages related to security as well. With the Grid certificate-based authentication, data are protected from possible attacks on privacy and security.

Scalability on cluster environment by increasing parallelization rate
Finally, the integration of this application into a Grid-enabled portal provides a simple graphical user interface to run the application. In this way, researchers do not need any particular hardware or software installed locally but only a web connection to the portal.
Besides the cost of producing data, another relevant issue concerns the analysis of large datasets using standalone applications running on local hardware. The use of such applications requires powerful computers to be available locally, which is often not possible, even in large laboratories. The analysis of 1000 microarrays described above is an example of an experiment that could not be performed using standalone applications, even on the most powerful recent computers.
Our solution solves this problem and provides users with a web-based service able to launch several analyses in parallel, with the status of the executions easily monitored directly from the portal.

Conclusion
A scalable way to analyze large microarray datasets has been presented. To this end, we have ported existing tools to High Performance and Grid Computing environments. The dChip software has been ported to Linux platforms and modified, using appropriate parallelization strategies, to permit its execution on both cluster environments and Grid infrastructures. The added value provided by the use of Grid technologies is the possibility of exploiting both computational and data Grid infrastructures to analyze large datasets of distributed data. The software has been successfully validated through comparison with the original standalone Windows version of dChip. Performance tests were carried out in order to investigate the performance improvements obtained with the adopted parallelization strategies. These tests have also been used to compare cluster and Grid performances. As a result, we found that parallelization gives quite good results in terms of execution times, especially for the first module. Furthermore, we found that Grid executions take longer than cluster ones. It is worth noting, however, that the relevance of Grid computing for the presented application lies principally in the opportunity of sharing data among different research groups and of exploiting distributed computational resources, rather than in the improvement of performances.
The table shows the mean execution times coming from three different executions of the dChip MPI version run on the LITBIO cluster using respectively 1,