ArrayMining: a modular web-application for microarray analysis combining ensemble and consensus methods with cross-study normalization

Background Statistical analysis of DNA microarray data provides a valuable diagnostic tool for the investigation of genetic components of diseases. To take advantage of the multitude of available data sets and analysis methods, it is desirable to combine both different algorithms and data from different studies. Applying ensemble learning, consensus clustering and cross-study normalization methods for this purpose in an almost fully automated process and linking different analysis modules together under a single interface would simplify many microarray analysis tasks. Results We present ArrayMining.net, a web-application for microarray analysis that provides easy access to a wide choice of feature selection, clustering, prediction, gene set analysis and cross-study normalization methods. In contrast to other microarray-related web-tools, multiple algorithms and data sets for an analysis task can be combined using ensemble feature selection, ensemble prediction, consensus clustering and cross-platform data integration. By interlinking different analysis tools in a modular fashion, new exploratory routes become available, e.g. ensemble sample classification using features obtained from a gene set analysis and data from multiple studies. The analysis is further simplified by automatic parameter selection mechanisms and linkage to web tools and databases for functional annotation and literature mining. Conclusion ArrayMining.net is a free web-application for microarray analysis combining a broad choice of algorithms based on ensemble and consensus methods, using automatic parameter selection and integration with annotation databases.

Example analysis of the leukemia microarray data set by Armstrong et al.
To illustrate the features available on ArrayMining.net, we have applied algorithms from different analysis modules to the well-known leukemia data set by Armstrong et al. In the following sections we will present the data set and pre-processing methods, discuss gene selection and clustering results obtained with our ensemble and consensus methods, as well as an example combination of the gene set analysis module with the class assignment module.

Data set and pre-processing
The leukemia data set by Armstrong et al. [1] contains expression values for 12,626 genes and 72 microarray samples, which are subdivided into three leukemia subtypes: Acute lymphoblastic leukemia (ALL, 24 samples), acute myelogenous leukemia (AML, 28 samples) and ALL with mixed-lineage leukemia gene translocation (MLL, 20 samples). Affymetrix U95A or U95A V2 oligonucleotide arrays [2] had been used in the study to obtain the experimental data.
To pre-process the raw data we applied the variance stabilizing normalization [3] using the expressopackage in the R statistical learning environment [4]. Moreover, we imposed thresholds based on the suggestions in the supplementary material of the original publication [1] and applied a fold-change filter to remove features with low variance (all gene vectors with less than a 5-fold change between the maximum and minimum expression value were discarded).

Gene selection results
In this section we present the results for a gene selection analysis on ArrayMining.net using the Armstrong et al. leukemia data and discuss the selected genes in detail. Table 1 shows the top 30 genes chosen by our ENSEMBLE method (the annotations in column 2 have been extracted from the DAVID [5] database; for two genes, SMAD1 and LGALS1, two different genetic probes matched to the same gene). We first identified those genes which were already identified as significantly differentially expressed in the original study on the Armstrong et al. data set [1] and then investigated the functional annotations of the genes based on the GO and KEGG data bases, ignoring very general annotations like "cell development" or "cell communication". Finally, we searched for known functional associations between the genes and leukemia development in the biomedical literature.
As already stated in the study by Armstrong et al. many under-expressed genes in the MLL subtype have a function in early B-cell development. Among the genes belonging to this group, we identified MME, CD24 and DNTT, POU2AF1 and LIG4 as significantly differentially expressed, which were already detected and discussed in the original study [1]. Similarly, our results confirm the finding by Armstrong et al. that certain adhesion molecules (LGALS1, ANXA1, CD44) are significantly over-expressed in MLL, as well as the myeloid-specific gene CCNA1. Other genes from our ranking which were already mentioned as distinguishing MLL from ALL by Armstrong et al. are MYLK, FOXO1, MYH10 and VAMP5 (see individual gene discussion below). When investigating the available annotation data for the selected genes, we found that five genes are known to be involved in immune system processes (associated to the Gene Ontology terms "immune response" and "immune system process"). These are CD24, LIG4, CFD, POU2AF1 and CD27.
• CD24 is a glycoprotein known to be involved in metastasis and highly expressed in many tumours, mediating apoptosis in precursor-B acute lymphoblastic leukemia cell lines [6].
• LIG4 is the gene encoding the DNA Ligase IV protein, which joins double-strand breaks in the DNA, and a mutation in LIG4 has been suggested to confer a pre-disposition to leukemia [7].
• CFD (adipsin) is a serine protease involved in the alternative complement pathway as part of the innate immune system. The gene is located in a chromosomal region which is known to be associated with myeloid cell differentiation by means of changes in chromatin organization [8].
• The gene POU2AF1 (see Fig. 2) encodes a transcriptional regulator required by the immune system for the formation of germinal centers in lymph follicles after antigen contact and binding specifically to either the transcription factor OCT1 or OCT2 in the B-cell response to antigens [9,10]. Deregulation of POU2AF1 by means of translocation has been implicated in lymphoma and leukemia development [11] and the gene's expression levels have previously been shown to vary across different lymphoma types by means of real-time quantitative PCR analysis [12].
• CD27 is a tumor necrosis factor receptor which has been implicated in B-cell activation and found to be differentially expressed in normal B-cells and neoplastic B-cells [13].
Six genes in the ranking were found to be involved in apoptosis. Apart from the already discussed genes CD24, LIG4 and CD27 these include FOXO1, ANXA1 and LGALS1.
• FOXO1 is a member of the forkhead family of transcription factors which has been associated with cell cycle arrest and apoptosis of hematopoietic cells [14].
• ANXA1, belonging to the adhesion molecules over-expressed in MLL (see above), is a phospholipidbinding protein with anti-inflammatory functions which might result from its inhibitory effect on the inflammation mediator phospholipase 2 [15]. It has been reported to be differentially expressed in various cancers and is used as marker in an assay to differentiate between hairy cell leukaemia and other B-cell malignant diseases [16,17].
• LGALS1 is another adhesion molecule over-expressed in MLL, which has been reported to induce apoptosis of human thymocytes [18] and interact with the oncogene H-Ras [19].
Three other genes from the ranking, ZNF423, LEF1 and VAMP5, are associated with the GO-term for cell differentiation.
• ZNF423 is a transcription factor whose deregulated expression has been shown to contribute to the induction of the terminal phase of chronic myelogenous leukemia, known as "blast crisis", which is clinically similar to an acute leukemia [20].
• LEF1, Lymphoid enhancer-binding factor 1, is a transcription factor expressed in pre-B and T-cells. It has been implicated in leukemogenesis based on experiments in which mice, transplanted with bone marrow retrovirally transduced to express LEF1, developed B lymphoblastic and acute myeloid leukemia [21].
• VAMP5 is a member of the family of vesicle-associated membrane proteins. Since the mRNA and protein levels of VAMP5 have been shown to be increased during in vitro myogenesis, it has been suggested that VAMP5 could be involved in vesicle trafficking events that are associated with myogenesis [22].
Among the top-ranking genes not discussed so far, four are involved in DNA-dependent regulation of transcription (GO-term 6355): HOXA10, SMAD1, MLXIP and SETBP1.
• HOXA10 is transcription factor whose over-expression in murine hematopoietic cells has been reported to perturb myeloid and lymphoid differentiation leading to acute myeloid leukemia [23].
• SMAD1 (see Fig. 2) is a transcriptional modulator and a component of the transforming growth factor (TGF)-beta signaling pathway, which plays a key role in cell differentiation and apoptosis pathways [24]. Different studies have revealed that mutations in SMAD-genes can cause a disruption of this pathway leading to various types of leukemia [25,24].
• MLXIP is a protein interacting with MAX-like protein X (MLX), which plays a role in proliferation and differentiation. It has been shown that MAX and MAX-like proteins can form heterodimers with MAD family proteins which oppose the growth-promoting action of heterodimers between MAX and the oncogene MYC [26].
• SETBP1 is a transcription factor involved in hematopoietic stem cell (HSC) regulation, and fusion of SETBP1 with another gene (NUP98) has been reported in T-cell lymphoblastic leukaemia [27].
Among the remaining genes in the ranking, which were not found in specific cancer-related GO or KEGG terms, we first discuss the six genes that were already identified by Armstrong et al. as discriminators between ALL and MLL: MME, CCNA1, DNTT, CD44, MYLK, MYH10.
• MME stands for the enzyme "membrane metallo-endopeptidase" and is also known as "common acute lymphoblastic leukemia antigen" (CALLA). The protein is involved in the degradation of secreted peptides and has been suggested to play a role at an early stage of lymphoid differentiation [28]. Furthermore, MME has been demonstrated to be expressed in most acute lymphoblastic lymphomas and in some B-cell and T-cell lymphomas [29].
• CCNA1 is a cyclin protein involved in cell cycle regulation which is highly expressed in various myeloid leukemic cell lines and has therefore been implicated in germline meiotic cell cycle control [30].
• DNTT encodes a template-independent DNA polymerase which generates antigen receptor diversity by adding nucleotides at the junction of rearranged Ig heavy chainand T-cell receptor gene segments during the maturation of B-and T-cells [31,32]. DNTT has been suggested as a marker distinguishing subtypes of lymphoid leukemias of childhood [33].
• CD44 is a cell-surface protein with a great variety of functions resulting from a large number of splicing isoforms. It is involved in cell adhesion and migration processes and more specifically, lymphocyte activation, tumor metastasis and hematopoiesis [34]. The ligation of CD44 has been reported to reverse blockage of differentiation in human acute myeloid leukemia [35].
• MYLK (myosin light chain kinase, see Fig. 2) is an enzyme which phosphorylates myosin light chains to support the interaction of actin filaments with myosin and changed expression of MYLK leading to inhibition or potentiation of myosin II activation has been shown to delay or accelerate tumor necrosis factor-alpha (TNF)-induced apoptotic cell death [36]. Moreover, expression of MYLK is correlated with disease recurrence and distant metastasis in non-small cell lung cancer [37].
• MYH10 (myosin heavy chain 10, non-muscle) is a gene coding for a myosin protein with putative functions in cytokinesis and cell shape, and has been reported to be down-regulated in patients with T-lineage acute lymphoblastic leukemia for whom induction therapy fails [38].
Finally, we discuss the genes which were neither found in cancer-specific GO or KEGG terms nor discussed as being functionally related to leukemia development in the paper by Armstong et al. These are WFS1, RGL1, NT5E, LRIG1, AKAP12, AUTS2 and DDR1.
• WFS1 (see Fig. 2) has been investigated as a candidate glucocorticoid-response gene in childhood acute lymphoblastic leukemia (glucocorticoids mediate apoptosis of lymphoid cells and are therefore used in chemotherapy for lymphoid malignancies) [39].
• RGL1 is a guanine nucleotide exchange factor and a proposed effector of the oncogene ras and other ras-like proteins [40].
• NT5E is an enzyme catalyzing the dephosphorylation of AMP and other nucleoside monophosphates and is used as a marker for lymphocyte differentiation [41]. The activity of NT5E has been observed to be strongly reduced in peripheral blood lymphocytes from B-Cell chronic lymphocytic leukemia patients in comparison to normal cells [42].
• LRIG1 encodes a transmembrane protein which interacts with receptor tyrosine kinases of the epidermal growth factor receptor (EGFR) family and restricts growth factor signaling by enhancing receptor degradation [43]. The protein has been investigated as a potential tumor suppressor and was found to be down-regulated in conventional and papillary renal cell carcinomas [44].
• AKAP12 is a tumor suppressor gene encoding a cell growth-related protein binding to the regulatory subunit of protein kinase A [45]. Based on real-time quantitative PCR measurements on 162 samples, AKAP12 expression has been found to be decreased in samples of acute leukaemia as compared to healthy controls and to be associated with an inferior overall survival [46].
• The function of AUTS2 is currently unknown, but in patients with B-cell precursor acute lymphoblastic leukemia (BCP-ALL) fusions of AUTS2 with PAX5, an important regulator of B-cell development, have been reported [47].
• DDR1 is a receptor tyrosine kinase which is up-regulated by p53 oncoprotein [48] and has been identified as significantly over-expressed in many human tumors [49,50,51].
In summary, almost all of the selected genes are either known oncogenes or tumor suppressor genes or have been suggested to be functionally associated with cancer progression or immune response processes. Differences in the expression status of these genes across different cancer types or different stages or subtypes of specific cancer diseases are therefore to be expected, which matches to the observation of differential expression across different leukemia subtypes on the Armstrong et al. data set. Although false positives cannot be ruled out in this approach, the functional annotations of the selected genes and the related findings from the literature suggest that this gene selection scheme is useful in prioritizing genes and proteins for investigating their potential involvement in explaining differences between diverse conditions of the biological system of interest.

Clustering results
As an example application of the class discovery module on ArrayMining.net, we present results obtained on the leukemia data set by Armstrong et al. using the ensemble clustering method. Prior to the analysis we standardized the samples by subtracting the median from the expression values and dividing by the median absolute deviation. Moreover, we applied a variance filter, retaining only the 2000 genes with the highest variance across the samples (both of these options are available on the class discovery interface on ArrayMining.net).
When comparing the estimated optimal number of clusters for different pairs of clustering methods and cluster validity indices, the great majority of approaches estimate 2 as the optimal number of clusters.
In order to investigate this result in more detail, we manually inspected the optimal clustering results with regard to the validity indices for different algorithms and found that the methods could perfectly distinguish the Mixed Lineage Leukemia (MLL) subtype from the Acute Lymphoblastic Leukemia (ALL) subtype, but could not separate ALL from Acute Myeloid Leukemia (AML) and only partly separate MLL from AML samples. In most clusterings, all MLL samples were assigned to the same cluster and only some of the AML samples were assigned additionally to this cluster, but none of ALL samples. Although the original grouping of samples into ALL, AML and MLL subtypes is based on objective criteria, several meaningful cluster structures might exist in the data and there is no single objective criterion as to what the "real" or "optimal" clustering structure is. However, we notice that MLL was assigned to a separate cluster from the ALL samples and only some of the AML samples were additionally assigned to the MLL-cluster, which matches to the finding in the original study by Armstrong et al. according to which the MLL subtype is more similar to AML than to ALL (with regard to results from a principal component analysis and based on the expression of myeloid-specific genes in MLL).
The clustering results were also combined into a single representative clustering using our Simulated Annealing based consensus clustering method and the outcome was visualized using a principal component analysis plot (see Fig. 3). This plot confirms the observations from the manual inspection of the single algorithm clustering results according to which the MLL samples can perfectly be separated from the ALL group and are partly overlapping with the AML samples. A silhouette plots visualizes the silhouette widths for each sample as a reliability measure for the corresponding cluster assignment [52] (see Fig. 4). When inspecting a 3-dimensional visualization of pre-filtered data using an Independent Component Analysis, a better separation of the tumor subtypes was observed, with no overlap between the ALL/MLL groups and the AML/ALL groups and only a small overlap between AML and MLL samples (see the VRML-file in the Supplementary Material). Moreover, density estimation contour surfaces revealed three regions of high data density (colored green in the VRML-file) corresponding to the three leukemia types. On the whole, although only standard parameters had been used and all results were generated in an automatic process by the class discovery module, the clustering and visualization results enable the user to distinguish between major sample subgroups in the data (e.g. ALL vs. MLL) and provide other useful insights (e.g. MLL samples are more similar to AML samples than to ALL samples with regard to their expression profiles).

Gene set analysis results
To demonstrate the features of our gene set analysis module, we tested the enrichment of a collection of 37 cancer-related gene sets from the van Andel Institute in Michigan [53] in the leukemia data set by Armstrong et al. The module also provides access to Gene Ontology and KEGG gene sets, but these contain many very general gene sets (e.g. "cell cycle"-related genes) which are enriched in almost all microarray data sets and therefore only have a limited value for biological interpretation of the data. We applied the MDS-GSA method using multi-dimensional scaling to combine the expression vectors for the genes in a gene set into a single signature vector (the "meta-gene"). To asses whether the enriched gene sets were differentially expressed across pairs of different sample groups the empirical Bayes t-statistic was used [54]. The q-value significance scores in the ranked table of gene sets, computed based on the Benjamini-Hochberg method, suggest that many cancer-related gene sets are significantly differentially expressed across the sample groups. We therefore only discuss the three gene sets with the smallest q-values as example cases: VEGF down, ES M down and FH down.
• The VEGF down gene set contains genes associated with vasculogenesis and angiogenesis obtained from a microarray study in which human umbilical cord vein endothelial cell (HUVEC) isolates were treated with vascular endothelial growth factor-A (VEGF-A) in low or high serum media. Apart from vasculogenesis and angiogenesis, VEGF has also been associated with growth, dissemination, metastasis and poor outcome in solid tumors and has been shown to be an independent predictor of outcome in patients with acute myeloid leukemia (AML) [55]. Moreover, an analysis of VEGF expression in the bone marrow of AML patients showed that VEGF is restricted to certain stages of differentiation and correlates with AML sub-categories [56]. The role of VEGF in myeloid leukemias is also highlighted by another study which showed that a broad spectrum of VEGF receptors is expressed in various myeloid neoplasms [57].
• The ES M down gene set was obtained from a study comparing the expression levels of genes in multipotent mesenchymal embryonic stem cells (ESM) against adult mesenchymal stem cells (MSC). It is composed of the genes which are significantly down-regulated in ESM cells as compared to MSCs. Similarities between stem cells and some cancer cells, e.g. the potential for self-renewal, have been widely discussed in the literature [58]. Another example for this is the cytokine "leukemia inhibitory  factor" (LIF), which induces terminal differentiation of myeloid leukemia cells and also plays an important role in the regulation of signaling in embryonic stem cells [59].
• The FH down gene set is derived from microarray experiments studying differences in gene expression patterns in uterine fibroids caused by mutations in the fumarate hydratase (FH) gene. It contains genes which were significantly down-regulated in fibroids containing FH-mutations as opposed to fibroids with wild-type FH. Mutations and other defects in mitochondrial enzymes like FH have been reported to predispose to tumorigenesis. Mitochondrial DNA alterations have for example been implicated in the transformation of myelodysplastic syndromes into acute leukemia [60].
In addition to the ranking of gene sets, a heat map was generated to visualize the meta-gene expression values across different samples and sample groups (see Fig. 5). Although the interpretation of the metagene expression vectors, obtained from combining multiple genes into a single vector by means of multidimensional scaling, is not as straightforward as in the case of single gene expression vectors, the heat map suggests that the derived meta-genes can to certain extent distinguish between the three leukemia classes. Moreover, in agreement with our previous findings in the clustering analysis, the heat map indicates that the AML samples are more similar to the MLL than to the ALL samples. A subgroup of about 8 samples in the AML-group (the columns on the right in the heat map) appears to bear a particularly close similarity to the MLL group with regard to the meta-gene expression profile. These preliminary gene set analysis results suggest that this type of analysis can provide additional insights as compared to a so-called "singular enrichment analysis", in which the genes are first pre-selected using feature selection before testing the enrichment of certain functional gene groups (see gene selection results above). Moreover, by using cancer-related gene sets instead of Gene Ontology or KEGG derived gene sets, the annotations of the enriched gene sets are likely to be more specific and more informative for the biological interpretation of the data.

Prediction results
Using the class assignment module on ArrayMining.net we illustrate how sample classification results can be obtained in an automatic fashion by uploading data on this module and how the results can be improved by using features obtained from another analysis module (in this case, the gene set analysis module). We applied a 10-fold external cross-validation [61] analysis on the leukemia data by Armstrong et al. using our ENSEMBLE prediction method and the empirical Bayes t-statistic for feature selection [54]. Moreover, we applied the same cross-validation scheme and the same predictor to meta-gene expression values derived from a gene set analysis on the same data set using the GSA-MDS method and Gene Ontology derived gene sets (we chose the Gene Ontology data base here, because it contains a much larger number of gene sets than our collection of cancer-related gene sets). Based on single genes as features an average cross-validated accuracy of 80.9% (±14%) was obtained, whereas 87.5% (±11%) accuracy were reached when using the meta-genes as input. Although some studies have reported higher accuracies on this data set, we think these are reasonable results for a 3-class microarray classification problem with 72 samples, using an almost fully automated process and an external cross-validation scheme [61].
Since the classification results are affected by high variance, which is common in microarray studies with small sample sizes, we additionally report results for three other gene expression cancer data sets using the same methodology as above (see Tab. 4). Similar trends can be observed across the different data sets: Using meta-genes representing biological pathways as features similar accuracies and standard deviations were reached as based on single genes. This suggests that the user can generate models of similar predictive quality based on biological pathways and on single genes, enabling two different model-based interpretations of the data. In summary, these example results suggest that the class assignment module can be a helpful tool to compare cross-validated prediction accuracies for different data sets and methods. In combination with the gene set analysis module, prediction models based on biological pathways as features with high classification accuracies can be obtained based on an almost fully automated process.  [62] 92.3 9 89.5 10 DLBCL [63] 95.0 9 95.0 6 T-Cell Lymphoma [64] 81.0 13 82. 6 14 Comparison of 10-fold cross-validation sample classification results obtained on different microarray cancer data sets using both single genes and summarized gene sets as features