AMDA: an R package for the automated microarray data analysis

Background Microarrays are routinely used to assess mRNA transcript levels on a genome-wide scale. Large amount of microarray datasets are now available in several databases, and new experiments are constantly being performed. In spite of this fact, few and limited tools exist for quickly and easily analyzing the results. Microarray analysis can be challenging for researchers without the necessary training and it can be time-consuming for service providers with many users. Results To address these problems we have developed an automated microarray data analysis (AMDA) software, which provides scientists with an easy and integrated system for the analysis of Affymetrix microarray experiments. AMDA is free and it is available as an R package. It is based on the Bioconductor project that provides a number of powerful bioinformatics and microarray analysis tools. This automated pipeline integrates different functions available in the R and Bioconductor projects with newly developed functions. AMDA covers all of the steps, performing a full data analysis, including image analysis, quality controls, normalization, selection of differentially expressed genes, clustering, correspondence analysis and functional evaluation. Finally a LaTEX document is dynamically generated depending on the performed analysis steps. The generated report contains comments and analysis results as well as the references to several files for a deeper investigation. Conclusion AMDA is freely available as an R package under the GPL license. The package as well as an example analysis report can be downloaded in the Services/Bioinformatics section of the Genopolis

First af all for each sample the percentage of probesets with Detection call "Absent" (A) over the total number of probesets is determined, as well as for the probesets with Detection call "Present" (P).
Next for each sample the mean value of probeset called A and P respectively is determined. Obviously the values associated to P calls are expected to be greater than values associated to A calls.
Finally the ratios between the expression values for 3' and 5' end of gapdh and actin transcripts are determined. This is useful to control the degree of sample degradation, since for these probesets probes are available to bind both ends of the transcript. If the in vitro synthesis reactions performed well the signal on both ends of the transcript is expected to be similar. Therefore a dashed horizontal line at the value of 1 represents the optimal performance.

Array normalization
In order to make the different chips (Table 1) comparable, it is necessary to normalize signal intensities. Two different diagnostic plots are useful for this purpose: box plots and scatter plots. Boxplots are one of the most intuitive ways of visualizing and comparing relevant properties of distributions of values across different arrays, e.g. first quartile, median and third quartile. Note that after normalization the distributions are more similar. Scatter plots are simply log plots of the expression intensities between for all the non redundant pairs of replicated samples. The adjusted R squared (adj-R2) measures goodness of fit of the respective linear regression in the scatter plots. In case of technical replicates an adj-R2 of around 0.95 is expected; in case of biological replicates the correlation can eventually drop below 0.9. Depending on the quality of the raw data at least a slight improvement of the correlation coefficient is usually observed after normalization.
Since the analysis started with the analysis of CEL image files, it is worth to normalize the samples at that level, i.e. at the probe level, before that Figure 2: Box plot of probe-level array data before the normalization procedure the different probes for each gene are summarized in one unique probeset. The figure 2 shows the boxplots for the distribution of probe level data in all the samples before the normalization procedure. Affymetrix Signal is used in case it is specifically required or in case the number of arrays is less than 7. If this algorithm is used the qSpline method is chosen for the normalization of samples after the generation of the probeset intensity summary (Workman et al., 2002). Other methods can be used if specifically required or if the number of arrays is greater than 7. These methods use the quantile method for the normalization of samples (Bolstad, 2001).
For this report probeset intensity summaries have been generated using GCRMA algorithm. The figure 3 shows the boxplots for the distribution of probeset intensities in all the samples after the normalization procedure. The figure 4 shows the logarithmic scatter plots of probeset intensities in all the non redundant pairs of replicated samples after the normalization procedure.

Prefiltering
To filter out noisy data before the selection of differentially expressed genes a filter is applied based on Detection calls. As a first step probesets called "Absent" (A) over all conditions and replicates are removed (5245). Indeed expression values flagged with an A call identify probesets whose PM values are not significantly different from MM control values. This indicates that the corresponding gene is either not expressed or that its expression can not be distinguished from noise. As a second step the 95th percentile of all the signals of the entire dataset that are flagged with an absent call is determined and used as a threshold to remove all the remaining probesets whose expression values are always below this value in each sample (1689). Finally 5554 probesets remain for the next analysis steps.
The figure 5 represents the distribution of all expression values flagged with an "Absent" call (A, red) and of all expression values flagged with a "Present" call (P, blue). The relative counts are the number of expression values within each interval normalized by the total number of P and A expression values. The log of expression values is reported and the dashed vertical line indicates the threshold used for the selection. We use a hierarchical clustering method to verify the quality of replicates and to highlight possible outlier samples which can be eventually excluded. By default, all chips are included for the further steps of the pipeline. The resulting dendrogram (figure 6) can be interpreted similarly to a phylogenetic tree. The vertical scale indicates 1 -pearson correlation coefficients as a measure of similarity. Replicates should cluster closer to each other than chips from different experimental conditions.
In addition, the resulting tree structure provides an unbiased overview on the relationship between the different experimental conditions. For example if three experimental conditions are investigated and all samples of exp.cond. 1 are closer to all samples of exp. cond. 2 than to exp. cond. 3, this means that the transcriptome of the cells is more similar between the first two conditions in comparison to the third one. For this report the comparison of 2 timeCourses experimental design has been selected (figure 7). Once the experimental design has been chosen, the choice of the method depends on the number of available replicates. Limma was used for the selection of DEG.
Limma is an acronym for Linear Models for Microarray Data (part of the following description is drew from the limma vignette). It is a Bioconductor library developed by Gordon Smyth et al (based on Gordon K. Smyth, Stat Appl Genet Mol Biol 2004). The use of this method is based on the fitting of a linear model to estimate the variability in the data. In case of one-channel microarray data (like Affymetrix) this approach is the same as analysis of variance except that a model is fitted for every gene.
For the detection of the differential expression an empirical Bayes method is used to moderate the standard errors. Indeed the use of moderated statistics for the detection of differential expression is very useful expecially in case of experiments with small number of replicates. Limma allows for the computation of moderated t-statistics for any required comparison across the conditions of the dataset, taking into account all the experimental designs previously reported in this report. In particular, among the methods available in AMDA, limma is the only method that can deals with experimental designs 3 and 4.
In this case all the requested comparisons were performed selecting DEG with a threshold pValue of 0.05.
The figure 8 represents with log scatter plots the DEG identified with the selected method(s). In these graphs the y-axis reports the mean of the replicates (if available) of the considered condition and the x-axis is the mean across the replicates of the reference condition. This reference condition is the baseline in case the commonBaseline experimental design was selected, the previous time-point condition in case the timeCourse experimental design was chosen. Red points are the DEG. Totally 1101 unique probesets have been selected (part of them can be differentially expressed in more than one condition, i.e. this number could not reflect the sum of DEG of the individual panels).
If you selected the option of writing files you can find in the folder DEG the deg.universe html and txt files containing a html and txt table with the probesets that were selected in at least one experimental condition. Moreover, in the same folder, you can find for each experimental condition a .txt tab delimited table reporting the identified DEG. This tables can be viewed using Excel and includes the most commonly used functional annotations, the normalized expression values with the corresponding detection Call, as well as the calculated statistic of differential expression for the considered experimental condition (STN for the PLGEM method, d-statistic for SAM, t-statistic for Limma, log2Ratio for FoldChange). An other similar figure is provided for the second experiment.

13
The figure 10 is an heat map of the universe of DEG. For this figure the expression values of each probesets are divided for the median value and the log2 of this ratio is reported. Red colour indicates samples where the probeset is more expressed, blue colour samples where the expression is smaller. Probesets (rows) and samples (columns) are reordered based on their similarity, as indicated in the respective dendrograms.Since the experimental design 3 or 4 have been selected an additional colorbar for the rows is reported, where probesets differently differentially expressed between the two experiments are highlighted in gold colour. Figure 10: Heat map of DEG universe. Each probeset is differentially expressed in at least one condition. The log2 of the ratio between each value and the median of the row is reported on a red-blue (high-low) color scale.Probesets differently differentially expressed between the two experiments are highlighted in gold colour.

15
In experiments with at least three experimental conditions, clustering of the DEG is performed in order to group genes with similar expression profiles over all conditions. We use PAM-clustering method (Partitioning Around Medoid, Kaufman and Rousseeuw, 1990) to partition the gene expression profiles into k clusters. A fixed number of clusters can be chosen. Alternatively an estimate of the optimal k can be automatically determined, based on quality scores obtained from different trials of clustering.
Experimental conditions where replicates are available are averaged and the logarithm of the ratio between each exp. cond. and the condition chosen as the reference are calculated (log2Ratios). If the commonBaseline experimental design was selected, the reference is the baseline. Alternatively the log2 or the ratio with the median of the row for each probeset is computed.
The PAM clustering is repeatedly applied on the dataset testing various values of k between 2 and a number (6 or 10) that depends on the number of DEG. For each tested value of k a quality score (called "average mean silhouette") representing the averaged ratio between the within clusters similarity and the between-clusters dissimilarity is determined. The optimal number of clusters is generally determined as the one with the highest silhouette score. In case where there are more values of k that have very similar quality scores of silhouette close to the maximum one (0.85*max, this parameter can be set with or without the widget interface), the highest number of clusters is chosen. Anyway clusters with less than 10 probesets are not admitted.
The first panel ( figure 11) represents the silhouette scores as a function of k. Based on this analysis 3 clusters were selected to partition the dataset. The remaining panels visualize the expression profiles of genes belonging to each cluster (log2Ratios). The red line represents the general pattern that characterizes each cluster (medoid). The similarity of the expression profiles of genes within a cluster suggests a similarity in their regulation. Even for poorly annotated genes (e.g. ESTs) a functional hint can be drawn from the occurrence of known genes within the same cluster.
If a functional analysis was requested , the clusters are evaluated for every available source of functional families (i.e. GeneOntology, KEGG or user-provided ones). A list of tables are embedded in the report containing the relevant annotation terms for each cluster and for each annotation source. If the option of writing files was chosen, the list of genes contained in each cluster is saved on the directory "Clusters". Moreover, for each annotation term in the tables a file is written in the directory "Clusters" and in the subdirectory corresponding to the annotation source (e.g. GO, KEGG, USER). The name of the file is composed of the number of cluster and the id of the annotation term. The txt tab delimited files can be read using a spreadsheet program and contain the list of probesets in that cluster matching the selected annotation term. Moreover a useful list of external annotations and DB references is added, and also the normalized values and detection calls are provided. In this tables for each annotation term the number of annotatedprobesets contained in the considered cluster is reported (id on list). the next column reports the number of probesets matching the considered annotation term on the whole array (id on chip). These two numbers give an idea on the significance of the functional enrichment of the cluster list. The last column reports the pValue generated by a statistical test applied with the null hypothesis that this enrichment is random. The more this pValue is low the more the enrichment is statistically significative. Generally only pValue very low are considered interesting in this test, e.g. 1e-5 or lower.

Functional Annotation
A functional annotation of DEG is performed on the basis of a subset of the annotation provided by the Bioconductor project (mgu74av2 1.10.0 from www.bioconductor.org). The annotation resources considered depend on the organism and generally include GeneOntology and KEGG (www.geneontology.org, www.genome.jp/kegg). The analysis of user defined functional families is always strongly suggested. These lists of probesets can be easily added to the analysis starting with the advanced widget interface. More files each one with a column of probesets ids of the right chip are expected. Their analysis should provide information about functional terms known to be (or not to be) involved in the biological process under investigation. The user defined functional families can be some lists of genes that can not be easily generated with the common annotation databases (e.g. GeneOntology and Kegg are under continuous update..), or can even be list of genes taken from a review or a table of an interesting paper, that you want to evaluate in your experiment.
On the other side KEGG and GO terms could provide new unexpected functional hints since they are evaluated without a priori selection.
The most representative functional annotations for DEG from each experimental condition are identified by determining the probability of random occurrence of functional terms (functional enrichment). Based on this probability ranking only the top 40 statistically most significant annotation terms are reported, and the most interesting are red highlighted. Each figure reports the selected annotation terms for the considered experimental condition. For each annotation term the following information is provided: annotation term ID, annotation term description, number of DEG in that condition annotated with the term, number of probesets on the whole chip annotated with the term, pValue of functional enrichment. The modulation of genes for the selected annotation terms is visualized for each experimental condition and the x-axis of each figure. To perform a statistical test not biased by the redundancy of the probesets (more probesets for the same gene), the computation of pValue of functional enrichment is based on entrezGene IDs. Since more probesets could match to a single entrezGene ID, the number of DEG annotated with the term can be less than the number of probesets visualized (the blue circles).
If at least two experimental conditions are provided a functional summary is provided reporting the enrichment pvalues (log10 of pValues) for the different annotation terms across the different conditions. This plot can be useful to compare the functional characterization of DEG in the different conditions. In particular annotation terms specific for a subset of conditions could be identified, as well as annotation terms that are equally relevant for all the conditions. Note that the colours of this graph reflect only the enrichment pvalues (highly significative is red), they are not representative of the direction of the modulation (up, down-modulated). Therefore genes with annotation terms equally significant in more conditions could be differently modulated (this can be evaluated with the figures specific for each condition).
If the write files option was chosen, for each annotation term in each figure, the considered DEG are reported on a tab delimited .txt file structured as described in the section reporting the selection of DEG.

KEGG annotation terms
Kyoto Encyclopedia of Genes and Genomes (KEGG) provides the reference collection of metabolic pathways. In the currently used database built of the Bioconductor annotation package about hundred pathways are considered (KEGG library, version 1.10.0). Only pathways represented with at least 2 DEG are ranked according to their relevance. Based on this ranking only the first 40 (or less whenever this number is not reached) are reported, for plotting reason. Figures from 12 to 16. If at least two experimental conditions are provided a functional summary is provided (17) reporting the enrichment pvalues (log10 of pValues) for the different annotation terms and their similarity.

GeneOntology annotation terms
GeneOntology (GO) provides three structured, controlled vocabularies (ontologies) that describe gene products in terms of their associated biological processes, cellular components and molecular functions in a speciesindependent manner. Given the GO data structure it is possible that different GO terms functionally characterize the same or nearly the same set of probesets. This is not an artifact and it is due to the fact that the selected probeset list map on the GO tree with terms that are close to each other (i.e. parents or children). Anyway an attempt of reducing this redundancy is made using the GOselection function provided in the AMDA library. This reduction of redundancy can be controlled starting AMDA via the advanced widgets interface and can be also excluded. Figures from 18 to 22. If at least two experimental conditions are given a functional summary is provided (23) reporting the enrichment pvalues (log10 of pValues) for the different annotation terms and their resulting similarity on both direction of the heatmap.
In the currently used database built of Bioconductor annotation package (GO library, version 1.10.0) several thousands of GO terms are considered. Only GO terms represented with at least 3 DEG are ranked according to their relevance. Based on this ranking only the first 40 (or less whenever this number is not reached) are reported for plotting reasons.

User-provided annotation terms
User defined functional families are used to functionally characterize DEG of each experimental condition. This evaluation should provide to the user the information about the involvement of a priori expected functional families. Functional families are not filtered but only ranked based on their statistical relevance. Figures (from 24 to 28) are provided only for those functional families for which at least one DEG was found in the considered experimental condition. The observation could either confirm the expected involvement or provide unexpected insights (for example a gene family could be up-regulated when down-regulation was expected). If at least two experimental conditions are provided a functional summary is provided (29) reporting the enrichment pvalues (log10 of pValue) for the different annotation terms and their similarity.

Correspondence analysis
If a sufficient number of experimental conditions is provided (at least 3 conditions) a correspondence analysis is performed. This multivariate analysis technique aims to reveal relationships between genes and samples (Fellenberg et al., PNAS 2001).
Genes could be plotted in a n-multidimensional space (n the number of conditions or arrays), as well as contions or arrays could be plotted in a m-multidimensional space (m the number of genes). These multidimesional spaces can't obviously be graphically represented. Roughly, this technique allows for the reduction of the dimensionality of these spaces. In this way samples and genes can be represented in one bi-dimensional space.
The figure 30 represents the scores of arrays and DEG in the space of the two first "principal components". These new dimensions are the two directions in the original multidimensional space with the highest variance, and the scores can be imagined as the coordinates on these axis. Closeness of array labels in this space denotes similarity (like the hierarchical clustering of arrays). Each number represents an individual probeset detected as DEG in at least one experimental condition. Numbering follows the same ranking as in the DEG.universe html or txt file. Interesting genes can be identified also using the accessory function geneFilterCA provided in the AMDA library. The numbers close to the group of the interesting array labels are likely to be the most important in determining the similarity between those arrays.
It is possible to select the more representative genes that determine the specific features of different conditions. The annotation derived from these characteristic genes could give insights in to specific functional behavior of the corresponding samples.