CARMA: A platform for analyzing microarray datasets that incorporate replicate measures
© Greer et al. 2006
Received: 29 April 2005
Accepted: 17 March 2006
Published: 17 March 2006
Skip to main content
© Greer et al. 2006
Received: 29 April 2005
Accepted: 17 March 2006
Published: 17 March 2006
The incorporation of statistical models that account for experimental variability provides a necessary framework for the interpretation of microarray data. A robust experimental design coupled with an analysis of variance (ANOVA) incorporating a model that accounts for known sources of experimental variability can significantly improve the determination of differences in gene expression and estimations of their significance.
To realize the full benefits of performing analysis of variance on microarray data we have developed CARMA, a microarray analysis platform that reads data files generated by most microarray image processing software packages, performs ANOVA using a user-defined linear model, and produces easily interpretable graphical and numeric results. No pre-processing of the data is required and user-specified parameters control most aspects of the analysis including statistical significance criterion. The software also performs location and intensity dependent lowess normalization, automatic outlier detection and removal, and accommodates missing data.
CARMA provides a clear quantitative and statistical characterization of each measured gene that can be used to assess marginally acceptable measures and improve confidence in the interpretation of microarray results. Overall, applying CARMA to microarray datasets incorporating repeated measures effectively reduces the number of gene incorrectly identified as differentially expressed and results in a more robust and reliable analysis.
High-density microarrays[1, 2], in combination with high-throughput sequencing efforts[3, 4], are proving invaluable in the investigation of complex systems. Researchers, however, struggle with the challenges associated with analyzing and interpreting the enormous amounts of data generated by microarray experiments. While initial analysis techniques focused on individual hybridizations and selected differentially expressed genes based on the ratio of the measured levels of gene expression between two samples, the results of any single microarray hybridization is subject to substantial variability, and consequently are unreliable. Therefore, as with any biological assay, robust microarray experiments rely on replication.
Kerr et al.  first described the use of analysis of variance (ANOVA) in combination with optimal experimental designs incorporating replicate measures, for microarrays. This technique uses a basic additive linear model to account for known sources of variability, thereby improving estimates of differences in gene expression and providing a measure of confidence for those estimates. Others have expanded these techniques to incorporate mixed models, multi-factorial designs, and biological replication strategies. In addition, advancements in data transformation [12–14] and normalization[13, 15, 16] have improved the quality of the data to which subsequent analyses are applied.
In an effort to apply many of these powerful statistical techniques to our microarray datasets, we have developed an analysis platform with supporting software named CARMA (Computational Analysis of Replicate Measures for Arrays). In addition to performing ANOVA on microarray datasets that incorporate replication, CARMA also performs all of the necessary steps for calculating differential expression in a microarray data set including importing, transforming, and normalizing the raw data files. The analysis is designed to be easy to apply, require only a basic understanding of the underlying principles, and provides output in both an easy to interpret graphical format and a delimited text file. Each step in the process was chosen for its broad applicability to microarray data, with user-defined parameters tailoring the analysis to each experiment. To demonstrate the utility of our approach, we present an example analysis of a microarray experiment designed to characterize gene expression in an aquaporin-1 knockout mouse model.
CARMA was implemented using the R programming language and environment because of its "wide variety of statistical (linear and nonlinear modeling, classical statistical tests, time-series analysis, classification, clustering, etc) and graphical techniques". In addition R is available at no cost and runs on a variety of computing platforms. Under CARMA, analysis begins with data files and user-defined parameters being read from delimited files. Normalization between the two channels of each array and between both channels of all arrays is achieved using the loess function, which was chosen because of its ability to perform simultaneous location and intensity dependent lowess normalization. ANOVA is implemented using the aov function utilizing partitioned error for replicates, or the lme function for more complicated models, including mixed models. Graphical output and delimited text files are generated to present the results of the normalization, analysis of variance, and the ANOVA contrasts. CARMA is available as an R package for Microsoft Windows [Additional file 2], and as an R source file [Additional file 3]. A user manual is also available [Additional file 1].
The microarray dataset used for this manuscript contains measurements of gene expression in kidneys from adult aquaporin-1 knockout and wild type mice (GEO Accession GSE2402) [additional files 4, 5, 6, 7, 8]. In brief, RNA from kidney medullae of three aquaporin-1 knockout mice and three wild type mice was reverse-transcribed incorporating an amino modified dUTP, labeled using Alexa Fluor 546 and 647 ester dyes, and hybridized to a custom microarray containing the NIA Mouse 15 K cDNA clone set with each clone printed in duplicate. Slides were scanned using an Applied Precision arrayWoRx Biochip Reader and image analysis was accomplished using softWoRx Tracker software. Because this experiment is balanced for both mouse (each mouse sample is hybridized twice with each dye) and genotype (3 mice of each genotype were used), it was possible to perform two different analyses; one to determine differences in gene expression between mice and the other to determine differences in gene expression between the aquaporin-1 knockout and wild type groups. The results of the comparison between the aquaporin-1 knockout and wild type groups have been reported previously , therefore this paper presents the comparison of gene expression between individual mice in order to demonstrate the functionality of CARMA.
In the simplest sense, microarray data is the measured intensities, at defined wavelengths, of elements (or "spots"), which have been arrayed on a glass slide. Included in these measurements are numerous sources of variability. Given that a two-channel microarray experiment consists of multiple samples labeled with two dyes hybridized to multiple arrays containing multiple spots (that represent genes), there are four main sources of variability termed: Array, Dye, Gene, and Variety. The Array term refers to the variability in the measured intensities associated with a specific hybridization, due to either variability between slides, variability between hybridizations (e.g. differences in the amount of cDNA used in each hybridization), or both. The Dye term refers to variability caused by differences in fluorochrome chemistry, coefficient of extinction, incorporation efficiency, photobleachability, scanner sensitivity, etc. The Gene term refers to each element (or replicate elements) on the microarray. Since each element will hybridize to its complementary sequence in each sample, the intensity of each spot will depend on its nucleotide sequence, which is considered unique. The Variety term refers to the distinguishing feature of interest (such as time, treatment, or dosage) in the experimental samples. The goal of most microarray experiments is to determine the effect of this Variety term on the measured intensity for each element. In other words, how does dosage (or time, treatment, genotype, etc.) affect the expression of each gene?
Based on the four main effects (Array, Dye, Gene, Variety) and allowing for all interactions between those effects, there are 16 possible terms that could be included in the model, however many of the interaction effects do not make practical sense. In addition, performing a log-based transformation (see later sections) on the microarray dataset before applying the linear model allows the use of an additive linear model. Equation 1.1 describes the collection of mathematical equations that can be used to calculate values for each of the known factors that contribute to the transformed measured intensities:
Equation 1.1 defines a relatively complete model for the sources of variability in a microarray experiment; however there are practical limitations to its implementation. Utilizing a least squares approach to solve equation 1.1, even for reasonably small microarray datasets (e.g. 4 hybridization of a 10,000 element array), requires gigabytes of memory, precluding the use of a personal computer. In addition, it assumes equal variance between all genes. Splitting equation 1.1 into two equations, one containing all Gene independent terms, and another equation containing all gene-dependent terms, yields equations 1.2 and 1.3 respectively:
Applying these two models sequentially to a dataset significantly reduces the memory requirements and time required for computation, and allows for gene specific variances. When using these two models, equation 1.2 effectively serves to perform a linear global normalization between the two channels for each array and between arrays. Instead, because only the Gene × Variety effect is usually of interest, and in order to allow for non-linear global normalization, CARMA performs a lowess normalization between the two channels of each array and between both channels of all arrays, followed by a gene-by-gene ANOVA implementing equation 1.3 alone. Limiting equation 1.3 to the subset of data for each gene yields equation 1.4:
Implementing this form of the equation reduces computation times to a few hours on a basic personal computer (500 MHz Pentium III with 512 MB of memory) for even relatively large data sets (e.g. an experiment involving 32,000 elements and 12 hybridization).
Most microarray scanners generate 16 bit numbers, resulting in measurements ranging from 0 to 65,535 for each pixel in the generated images. Microarray image processing software is then used to analyze each image, generating a multitude of measures for each spot, which are usually further processed to produce one measure of intensity for each channel for each spot. The most controversial part of this process is whether to subtract some measure of background (defined as the pixels around the areas delineated as spots) from each spot (defined as the pixels within each area delineated as a spot). Background subtraction is often used to reduce the negative impact of local image artifacts, and provide better estimates of gene expression by subtracting fluorescent signal that is unrelated to the fluorescently labeled hybridized target cDNA. Nevertheless, some researchers have concluded that background subtraction increases variance and degrades the performance of subsequent analyses. In most cases, however, this increase in variance is not simply due to background subtraction, but the combination of background subtraction and a log-based transformation of the data. In effect, subtracting the background intensities from the spot intensities reduces the values for low intensity measurements to the point that error is no longer multiplicative, but additive, causing a log-based transformation to inflate the variance of these small values. In other words, the measurement error associated with small values is not proportional to the true value, but is a combination of some random error added to the true value. For instance, a log base 2 transformation assigns the same significance to the difference between 2 and 32 (log2(32) - log2(2) = 4) as the difference between 2000 and 32000 (log2(32000) - log2(2000) = 4), even though a difference of 32 is well within the noise of current microarray scanning technology and of no real significance.
The logarithm base 2 of the expression ratio is the most widely used transformation for microarray data due to its continuous range of values and similar treatment of up- and down-regulation. It has the added effects of improving linearity and variance homogeneity, and converting multiplicative error into additive error, thus allowing the application of a linear additive model. While log transformations work well for moderate to high signal intensities, they have the undesirable effect of magnifying small differences in intensity at low signal intensities. In addition, log transformations cannot be applied to negative numbers, which can result from background subtraction, and they transform numbers between 0 and 1 into negative values. To address these problems some researchers have proposed discarding data below a threshold, while others have proposed variance-stabilizing transformations [12–14]. CARMA implements a version of the linlog transformation that has been adapted to better cope with large negative numbers (where the local background signal intensity is significantly higher than the spot signal intensity) as follows.
where Zik represents the transformed intensities, Yik represents the untransformed intensities, and di represents the threshold between the log and linear portions of the transformation. The subscripts i and k denote the array and element (spot) for each intensity, respectively. This modified linlog transformation is symmetrical around 0 (raw signal) and both it and its first derivative are continuous. It is similar to a log2 transformation for both large positive and large negative intensities, and a linear transformation at low intensities (positive or negative). For the example analysis in this manuscript, the crossover point (d i ) between the linear and log portions of the transformation was calculated based on the median of one standard deviation of the local background of all spots. This method of calculating the cutoff for the linlog function was based on the observation that the variability in the measured intensities at background levels provides a good indication of the minimum intensity that the scanner can accurately measure (the point at which error becomes multiplicative). This approach has worked well in practice, and has the advantage of not assuming any distribution for the data. Recently CARMA has been enhanced to include the capability of calculating the linlog crossover point based on minimizing the absolute deviation of the inner quartile range (IQR) for each bin from the median IQR, for 20 bins spanning the range of intensities for each hybridization.
Comparison of background and transformation techniques
Number of Genes
Background Subtracted Linlog Transformed
Background Subtracted Log2 Transformed
Non-Background Subtracted Log2 Transformed
Non-Background Subtracted Linlog Transformed
Differences in fluorochrome characteristics, scanner wavelength sensitivities and settings, dye incorporation, and other non-biological effects all contribute to differences in the intensities between the two channels of any hybridization. And while global normalization techniques apply the same adjustment to every spot for a hybridization, it is often necessary to correct for intensity and location specific effects[13, 15, 27–29]. CARMA implements a locally weighted regression (lowess)[15, 16] transformation that can adjust for either intensity or location (or both) dependent effects. In addition, CARMA normalizes between both channels of all arrays, serving to minimize the Array term in the linear model and aiding in the visualization of the normalized data.
Microarray datasets usually contain thousands of elements, each representing one gene, but only a few measurements for each element. Whereas performing a global ANOVA on the entire dataset assumes equal variance within the data for each gene, it is generally accepted that independent analysis of the subset of data for each gene does not utilize sufficient data to determine an adequate representation of the variance associated with each gene. On the other hand, the major consequence of performing a gene-by-gene ANOVA is the overestimation of the significance of the calculated differences in expression for genes with abnormally small variances, and the underestimation of the significance of the calculated differences in expression for genes with abnormally large variances. Researchers have addressed these issues by including information from all genes on the array when assessing the significance of differences in expression for each gene [30–34]. CARMA calculates four variances and associated p-values for each gene: gene specific variance, pooled variance (average for all genes), half of the gene and half of the pooled variance, and an estimator based on the James-Stein-Lindley shrinkage concept that uses a formula to calculate the variance based on both the gene and pooled variances.
Given that the samples hybridized to most large microarrays will contain transcripts for only a subset of the genes represented on the microarray, removing spots that exhibit low intensities for all samples can both reduce the number of genes incorrectly identified as differentially expressed between samples and decrease the computing time required for subsequent analyses. As implemented in CARMA, the ANOVA is usually (based on typical user settings) performed on only those genes that have background subtracted intensities greater than 1 or 2 background standard deviations for at least 51% of the measurements for at least one sample in the hybridization scheme. Furthermore, in the case of microarrays with replicate elements, usually at least 51% of the replicates must have background subtracted intensities greater than 1 or 2 background standard deviations in order for any of the replicates to be included in the ANOVA. In other words, the ANOVA is only performed on genes that are consistently expressed at measurable levels in at least one of the samples.
CARMA also has the capability to remove anomalous measurements through outlier detection. Dust, impurities, surface inhomogeneities, fluorochrome-specific effects, local hybridization effects, technician effects, etc. all contribute to non-systematic variability in the measured intensities. Anomalous measurements can be identified by their incongruity with other measurements within a hybridization and through inconsistencies between replicated measures[22, 27]. Following the ANOVA of the normalized data for a gene, CARMA applies the following formula to identify outliers:
rti> |quantile((OP/2)/n, df -1)|
Most microarray hybridizations will have regions that are obviously problematic. While it may not be worthwhile to flag small abnormalities such as fine dust particles, large abnormalities should be flagged for exclusion from analysis. Missing data, whether flagged manually, filtered out, removed through outlier detection, or unavailable because of a failed hybridization may result in some experimental imbalance. This imbalance, however, is less problematic than including erroneous data. Therefore, CARMA utilizes functions that can accommodate missing data, to apply the linear model and perform the ANOVA.
Any analysis is only as good as its ability to portray relevant information to the researcher. In addition to generating tab delimited output files, CARMA creates an Adobe Portable Document Format (pdf) file containing easy to interpret graphical output (Figure 4) including an estimate, and its standard error, for each level (possible value) of the Variety effect, as well as plots of the normalized data and other statistical quality control information. This graphical output allows the researcher to refine and prioritize the list of differentially expressed genes by helping to identify cases of non-normality, outliers, unexpected patterns, etc. Visualizing the normalized data and the results of the ANOVA also helps to identify and correct cases of mislabeled samples and misaligned grids. All of the numbers used and generated by the ANOVA, and a file containing the contrasts between all pair-wise combinations of the levels of the Variety effect are also provided in delimited text files.
Microarray experiments are intended to determine relative differences in gene expression between various treatments or conditions. As with nearly all experiments, and exacerbated by the number of measures obtained from microarray hybridizations, experimental noise can confound measurements and lead to the incorrect identification of random variations as significant differences in gene expression. We have employed a generalized approach and developed supporting software (CARMA) for performing ANOVA on microarray datasets that is easy to implement, generates readily interpretable graphical results, and accounts for experimental sources of variability. Applying ANOVA to microarray datasets incorporating replicated measures improves estimates of differences in gene expression between samples and provides a statistical basis for determining the significance of these differences. Also, because each sample is involved in multiple hybridizations, it is possible to identify and remove incongruous data points that are caused by dust particles, local background, etc, as well as allow for missing data caused by occasional failed or abnormal hybridizations. CARMA provides a clear quantitative and statistical characterization of each measured element on the microarray that can be used to assess marginally acceptable measures and improve confidence in the interpretation of microarray results. Overall, applying CARMA to microarray datasets incorporating replicated measures effectively reduces the number of gene incorrectly identified as differentially expressed and results in a more robust and reliable analysis.
In some situations it is advantageous to study the effects of more than one Variety on gene expression. For example, in gene knockout studies both genotype and treatment dependent effects may be important. Also, in cases where subjects are exposed to multiple treatments, accounting for individual-specific effects may expose differences in gene expression due to treatment that might otherwise be obscured. In these cases equation 1.4 can be modified to include terms for each of the varieties of interest. For example, the following model could be used to examine the effects of genotype (V1), drug treatment (V2), and the interaction between genotype and drug treatment (V1V2):
Theoretically equation 1.5 could be expanded to include any number of varieties, however because of an exponential increase in the number of hybridization that must be performed most researchers limit their experiments to studying a maximum of two varieties.
Implementation of CARMA depends on a few assumptions. First, the lowess normalization assumes that the expression levels for the majority of genes in each sample are the same. While this is usually the case, even under conditions where the expression levels of the majority of genes are affected, lowess normalization produces the desired result of adjusting the dataset such that genes that behave dissimilarly from the majority are more likely to be identified as differentially expressed. Second, when determining the significance of differences in gene expression between samples, it is assumed that the data is normally distributed after the application of the linear model. We make this assumption in order to utilize the readily available R packages that perform ANOVA and their superior computational efficiency over non-parametric approaches. Lastly, in order to apply the ANOVA to each gene individually, it is assumed that after transformation and normalization each gene is independent of the other genes on each array. Theoretically, because each element (gene) is on every array there is not complete independence between genes, however we chose to implement the ANOVA on each gene independently in order to allow genes to have dissimilar variances, and to efficiently implement the ANOVA.
Background subtraction has been criticized for inflating the variance of microarray datasets, but this increased variability is almost always limited to elements that exhibit low intensity measures. This increased variability is not simply due to the smaller numbers, but rather the methods by which these numbers are measured and transformed. Both photomultiplier tubes (PMT) and charge-coupled devices (CCD) cannot accurately distinguish small differences in intensity. Therefore the assumption that error is multiplicative for these values is inaccurate and thus the application of a basic log-based transformation is inappropriate. The commonly seen flaring of the data at lower intensities, in either a simple green channel vs. red channel or a ratio-intensity plot, is not usually an indication of the inappropriate application of background subtraction, but rather an inappropriate transformation. In addition, not removing the relatively large uniform background associated with some datasets, such as those generated by CCD based scanners, can completely obscure the magnitude of any differences in gene expression. As illustrated by this dataset, different approaches to background subtraction and transformation can have a significant effect on the identification of differentially expressed genes. In fact, at most only 25 % the top 100 genes were shared between the background subtracted and non-background subtracted results.
Many statisticians are hesitant to remove outliers based solely on statistical criteria. In the case of microarray datasets, however, we know that there are invalid measurements in every dataset; yet it is often impractical to manually flag each instance. CARMA not only automatically detects and removes outliers, but also provides supporting graphs to assist in the final determination of the validity of the measurements that were removed. For example, in Figure 4, panels B and C, the green 'x' that indicates one of the excluded outliers is clearly separated from the rest of the measures (the corresponding red measurement is also excluded). Of course the researcher can also investigate each spot on the images from which the measurements were derived to further authenticate measurements flagged as outliers, or turn off outlier detection altogether if so desired. In addition to removing sporadic anomalous measurements, outlier detection has proven invaluable for detecting mislabeled samples and misaligned grids. In these cases, without outlier detection, all or a part of a microarray dataset is often labeled as too variable and of little value. A quick inspection of CARMA's graphical output reveals these problems as a series of genes for which the same hybridization's values were dropped as outliers, providing not only an indication that there is a problem, but also the exact hybridization that is the cause of the problem.
Because of the balanced design of this experiment we were able to assess the affects of inter-individual variability in the aquaporin-1 knockout vs. wildtype microarray dataset. The fact that 3% (129 out of 4361) of the confidently measured genes were identified as differentially expressed between mice of the same genotype highlights the significant amount of variability in gene expression between even genetically similar mice. This finding corroborates the results of an earlier study, underscoring the need for including biological replicates in any study, especially those addressing gene expression and molecular activities.
CARMA was designed to be an integrated, easy to use, analysis platform that researchers can apply to their microarray datasets without preprocessing their data or writing any computer code, with an emphasis on providing results in an easily interpretable format. In particular, we have attempted to identify and address issues such as the relatively high background and scanning related photobleaching that can occur with CCD based imaging systems, and include means to address many of the real-world problems associated with microarray experiments. There are already a number of existing microarray analysis tools that prove useful in analyzing microarray datasets [34, 38, 39]. CARMA implements many of the same statistical and analytical processes employed in these packages and includes the following additional features:
Ability to read data files generated by most microarray image processing software
No need to preprocess or combine raw data files
Automatic exclusion of genes with low confidence measures for all samples
Modified linlog transformation that better handles large negative numbers (that can result from background subtraction)
Automatic computation of linlog crossover point
Simultaneous intensity and location lowess normalization
Ability to process incomplete datasets for fixed effect models
Automatic outlier detection and removal
Detailed graphical output for each gene
Ability to detect and identify misaligned grids or mislabeled samples
The relative expression values generated by CARMA (i.e. the Variety value for each gene relative to a reference sample), can be further processed by commercial and freeware software packages designed to organize, cluster and display microarray data. In this regard, the relative expression values are analogous to the more traditional ratio-metric measures in that both provide a relative difference in expression between samples. Of course more advanced users can modify or extract portions of CARMA to integrate into their own analyses if so desired. Future development of CARMA may include developing a graphical user interface, improving the implementation of mixed models, adding new methods of normalization, implementing bootstrapping to calculate significance, and adapting CARMA to function as a Bioconductor package.
Project name: CARMA
Project home page: http://www.u.arizona.edu/~jhoying/carma.html
Operating system(s): Platform independent, only tested on Microsoft Windows 2000/XP
Programming language: R
Other requirements: R version 1.9 or higher
License: GNU General Public License
Restrictions to use by non-academics: Permission from University of Arizona required
This research was supported in part by a University of Arizona Foundation Award (HLB), the Diamond Family Foundation, PANDA (People Acting Now Discover Answers) (JBH), University of Arizona BIO5 Graduate Research Award, and NIH awards #DK064706 (HLB) and #HL67067 (JBH).
We thank Dave Henderson for valuable discussions during the preparation of this manuscript.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.