For all "caCORRECT" results in this manuscript, version 2.1 of caCORRECT was used, as provided at http://cacorrect.bme.gatech.edu. The canonical Bioconductor R implementations of Harshlighting, RMA, MAS5.0, PLIER, and affycomp were also used. What follows in this section is a brief review of the previously published caCORRECT procedures pertinent to this study, as well as a description of the improvements made since the previous publication [12]. Much of the following text has been reproduced, with modification, from RAM's doctoral dissertation [23].
Variance scoring
The cornerstone of caCORRECT's outlier detection is the concept of variance scoring, which describes each probe's tendency to be an outlier. Calculation of this score, h, is similar to conducting a t-test for whether or not the observed probe intensity for a given chip belongs to the observed distribution of intensities for that probe across all other chips in the dataset. A key feature of caCORRECT is that this distribution is estimated and updated multiple times during the course of a single caCORRECT session. Because of this dynamic updating, it is possible to identify subtle artifacts, or to pardon false artifacts that may have been misdiagnosed initially. Please refer to additional file 1, "Supplemental Methods", for a more detailed description of how the variance score is calculated.
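As a rough illustration of a t-test-like probe score, the sketch below compares each probe's intensity on one chip against the leave-one-out distribution of the same probe on all other chips. This is a hypothetical sketch with names and the exact statistic chosen by us; the published variance score is defined in the supplemental methods.

```python
import numpy as np

def variance_score(intensities):
    """Leave-one-out t-like score h for each probe on each chip.

    `intensities` is an (n_chips, n_probes) array of log-scale probe
    intensities. For each chip, a probe's score compares its value against
    the mean and standard deviation of the same probe on all *other* chips.
    Illustrative only; not caCORRECT's published statistic.
    """
    n, _ = intensities.shape
    total = intensities.sum(axis=0)
    sq_total = (intensities ** 2).sum(axis=0)
    # Leave-one-out mean and variance for each (chip, probe) pair.
    loo_mean = (total - intensities) / (n - 1)
    loo_var = (sq_total - intensities ** 2) / (n - 1) - loo_mean ** 2
    loo_sd = np.sqrt(np.maximum(loo_var, 1e-12))
    return np.abs(intensities - loo_mean) / loo_sd
```

A probe whose intensity sits far outside the distribution of its counterparts on the other chips receives a large h; ordinary biological variation yields scores near one.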
Artifact segmentation
Once the variance statistic, h, has been calculated for each probe on each chip, false-color heat maps of h, showing probes in their original spatial layout, are generated to display regions of high noise. For a good quality microarray chip, h will represent biological variation in RNA expression for the sample. In this case, h will be distributed independently and nearly uniformly in magnitude throughout the chip. More commonly, however, protocols do not achieve uniform hybridization due to uneven drying, formation of salt streaks, scratching of the microarray surface due to contact with skin or dust, miscalculated hybridization times, or failure to control environmental variables such as ozone [24]. All of these most common mistakes result in localized regions of large h (artifacts) on the heat map.
caCORRECT uses a simple sliding window method to flag probes that meet two conditions: (1) they exist in regions of other high h-scoring probes, and (2) they have high h scores themselves. These two conditions ensure that most of the obvious artifacts are caught, but that most of the naturally occurring biological variance goes unnoticed. Because the intended platform for caCORRECT is a web-based grid service, artifact identification has been streamlined for speed and memory efficiency. More computationally intense methods such as active contours, PDE-based methods, or shape matching have been excluded in favor of a quick marching-window algorithm that works well for a wide range of data. To remove any global chip effects that arise from sample preparation or amplification, normalization is performed as described in the following section.
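The double-threshold idea can be illustrated as below: a probe is flagged only if its own score is high and the mean score in its surrounding window is also high. The window size and thresholds here are illustrative choices, not caCORRECT's published values.

```python
import numpy as np

def flag_artifacts(h, probe_thresh=3.0, region_thresh=2.0, window=5):
    """Flag probes that (1) sit in a neighborhood of high-scoring probes
    and (2) score high themselves.

    `h` is a 2-D array of variance scores laid out as on the chip.
    Thresholds and window size are illustrative assumptions.
    """
    pad = window // 2
    padded = np.pad(h, pad, mode="edge")
    # Mean score in a (window x window) neighborhood around each probe.
    neighborhood = np.zeros_like(h)
    for i in range(h.shape[0]):
        for j in range(h.shape[1]):
            neighborhood[i, j] = padded[i:i + window, j:j + window].mean()
    return (h > probe_thresh) & (neighborhood > region_thresh)
```

An isolated high-scoring probe (likely real biological variation) fails the neighborhood test and is left alone, while a contiguous blob of high scores is flagged.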
Artifact-aware normalization
Quantile normalization reduces noise in microarray experiment replicates by forcing the intensity distribution of each chip to be identical [Bolstad B: Probe level quantile normalization of high density oligonucleotide array data, 2001]. The critical assumption behind quantile normalization is that for large genome-wide studies such as microarrays, the genes that are invariant to the experimental variables far outnumber the biomarker genes that respond to or predict experimental variables. Quantile normalization is generally well suited to the microarray problem, where the distributions are poorly defined and parametric methods such as median centering or Z-score normalization have their underlying assumptions violated. The power of quantile normalization comes with a major caveat: if the probe intensities of the chips are not distributed similarly, the algorithm will indiscriminately warp all the distributions to be the same, including any that may have been correct initially. Fortunately, it is reasonable to assume that high-quality microarray data from a single source on a single platform follow the same distribution. Unfortunately, this high-quality assumption is not valid for much real-world data, where chip artifacts can significantly alter the distribution of intensities on a chip. One bad chip can warp the others when quantile normalization is performed, thus compromising the reproducibility of the entire data set. A way to alleviate this problem is to identify artifacts before quantile normalization and set them aside temporarily. In theory, perfect knowledge of artifacts would allow for perfect correction. This process is called "artifact-aware quantile normalization." caCORRECT uses four iterations of artifact-aware normalization and artifact identification in order to achieve a near steady-state normalization result with a relatively small amount of computation time.
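The setting-aside step can be sketched as follows: flagged entries are excluded when building the reference distribution and left untouched during the remapping. This is a minimal illustration under our own assumptions (quantile curves via linear interpolation), not caCORRECT's implementation.

```python
import numpy as np

def quantile_normalize_aware(x, artifact_mask):
    """Artifact-aware quantile normalization (illustrative sketch).

    `x` is (n_chips, n_probes); entries with artifact_mask True are set
    aside: the reference distribution is built from the clean values only,
    and flagged entries are left untouched for later imputation.
    """
    n_chips, n_probes = x.shape
    grid = np.linspace(0, 1, n_probes)
    # Reference distribution: mean of per-chip clean quantile curves.
    curves = []
    for j in range(n_chips):
        clean = np.sort(x[j][~artifact_mask[j]])
        q = np.linspace(0, 1, clean.size)
        curves.append(np.interp(grid, q, clean))
    reference = np.mean(curves, axis=0)
    # Map each chip's clean values onto the reference by rank.
    out = x.copy()
    for j in range(n_chips):
        clean_idx = np.flatnonzero(~artifact_mask[j])
        order = np.argsort(x[j, clean_idx])
        ranks = np.linspace(0, 1, clean_idx.size)
        out[j, clean_idx[order]] = np.interp(ranks, grid, reference)
    return out
```

With an empty mask this reduces to ordinary quantile normalization: every chip ends up with an identical sorted intensity distribution.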
To illustrate the invasive effect that artifacts can have on a dataset when quantile normalization is performed, synthetic microarray data were generated in the following manner: six high-quality chips from the Schuetz et al. dataset [25] were chosen, one of which was set aside to receive artifacts. One third of the selected chip was modified by a multiplicative factor of 0.5, representing a low-intensity artifact. A different third of the selected chip was modified by a multiplicative factor of 10, representing a high-intensity artifact. These six chips were then processed using caCORRECT, and the probe intensities were monitored for warping at each intermediate step.
Artifact replacement and probe intensity model
Once artifacts have been identified, a clean dataset is generated with the artifactual data appropriately replaced. The current version of caCORRECT uses a data imputation model that is mathematically similar to the model used by RMA and others [5, 9, 10]. Notably, so-called perfect-match (PM) and mismatch (MM) probes are treated identically, i.e. we do not use PM-MM subtraction. In this scheme, the artifact-flagged probe-level data are replaced with the best-fit estimate for that probe, given the model and the data from non-artifactual probes on all of the chips being processed.
Observed microarray intensity values after global normalization are modeled as a multiplicative combination of target RNA abundance (gene expression) and a probe-specific effect (probe affinity), plus additive error. The model is given as:
{x}_{b,p,j}={\theta}_{p,j}{a}_{b,p}+{\epsilon}_{b,p,j},

where x_{b,p,j} is the observed intensity for the b^{th} probe in the p^{th} probe set on the j^{th} chip, θ_{p,j} is the gene expression term, a_{b,p} is the probe affinity term, and ε_{b,p,j} is the additive error term.
The set of equations for the p^{th} probe set using an additive model of error can be represented in the following matrix form, given a set of N chips and B_{p} probes in the p^{th} probe set.
{X}_{p}={\theta}_{p}{a}_{p}+{\epsilon}_{p},
where
{X}_{p}=\left[\begin{array}{ccc} {x}_{1,p,1} & \cdots & {x}_{{B}_{p},p,1} \\ \vdots & \ddots & \vdots \\ {x}_{1,p,N} & \cdots & {x}_{{B}_{p},p,N} \end{array}\right],

{\theta}_{p}=\left[\begin{array}{c} {\theta}_{p,1} \\ \vdots \\ {\theta}_{p,N} \end{array}\right],

{a}_{p}=\left[\begin{array}{ccc} {a}_{1,p} & \cdots & {a}_{{B}_{p},p} \end{array}\right],

and

{\epsilon}_{p}=\left[\begin{array}{ccc} {\epsilon}_{1,p,1} & \cdots & {\epsilon}_{{B}_{p},p,1} \\ \vdots & \ddots & \vdots \\ {\epsilon}_{1,p,N} & \cdots & {\epsilon}_{{B}_{p},p,N} \end{array}\right].
We define the solution to this matrix equation as that which minimizes the Frobenius norm of the error matrix ε_{p}, defined below.

\left\Vert {\epsilon}_{p}\right\Vert_{F}=\sqrt{\sum_{j=1}^{N}\sum_{b=1}^{{B}_{p}}{\left({\epsilon}_{b,p,j}\right)}^{2}}
The solution which satisfies the above condition can be derived from the singular value decomposition (SVD) of X_{p}. Here, the SVD of X_{p} is given in the form X_{p} = USV^{T}, such that U\in {\Re}^{N\times N}, S\in {\Re}^{N\times {B}_{p}}, and V\in {\Re}^{{B}_{p}\times {B}_{p}}. If the largest singular value in S, s_{1}, is arranged as the first diagonal element of S, then θ_{p} is s_{1} times the first column of U, and {a}_{p}^{T} is the first column of V.
We chose to introduce the additional constraint that the geometric mean of the lumped probe affinity terms a_{b,p} equals one. The number one is arbitrary here, but it allows the convenient interpretation that the values of gene expression, θ_{p,j}, are on the same scale as the probe intensities, x_{b,p,j}. To satisfy this constraint, the earlier solutions for a_{p} and θ_{p} can simply be rescaled: a_{p} is divided by the geometric mean of its elements, and θ_{p} is multiplied by the same factor.
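The rank-1 SVD fit together with the geometric-mean rescaling can be sketched as follows; this is a minimal illustration and the function and variable names are ours.

```python
import numpy as np

def fit_probe_model(X):
    """Rank-1 fit X ~ theta * a via SVD, with geometric mean of a fixed to 1.

    X is (N chips) x (B_p probes). Returns theta (length N) and a
    (length B_p), following the notation of the model above. Sketch only.
    """
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    theta = s[0] * U[:, 0]   # s_1 times the first column of U
    a = Vt[0, :]             # first column of V (i.e. first row of V^T)
    # SVD signs are arbitrary; fix them so affinities are positive
    # for positive intensity data.
    if a.sum() < 0:
        theta, a = -theta, -a
    # Rescale so the geometric mean of a equals one; theta absorbs
    # the same factor, keeping the product theta * a unchanged.
    g = np.exp(np.mean(np.log(np.abs(a))))
    return theta * g, a / g
```

Because the rescaling multiplies θ_{p} and divides a_{p} by the same factor, the reconstruction θ_{p}a_{p} is unaffected; only the interpretation of the individual terms changes.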
Imputation of artifact values
With values of θ_{p} and a_{p} learned from SVD, the probe intensities which are expected from the model can be generated via the multiplication θ_{p}a_{p}. Incorporation of artifacts into this model is done by iterating the following 2-step Expectation-Maximization algorithm until θ_{p} and a_{p} converge:

1) Estimate model parameters θ_{p} and a_{p} from the SVD of the observed data X_{p}.

2) Replace known artifact values in X_{p} with the corresponding elements of θ_{p}a_{p}.
This procedure of replacing values in X_{p} with values from θ_{p}a_{p} has the effect of reducing the corresponding values of ε_{p} to zero, and thus has the property of never increasing the Frobenius norm of ε_{p}. Since step 1 is a global minimization of the Frobenius norm of ε_{p} given X_{p}, and step 2 alters X_{p} in a way that can only further decrease this error, the entire procedure is guaranteed to converge due to the non-increasing nature of the error function, which is naturally bounded by zero. Troyanskaya et al. have previously used a similar "SVDimpute" procedure for missing gene expression data [22]. Please see additional file 1, "Supplemental Methods", for further details.
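The two-step procedure can be sketched as below. As a minimal illustration, flagged entries are initialized from the column means of the clean values (our choice, not necessarily caCORRECT's) and then repeatedly replaced by the rank-1 reconstruction.

```python
import numpy as np

def svd_impute(X, mask, n_iter=20):
    """Iteratively replace artifact-flagged entries (mask True) with the
    rank-1 model reconstruction, per the two-step EM procedure above.

    X is (chips x probes) for one probe set. Illustrative sketch only.
    """
    X = X.astype(float).copy()
    # Initialize flagged entries with the column means of clean values.
    col_mean = np.nanmean(np.where(mask, np.nan, X), axis=0)
    X[mask] = np.broadcast_to(col_mean, X.shape)[mask]
    for _ in range(n_iter):
        # Step 1: estimate theta and a from the SVD of the current data.
        U, s, Vt = np.linalg.svd(X, full_matrices=False)
        recon = s[0] * np.outer(U[:, 0], Vt[0, :])
        # Step 2: replace flagged entries with the model reconstruction.
        X[mask] = recon[mask]
    return X
```

Clean entries are never modified, so the loop only moves the flagged entries toward the rank-1 surface, which is why the error norm cannot increase.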
Datasets and synthetic artifact generation
In order to quantify the ability of caCORRECT to improve data quality, we altered public microarray datasets with a variety of randomized synthetic artifacts, and then processed the altered data with caCORRECT or Harshlighting. Datasets were generated from a variety of clinical cancer studies using different microarray platforms. To date, the caCORRECT website has been tested with data from 18 different Affymetrix platforms, but the synthetic artifact results presented here are limited to our in-depth study of two key data sets. Three separate experiments were performed involving application of synthetic artifacts.
Third party artifacts on Hess et al. data
First, we obtained a large data set that was originally generated by Hess et al. [26] in the study of breast cancer, and later used as part of the MAQC phase-II study on classifier performance. This original data set consisted of 130 high-quality samples assayed on the Affymetrix HG-U133A platform. The Hess et al. study divided samples into training (n = 81) and validation (n = 49) sets, and we retain this distinction. Chips in the validation set were selected for synthetic artifact manipulation by an independent team led by Wendell Jones of Expression Analysis (previously unaffiliated with caCORRECT). Two types of artifacts were investigated here: (1) a "black hole" artifact, in which an elliptical region of the microarray had probe intensities lowered severely, and (2) a "hot spot" artifact, in which an elliptical region of the microarray had probe intensities raised severely. Twenty digitally altered copies of each of the original 49 chips were prepared as follows: a single artifact with random orientation and location was applied to each copy (ten copies received "black holes", and ten received "hot spots"). For each of the altered chips, gene expressions were calculated both before and after caCORRECT's complete artifact detection and value imputation. Expression data for all probe sets were determined using both MAS5 in Expression Console and the R implementation of RMA. Each of the estimated gene expressions from the altered chips was then compared to the "true" gene expression values obtained from the respective original, unaltered chip to yield an error value representing the deleterious effect of the artifact on gene expression estimation. The errors for each probe set (22,283), each chip (49), and each artifact replicate of a given type (10) were then pooled together to form distributions of errors of size n = 10,918,670.
Eight such distributions were created in total, representing the combinations of two gene expression methods, two artifact types, and either unprocessed data or data cleaned with caCORRECT.
Artifacts generated on data from Hess et al.
A common criticism of our synthetic artifact work is that the size and severity of tested artifacts (for example, those provided by Jones' team) are rarely observed in practice. While we have encountered many chips plagued by large, severe artifacts (see http://arraywiki.bme.gatech.edu/index.php/Hall_of_Fame for some examples [27]), we set out here to investigate how the magnitude and size of artifacts may affect our previous conclusions as to the usefulness of caCORRECT. Thus, we created a second set of synthetic artifact data specifically for this purpose by our own application of artifacts to the same breast cancer dataset from Hess et al. that Jones' team used. Only the 15 highest-quality arrays (selected via visual inspection of heat map images) from this dataset were used, both to speed up computation and to more precisely isolate the effects of our synthetic artifacts. For a variety of sizes and magnitudes, circular regions of the array were altered multiplicatively. Care was taken so that no more than 1/2 of the radius of the circular insult appeared off of the chip. The final gene expression obtained from the altered array was then compared to the gene expression obtained from the original unaltered array, and the average of the relative error for each probe set on the array was stored. For each pair of size and magnitude, this process was repeated a total of 3 times.
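Applying a multiplicative circular insult of a given size and magnitude can be sketched as follows; parameter names are ours and illustrative, not the exact generation code used in the study.

```python
import numpy as np

def apply_circular_artifact(chip, center, radius, factor):
    """Multiplicatively alter a circular region of a chip image.

    `chip` is a 2-D intensity array; `factor` < 1 darkens the region
    ("black hole" style), `factor` > 1 brightens it ("hot spot" style).
    Illustrative sketch of the synthetic-artifact procedure.
    """
    rows, cols = np.indices(chip.shape)
    inside = (rows - center[0]) ** 2 + (cols - center[1]) ** 2 <= radius ** 2
    out = chip.astype(float).copy()
    out[inside] *= factor
    return out
```

Sweeping `radius` and `factor` over a grid and repeating with random centers reproduces the size-versus-magnitude experiment described above.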
Artifacts observed and generated on data from Schuetz et al.
Finally, a realistic mixture of less-severe artifacts was applied to the Schuetz et al. [25] dataset in order to monitor the effect of typical artifacts on differential gene finding, and the ability of caCORRECT to ameliorate these effects. This data set consisted of 20 Renal Cell Carcinoma (RCC) samples assayed on the Affymetrix HG-Focus platform by Schuetz et al. Samples were classified by tumor subtype: Clear Cell (CC), Oncocytoma (ONC), and Chromophobe (CHR). For biomarker selection purposes, samples were combined into two classes: seven CHR or ONC versus thirteen CC.
Artifacts observed on data from West et al.
The dataset used to showcase real-world artifact removal and replacement was a set of 49 Hu6800 Affymetrix microarrays from the study by West et al., which investigated Estrogen Receptor and lymph node metastatic status [28]. This data set was chosen because it used an older version of Affymetrix chip, on which the properties of artifacts are more easily visualized than on modern chips.
PCR Validation
To determine the effect that caCORRECT had on the ability to correctly identify biomarkers of disease from microarray data, a panel of 96 genes of interest for RCC was assembled for PCR study in two phases. These genes were identified from a combination of genes previously reported in the literature and a set of genes whose biomarker status differed between the caCORRECT and non-caCORRECT versions of the Schuetz et al. data sets, as determined using omniBiomarker (http://omniBiomarker.bme.gatech.edu). All PCR analysis was performed on patient tissue samples independent of those used for the microarray analysis.
Gene expression was assessed by quantitative RT-PCR, using total RNA from fixed tissues of 17 clear cell and 7 chromophobe RCC patients. Duplicate experiments were performed according to published protocols with minor modifications: histological sections were deparaffinized with ethanol and xylene, and cells of interest were microdissected with a sterile scalpel. Tissues were digested in buffer containing proteinase K at 60°C overnight. RNA was extracted with phenol/chloroform, and genomic DNA was removed with DNase. RNA quality and quantity were assessed with a Bioanalyzer (Agilent Technologies). Up to 3 μg of RNA was used for first strand cDNA synthesis with Superscript III (Invitrogen). PCR was performed with a custom-designed Taqman Low Density Array (LDA, Applied Biosystems) in a 96-well microfluidic card format, using the ABI PRISM 7900HT Sequence Detection System (a high-throughput real-time PCR system). Gene expression data were normalized relative to the geometric mean of two housekeeping genes (18S, ACTB). LDA runs were analyzed using Relative Quantification (RQ) Manager (Applied Biosystems) software.
Relative normalized PCR gene expression was compared in renal tumor subtypes. Genes were declared as being "validated by PCR" if they had an average fold change between classes of magnitude greater than 2, corresponding to an average C_{t} (threshold cycle) value difference of 1.
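The correspondence between the two-fold-change threshold and a C_{t} difference of 1 follows from the exponential chemistry of PCR: at ideal efficiency the product doubles each cycle. A minimal sketch (sign conventions for ΔC_{t} vary between protocols; here a positive difference of 1 is taken to mean a two-fold higher expression):

```python
def fold_change(delta_ct):
    """Fold change implied by an average C_t (threshold cycle) difference,
    assuming ideal (2x per cycle) amplification efficiency.

    A |delta_ct| of 1 corresponds to a fold change of magnitude 2, which
    is the validation threshold used above.
    """
    return 2.0 ** delta_ct
```

Under this convention, the "validated by PCR" criterion (average fold change of magnitude greater than 2) is equivalent to requiring an average C_{t} difference of magnitude greater than 1.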