For all "caCORRECT" in this manuscript, version 2.1 of caCORRECT was used, as provided at http://cacorrect.bme.gatech.edu. The canonical bioconductor R-implementations of Harshlighting, RMA, MAS5.0, PLIER, and affycomp were also used. What follows in this section is a brief review of the previously published caCORRECT procedures which are pertinent to this study, as well as a description of the improvements which have been made since previous publication [12]. Much of the following text has been reproduced with modification from RAM's doctoral dissertation [23].
Variance scoring
The cornerstone of caCORRECT's outlier detection is the concept of variance scoring, which describes each probe's tendency to be an outlier. Calculation of this score, h, is similar to conducting a t-test for whether the observed probe intensity for a given chip belongs to the observed distribution of probe intensities for all other chips in the dataset. A key feature of caCORRECT is that this distribution is estimated and updated multiple times during the course of a single caCORRECT session. Because of this dynamic updating, it is possible to identify subtle artifacts, or to exonerate probes that were initially misdiagnosed as artifactual. Please refer to Additional file 1, "Supplemental Methods", for a more detailed description of how the variance score is calculated.
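To make the idea concrete, the sketch below computes a leave-one-out, t-like score for each probe on each chip. This is only a minimal illustration of the concept; the function name and the exact statistic are our own assumptions, and caCORRECT's actual formula (including its iterative re-estimation) is given in the supplement.

```python
import numpy as np

def variance_scores(intensities):
    """Illustrative leave-one-out, t-like variance score (not caCORRECT's
    exact formula; see the supplement for that).

    intensities: array of shape (n_chips, n_probes), normalized probe
    intensities. For each probe on each chip, measure how far that chip's
    value lies from the distribution of the same probe on all other chips.
    """
    n_chips, _ = intensities.shape
    h = np.empty_like(intensities, dtype=float)
    for j in range(n_chips):
        others = np.delete(intensities, j, axis=0)  # all chips except j
        mu = others.mean(axis=0)                    # per-probe mean
        sd = others.std(axis=0, ddof=1) + 1e-12     # per-probe std dev
        h[j] = np.abs(intensities[j] - mu) / sd     # t-like statistic
    return h
```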
Artifact segmentation
Once the variance statistic, h, has been calculated for each probe on each chip, false-color heat maps of h, showing probes in their original spatial layout, are generated to display regions of high noise. For a good-quality microarray chip, h represents biological variation in RNA expression for the sample; in this case, h is distributed independently and nearly uniformly in magnitude across the chip. More commonly, however, protocols do not achieve uniform hybridization, due to uneven drying, formation of salt streaks, scratching of the microarray surface by contact with skin or dust, miscalculated hybridization times, or failure to control environmental variables such as ozone [24]. All of these common mistakes result in localized regions of large h (artifacts) on the heat map.
caCORRECT uses a simple sliding-window method to flag probes that meet two conditions: (1) they lie in regions of other high h-scoring probes, and (2) they have high h scores themselves. Together, these conditions ensure that most obvious artifacts are caught while most naturally occurring biological variance goes unflagged. Because the intended platform for caCORRECT is a web-based grid service, artifact identification has been streamlined for speed and memory efficiency: more computationally intensive methods such as active contours, PDE-based methods, or shape matching have been set aside in favor of a fast marching-window algorithm that works well for a wide range of data. To remove any global chip effects that arise from sample preparation or amplification, normalization is performed as described in the following section.
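The two conditions can be expressed compactly with a sliding mean filter, as in the sketch below. The window size and both thresholds are illustrative choices of ours, not caCORRECT's published parameters.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def flag_artifacts(h_image, window=9, probe_thresh=3.0, region_thresh=2.0):
    """Flag probes that (1) sit in a neighborhood of high-h probes and
    (2) have a high h score themselves. Parameters here are assumed
    defaults for illustration only.

    h_image: 2-D array of variance scores in the chip's spatial layout.
    Returns a boolean mask of flagged probes.
    """
    h_image = np.asarray(h_image, dtype=float)
    # Mean h over a sliding window approximates the "noisy region" test.
    neighborhood_mean = uniform_filter(h_image, size=window)
    in_noisy_region = neighborhood_mean > region_thresh
    is_high_scoring = h_image > probe_thresh
    return in_noisy_region & is_high_scoring
```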
Artifact-aware normalization
Quantile normalization reduces noise among microarray experiment replicates by forcing the intensity distribution of each chip to be identical [Bolstad B: Probe level quantile normalization of high density oligonucleotide array data, 2001]. The critical assumption behind quantile normalization is that in large genome-wide studies such as microarrays, the genes that are invariant to the experimental variables far outnumber the biomarkers, i.e., the genes that respond to or predict the experimental variables. Quantile normalization is well suited to the microarray problem, where the distributions are poorly defined and parametric methods such as median centering or Z-score normalization have their underlying assumptions violated. The power of quantile normalization comes with a major caveat, however: if the probe intensities of the chips are not distributed similarly, the algorithm will indiscriminately warp all the distributions to be the same, including any that may have been correct initially. Fortunately, it is reasonable to assume that high-quality microarray data from a single source on a single platform follow the same distribution. Unfortunately, this high-quality assumption does not hold for much real-world data, where chip artifacts can significantly alter the distribution of intensities on a chip. One bad chip can warp the others when quantile normalization is performed, compromising the reproducibility of the entire dataset. A way to alleviate this problem is to identify artifacts before quantile normalization and set them aside temporarily; in theory, perfect knowledge of artifacts would allow for perfect correction. We call this process "artifact-aware quantile normalization." caCORRECT uses four iterations of artifact-aware normalization and artifact identification to reach a near steady-state normalization result with a relatively small amount of computation time.
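The sketch below illustrates the artifact-aware idea: flagged probes are excluded when building the target distribution, and each chip's clean values are then mapped onto that target by rank. The interpolation-based mapping and the function name are our own simplifications, not caCORRECT's implementation.

```python
import numpy as np

def artifact_aware_quantile_normalize(x, artifact_mask):
    """Quantile-normalize chips while ignoring flagged probes (a sketch).

    x: (n_chips, n_probes) intensities.
    artifact_mask: boolean array of the same shape, True where a probe was
    flagged as artifactual. Flagged values are left as-is here, since they
    are replaced later by the imputation step.
    """
    n_chips, n_probes = x.shape
    # Target distribution: average of each chip's clean quantile function,
    # evaluated on a common grid so chips with different numbers of clean
    # probes can be averaged.
    grid = np.linspace(0, 1, n_probes)
    curves = []
    for j in range(n_chips):
        clean = np.sort(x[j][~artifact_mask[j]])
        curves.append(np.interp(grid, np.linspace(0, 1, clean.size), clean))
    target = np.mean(curves, axis=0)

    out = x.astype(float).copy()
    for j in range(n_chips):
        idx = np.where(~artifact_mask[j])[0]
        ranks = np.argsort(np.argsort(x[j, idx]))         # 0..m-1 ranks
        positions = ranks / max(idx.size - 1, 1)          # map to [0, 1]
        out[j, idx] = np.interp(positions, grid, target)  # assign target
    return out
```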
To illustrate the invasive effect that artifacts can have on a dataset when quantile normalization is performed, synthetic microarray data were generated in the following manner: Six high-quality chips from the Schuetz et al. dataset [25] were chosen, one of which was set aside to receive artifacts. One third of the selected chip was modified by a multiplicative factor of 0.5, representing a low-intensity artifact. A different third of the selected chip was modified by a multiplicative factor of 10, representing a high-intensity artifact. These six chips were then processed using caCORRECT, and the probe intensities were monitored for warping at each intermediate step.
Artifact replacement and probe intensity model
Once artifacts have been identified, a clean dataset is generated with the artifactual data appropriately replaced. The current version of caCORRECT uses a data imputation model that is mathematically similar to the model used by RMA and others [5, 9, 10]. Notably, so-called perfect-match (PM) and mismatch (MM) probes are treated identically, i.e., we do not use the PM−MM difference. In this scheme, each artifact-flagged probe-level value is replaced with the best-fit estimate for that probe, given the model and the data from non-artifactual probes on all of the chips being processed.
Observed microarray intensity values after global normalization are modeled as a multiplicative combination of target RNA abundance (gene expression) and a probe-specific effect (probe affinity). The model is given as:

$$x_{b,p,j} = \theta_{p,j} \, a_{b,p} + \varepsilon_{b,p,j},$$

where $x_{b,p,j}$ is the observed intensity for the $b$th probe in the $p$th probe set on the $j$th chip, $\theta_{p,j}$ is the gene expression term, $a_{b,p}$ is the probe affinity term, and $\varepsilon_{b,p,j}$ is the additive error term.
The set of equations for the $p$th probe set, using an additive model of error, can be represented in the following matrix form, given a set of $N$ chips and $B_p$ probes in the $p$th probe set:

$$X_p = \theta_p \, a_p^T + \varepsilon_p,$$

where $X_p$ is the $N \times B_p$ matrix of observed intensities with entries $(X_p)_{j,b} = x_{b,p,j}$, $\theta_p = [\theta_{p,1}, \ldots, \theta_{p,N}]^T$ is the vector of gene expression values, and $a_p = [a_{1,p}, \ldots, a_{B_p,p}]^T$ is the vector of probe affinities.

We define the solution to this matrix equation as the pair $(\theta_p, a_p)$ that minimizes the Frobenius norm of the error matrix $\varepsilon_p$:

$$(\hat{\theta}_p, \hat{a}_p) = \arg\min_{\theta_p,\, a_p} \left\lVert X_p - \theta_p \, a_p^T \right\rVert_F.$$

The solution that satisfies this condition, i.e. the best rank-one approximation of $X_p$, can be derived from the singular value decomposition (SVD) of $X_p$. Here, the SVD of $X_p$ is given in the form $X_p = U S V^T$, such that $U^T U = I$, $V^T V = I$, and $S$ is diagonal. If the largest singular value in $S$, $s_1$, is arranged as the first diagonal element of $S$, then $\hat{\theta}_p$ is $s_1$ times the first column of $U$ and $\hat{a}_p$ is the first column of $V$.
We chose to introduce the additional constraint that the geometric mean of the lumped probe affinity terms $a_{b,p}$ equals one. The number one is arbitrary here, but it allows the convenient interpretation that the gene expression values, $\theta_{p,j}$, are on the same scale as the probe intensities, $x_{b,p,j}$. To satisfy this constraint, the earlier solution is simply rescaled: with $g = \left( \prod_{b=1}^{B_p} a_{b,p} \right)^{1/B_p}$, we set $a_p \leftarrow a_p / g$ and $\theta_p \leftarrow g \, \theta_p$, which leaves the product $\theta_p a_p^T$ unchanged.
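The following sketch implements this rank-one fit and rescaling with numpy. The function name is ours, and the sign-fix step is an assumption we add to handle the SVD's sign ambiguity on positive intensity data.

```python
import numpy as np

def fit_probe_set(X):
    """Rank-1 fit X ≈ theta a^T for one probe set, following the model above.

    X: (N, B_p) matrix of observed intensities (chips x probes). Returns
    theta (per-chip expression) and a (per-probe affinity), scaled so the
    geometric mean of the affinities is one. A sketch of the math only.
    """
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    theta = s[0] * U[:, 0]          # s1 times first left singular vector
    a = Vt[0]                       # first right singular vector
    if a.sum() < 0:                 # resolve SVD sign ambiguity; for
        theta, a = -theta, -a       # positive X the entries come out > 0
    g = np.exp(np.mean(np.log(a)))  # geometric mean of the affinities
    return theta * g, a / g         # rescale so geom. mean of a equals one
```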
Imputation of artifact values
With values of $\theta_p$ and $a_p$ learned from the SVD, the probe intensities expected from the model can be generated via the outer product $\theta_p a_p^T$. Artifacts are incorporated into this model by running the following two-step Expectation-Maximization algorithm until $\theta_p$ and $a_p$ converge:
1) Estimate the model parameters $\theta_p$ and $a_p$ from the SVD of the observed data $X_p$.

2) Replace known artifact values in $X_p$ with the corresponding elements of $\theta_p a_p^T$.
This procedure of replacing values in $X_p$ with values from $\theta_p a_p^T$ has the effect of reducing the corresponding values of $\varepsilon_p$ to zero, and thus never increases the Frobenius norm of $\varepsilon_p$. Since step 1 is a global minimization of the Frobenius norm of $\varepsilon_p$ given $X_p$, and step 2 alters $X_p$ in a way that can only further decrease this error, the entire procedure is guaranteed to converge: the error function is non-increasing and naturally bounded below by zero. Troyanskaya et al. have previously used a similar "SVDimpute" procedure for missing gene expression data [22]. Please see Additional file 1, "Supplemental Methods", for further details.
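A compact sketch of the two-step loop is given below. The iteration cap and convergence tolerance are illustrative choices of ours, not caCORRECT's published settings.

```python
import numpy as np

def impute_artifacts(X, mask, n_iter=20, tol=1e-6):
    """EM-style imputation of artifact-flagged values, per the two steps above.

    X: (N, B_p) intensity matrix for one probe set.
    mask: boolean matrix, True where a value was flagged as artifactual.
    Alternates (1) a rank-1 SVD fit of the current matrix and (2) replacement
    of flagged entries with the model reconstruction, until the Frobenius
    error stops shrinking.
    """
    X = X.astype(float).copy()
    prev_err = np.inf
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(X, full_matrices=False)
        model = s[0] * np.outer(U[:, 0], Vt[0])  # theta a^T reconstruction
        X[mask] = model[mask]                    # fill artifact entries
        err = np.linalg.norm(X - model, 'fro')   # non-increasing error
        if prev_err - err < tol:
            break
        prev_err = err
    return X
```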
Datasets and synthetic artifact generation
In order to quantify the ability of caCORRECT to improve data quality, we altered public microarray datasets with a variety of randomized synthetic artifacts and then processed the altered data with caCORRECT or Harshlighting. Datasets were drawn from a variety of clinical cancer studies using different microarray platforms. To date, the caCORRECT website has been tested with data from 18 different Affymetrix platforms, but the synthetic artifact results presented here are limited to our in-depth study of two key datasets. Three separate experiments were performed involving the application of synthetic artifacts.
Third party artifacts on Hess et al. data
First, we obtained a large dataset originally generated by Hess et al. [26] in the study of breast cancer and later used as part of the MAQC phase-II study on classifier performance. This dataset consisted of 130 high-quality samples assayed on the Affymetrix HG-U133A platform. The Hess et al. study divided samples into training (n = 81) and validation (n = 49) sets, and we retain this distinction. Chips in the validation set were selected for synthetic artifact manipulation by an independent team led by Wendell Jones of Expression Analysis (previously unaffiliated with caCORRECT). Two types of artifacts were investigated: (1) a "black hole" artifact, in which an elliptical region of the microarray had its probe intensities lowered severely, and (2) a "hot spot" artifact, in which an elliptical region of the microarray had its probe intensities raised severely. Twenty digitally altered copies of each of the original 49 chips were prepared as follows: a single artifact with random orientation and location was applied to each copy (ten copies received "black holes" and ten received "hot spots"). For each of the altered chips, gene expression values were calculated both before and after caCORRECT's complete artifact detection and value imputation, using both MAS5 in Expression Console and the R implementation of RMA. Each estimated gene expression value from an altered chip was then compared to the "true" value obtained from the respective original, unaltered chip, to yield an error representing the deleterious effect of the artifact on gene expression estimation. The errors for each probe set (22,283), each chip (49), and each artifact replicate of a given type (10) were pooled to form error distributions of size n = 10,918,670. Eight such distributions were created in total, representing the combinations of two gene expression methods, two artifact types, and either unprocessed data or data cleaned with caCORRECT.
Artifacts generated on data from Hess et al
A common criticism of our synthetic artifact work is that the size and severity of the tested artifacts (for example, those provided by Jones' team) are rarely observed in practice. While we have encountered many chips plagued by large, severe artifacts (see http://arraywiki.bme.gatech.edu/index.php/Hall_of_Fame for some examples [27]), we set out here to investigate how the magnitude and size of artifacts may affect our previous conclusions about the usefulness of caCORRECT. Thus, we created a second set of synthetic artifact data by applying artifacts ourselves to the same breast cancer dataset from Hess et al. that Jones' team used. Only the 15 highest-quality arrays (by visual inspection of heat map images) from this dataset were used, both to speed up computation and to more precisely isolate the effects of our synthetic artifacts. For a variety of sizes and magnitudes, circular regions of the array were altered multiplicatively; care was taken so that no more than half of the radius of the circular insult extended off the chip. The final gene expression obtained from each altered array was then compared to the gene expression obtained from the original, unaltered array, and the average relative error across all probe sets on the array was stored. For each pair of size and magnitude, this process was repeated a total of 3 times.
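A sketch of this artifact-generation step is shown below. The function name is ours, and the placement rule encodes one reading of the off-chip constraint: the circle may overhang each edge by at most half its radius, so the center stays at least half a radius inside the chip (this assumes the radius is smaller than the chip dimensions).

```python
import numpy as np

def apply_circular_artifact(chip, radius, magnitude, rng=None):
    """Multiply a random circular region of a chip image by `magnitude`.

    chip: 2-D array of probe intensities in spatial layout.
    radius, magnitude: artifact size (in probes) and multiplicative factor.
    Returns an altered copy; the original array is left untouched.
    """
    rng = np.random.default_rng() if rng is None else rng
    rows, cols = chip.shape
    # Keep the center at least radius/2 from every edge, so no more than
    # half the radius can extend off the chip (assumed interpretation).
    cy = rng.uniform(radius / 2, rows - 1 - radius / 2)
    cx = rng.uniform(radius / 2, cols - 1 - radius / 2)
    yy, xx = np.ogrid[:rows, :cols]
    inside = (yy - cy) ** 2 + (xx - cx) ** 2 <= radius ** 2
    out = chip.astype(float).copy()
    out[inside] *= magnitude
    return out
```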
Artifacts observed and generated on data from Schuetz et al
Finally, a realistic mixture of less severe artifacts was applied to the Schuetz et al. [25] dataset in order to monitor the effect of typical artifacts on differential gene finding, and the ability of caCORRECT to ameliorate these effects. This dataset consisted of 20 Renal Cell Carcinoma (RCC) samples assayed on the Affymetrix HG-Focus platform by Schuetz et al. Samples were classified by tumor subtype: Clear Cell (CC), Oncocytoma (ONC), and Chromophobe (CHR). For biomarker selection purposes, samples were combined into two classes: seven CHR or ONC versus thirteen CC.
Artifacts observed on data from West et al
The dataset used to showcase real-world artifact removal and replacement was a set of 49 Hu-6800 Affymetrix microarrays from the study by West et al., which investigated estrogen receptor and lymph node metastatic status [28]. This dataset was chosen because it used an older generation of Affymetrix chip, on which the properties of artifacts are more easily visualized than on modern chips.
PCR Validation
To determine the effect of caCORRECT on the ability to correctly identify biomarkers of disease from microarray data, a panel of 96 genes of interest for RCC was assembled for PCR study in two phases. These genes were identified from a combination of genes previously reported in the literature and a set of genes whose biomarker status differed between the caCORRECT and non-caCORRECT versions of the Schuetz et al. datasets, as determined using omniBiomarker (http://omniBiomarker.bme.gatech.edu). All PCR analysis was performed on patient tissue samples independent of those used for the microarray analysis.
Gene expression was assessed by quantitative RT-PCR, using total RNA from fixed tissues of 17 clear cell and 7 chromophobe RCC patients. Duplicate experiments were performed according to published protocols with minor modifications: Histological sections were deparaffinized with ethanol and xylene, and cells of interest were microdissected with a sterile scalpel. Tissues were digested in buffer containing proteinase K at 60°C overnight. RNA was extracted with phenol/chloroform, and genomic DNA was removed with DNase. RNA quality and quantity were assessed with a Bioanalyzer (Agilent Technologies). Up to 3 μg of RNA was used for first strand cDNA synthesis with Superscript III (Invitrogen). PCR was performed with a custom-designed Taqman Low Density Array (LDA, Applied Biosystems) in a 96-well microfluidic card format, using the ABI PRISM 7900HT Sequence Detection System (high-throughput real-time PCR system). Gene expression data were normalized relative to the geometric mean of two housekeeping genes (18S, ACTB). LDA runs were analyzed by using Relative Quantification (RQ) Manager (Applied Biosystems) software.
Relative normalized PCR gene expression was compared between renal tumor subtypes. Genes were declared "validated by PCR" if they had an average fold change between classes of magnitude greater than 2, corresponding to an average Ct (threshold cycle) difference of 1.
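For reference, the equivalence between the two thresholds follows from the near-doubling of PCR product per cycle assumed by relative quantification:

$$\text{FC} = 2^{-\Delta\Delta C_t}, \qquad \lvert \log_2 \text{FC} \rvert > 1 \iff \lvert \Delta\Delta C_t \rvert > 1,$$

where $\Delta\Delta C_t$ is the difference between the two classes' average housekeeping-normalized $C_t$ values; a fold change of magnitude 2 in either direction thus corresponds to a $C_t$ difference of one cycle.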